A New Hybrid Model of K-Means and Naïve Bayes Algorithms for Feature Selection in Text Documents Categorization



Department of Computer Engineering, Urmia Branch, Islamic Azad University, Urmia, Iran


As information and documents on the Web grow at an increasing rate, the need to organize them into categories and clusters is keenly felt. Clustering seeks to find related structures in datasets that have not yet been categorized. To address this need, this paper presents a new approach to text document categorization consisting of three phases: document pre-processing and feature selection, K-Means clustering, and Naïve Bayes (NB) optimization. The proposed model combines the K-Means and NB algorithms: K-Means finds the minimum distances between features and cluster centers, while NB computes the probability of each feature in the documents, and each result is used separately to cluster the features. The proposed model improves the performance of the K-Means algorithm by exploiting NB properties during clustering. The model thereby overcomes the challenge of labeling different documents and the limitation inherent in K-Means, namely that it categorizes text documents as an unsupervised model. Finally, the experimental results of the proposed algorithm and the K-Means algorithm are evaluated using evaluation methods and compared on validated datasets.
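The three phases named above can be illustrated with a minimal sketch. It is one possible reading of the abstract, not the authors' implementation: the tokenizer, the first-k center initialization, the Laplace smoothing, and the choice to treat K-Means assignments as pseudo-labels for a multinomial NB pass are all assumptions, and every function name is hypothetical.

```python
import math


def tokenize(text):
    # Phase 1 (assumed): lowercase, split on whitespace, drop very short tokens.
    return [w for w in text.lower().split() if len(w) > 2]


def vectorize(docs):
    # Build term-frequency vectors over the vocabulary of all documents.
    tokens = [tokenize(d) for d in docs]
    vocab = sorted({w for t in tokens for w in t})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for t in tokens:
        v = [0.0] * len(vocab)
        for w in t:
            v[index[w]] += 1.0
        vectors.append(v)
    return vocab, vectors


def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    # Phase 2: assign each vector to its nearest center, then recompute centers.
    # Assumption: the first k vectors serve as initial centers (deterministic).
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(v, centers[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def nb_refine(vectors, labels, k):
    # Phase 3 (assumed): treat cluster ids as pseudo-labels, fit a multinomial
    # NB with Laplace smoothing, and reassign each document to the cluster
    # with the highest posterior log-probability.
    n_feat = len(vectors[0])
    prior = [(labels.count(c) + 1) / (len(labels) + k) for c in range(k)]
    counts = [[1.0] * n_feat for _ in range(k)]
    for v, lab in zip(vectors, labels):
        for j, x in enumerate(v):
            counts[lab][j] += x
    log_like = [[math.log(x / sum(row)) for x in row] for row in counts]
    refined = []
    for v in vectors:
        scores = [math.log(prior[c])
                  + sum(x * log_like[c][j] for j, x in enumerate(v))
                  for c in range(k)]
        refined.append(max(range(k), key=lambda c: scores[c]))
    return refined


docs = [
    "cat dog pet animal",
    "stock market trade price",
    "dog pet cat",
    "market price stock",
]
vocab, vectors = vectorize(docs)
labels = kmeans(vectors, k=2)
refined = nb_refine(vectors, labels, k=2)
print(labels, refined)
```

On this toy corpus the two animal documents and the two finance documents should end up in separate clusters, both before and after the NB pass. A real corpus would additionally need stop-word removal, feature weighting, and a more careful center initialization.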