Singhal, Harsh. A new framework of optimizing keyword weights in text categorization and record querying. Retrieved from https://doi.org/doi:10.7282/T3JD4X4P
Description: In text mining research, the Vector Space Model (VSM) is commonly used to represent each text document as a vector in which each component is associated with a particular word in the documents. Assigning appropriate keyword weights in the VSM is critical in Information Retrieval (IR) and Text Categorization (TC).
Traditionally, keyword weighting is unsupervised; that is, knowledge of each document's category is not used when assigning the weights. Typically, each keyword weight is assigned using the term frequency-inverse document frequency (TFIDF) measure. Although the TFIDF measure has proven effective in several text mining problems, it might not give the optimal classification power for IR and TC. In this thesis, we propose a new optimization framework that finds the best keyword weights based on a proposed concept of inter-class and intra-class similarity.
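For illustration, the unsupervised TFIDF baseline mentioned above can be sketched as follows. This is a minimal sketch, not the thesis's implementation: it assumes the common variant tf × log(N/df), and the function name `tfidf_vectors` and the toy records are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents.

    Assumes the variant weight = tf * log(N / df); other variants exist.
    """
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) for w in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)  # raw term counts within this document
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors

# Hypothetical toy records, invented only for this example.
docs = [
    "call dropped on network".split(),
    "network outage reported".split(),
    "billing error on invoice".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

Note that the category of each document plays no role in these weights, which is the limitation the proposed framework addresses.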
The optimal keyword weights can be viewed as a feature space projection in which documents from the same category are best clustered together and separated from other categories. Subsequently, category average (centroid) classification is employed to categorize text documents. The proposed approach is tested on two practical applications: record query and text categorization. The record query application differs slightly from traditional IR problems in that the goal is to find correlated (duplicate and master) text records. This problem was posed by a telecommunication company whose service engineers look for associations between the current defect problem and previously recorded problems in the database. Extensive experiments demonstrate that, compared with the standard TFIDF search, the proposed framework significantly improves classification accuracy and provides balanced performance across all text categories. The text categorization application is tested on the Reuters news data set, a gold-standard benchmark data set. The results show that our framework improves performance for both applications considered, namely Information Retrieval and Text Categorization.
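The category-centroid classification step mentioned above can be sketched as follows. This is a hedged illustration only: the abstract does not state the similarity measure or how the weights are optimized, so cosine similarity is an assumption, the weight vector is a uniform placeholder rather than an optimized one, and all names and toy data are hypothetical.

```python
import math

def cosine(u, v, w):
    """Cosine similarity after scaling each component by its keyword weight."""
    dot = sum(wi * wi * a * b for wi, a, b in zip(w, u, v))
    nu = math.sqrt(sum((wi * a) ** 2 for wi, a in zip(w, u)))
    nv = math.sqrt(sum((wi * b) ** 2 for wi, b in zip(w, v)))
    return dot / (nu * nv) if nu and nv else 0.0

def centroids(vectors, labels):
    """Average the training vectors of each category, component by component."""
    groups = {}
    for vec, lab in zip(vectors, labels):
        groups.setdefault(lab, []).append(vec)
    return {lab: [sum(col) / len(vs) for col in zip(*vs)]
            for lab, vs in groups.items()}

def classify(vec, cents, w):
    """Assign the category whose centroid is most similar to vec."""
    return max(cents, key=lambda lab: cosine(vec, cents[lab], w))

# Toy term-count vectors over a hypothetical vocabulary:
# ["network", "outage", "billing", "invoice"]
train = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
labels = ["network", "network", "billing", "billing"]
weights = [1.0, 1.0, 1.0, 1.0]  # placeholder; the thesis optimizes these

cents = centroids(train, labels)
print(classify([1, 1, 0, 0], cents, weights))  # prints "network"
```

In the proposed framework, the uniform `weights` placeholder would be replaced by weights chosen so that documents of the same category are close together and far from other categories.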