Documents Subject Identification and Clustering based on Subject
Keywords:
Data mining; Text mining; Subject identification; Clustering; PDF parser
Abstract
With the dramatic growth of textual information on the Internet and in databases, there is an increasing need
for systems that can automatically discover useful knowledge from text. Text mining is the process of applying
automatic methods to analyze and structure textual data in order to create usable knowledge from previously
unstructured information. Standard text mining techniques for text documents usually rely on word matching. This paper
describes how to recognize the subject of each document in a directory and categorize the document into the related
subject directory. mPDF and PDF parser are the PHP libraries used in this work for recognizing the subject. Document
clustering is a technique used to group similar documents. This work proposes a tool for maintaining a large set of
PDF documents, a task that arises in many applications.
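
The word-matching idea behind subject identification and subject-based grouping can be illustrated with a minimal sketch. It is written in Python for brevity, not with the PHP libraries named in the abstract, and it is not the authors' implementation. It assumes the text of each PDF has already been extracted. The keyword lists and the names SUBJECT_KEYWORDS, identify_subject, and cluster_by_subject are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical subject vocabularies (assumption: the paper does not list its keywords).
SUBJECT_KEYWORDS = {
    "databases": {"sql", "query", "index", "transaction"},
    "networking": {"tcp", "router", "packet", "protocol"},
}


def identify_subject(text, subject_keywords=SUBJECT_KEYWORDS):
    """Return the subject whose keywords occur most often in text, or 'unknown'."""
    # Count every lowercase alphabetic word in the document.
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    # Score each subject by the total number of keyword occurrences.
    scores = {
        subject: sum(words[keyword] for keyword in keywords)
        for subject, keywords in subject_keywords.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def cluster_by_subject(documents):
    """Group document names by their identified subject.

    documents maps a file name to its already-extracted text.
    """
    clusters = defaultdict(list)
    for name, text in documents.items():
        clusters[identify_subject(text)].append(name)
    return dict(clusters)
```

In a tool like the one described, each key of the returned mapping would correspond to a subject directory, and the listed files would be moved or copied into it. Documents that match no keyword fall into an "unknown" group in this sketch.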