
Feature extraction is one of the crucial steps in text categorization. Many algorithms and methodologies are used for extracting and weighting each feature/term. In this part of the thesis, we discuss in detail the workings of the algorithms used for extracting features. We mainly used the Bag of Words and Term Frequency-Inverse Document Frequency (tf-idf) approaches for selecting features/terms. For instance, Table 7 presents a corpus containing two documents; we will use this corpus to illustrate how the algorithms mentioned above work.

Table 7: Sample corpus containing two documents


Document 1: BMW is working on Automotive industry
Document 2: Bosch is working on Electronics industry


1.1.1. Bag of words

Bag of words, also called term frequency count, is the most straightforward algorithm for vectorization. In the Bag of words approach, the words in the corpus are split into tokens; the algorithm then loops over each document in the corpus and obtains the frequency count of each word. Finally, the algorithm builds a matrix from those frequencies. There are two different types of matrix used for counting the frequency of each term. One is known as a term-document matrix, in which terms appear as rows and documents as columns; the other is known as a document-term matrix, which presents terms as columns and documents as rows. In this thesis, we used the document-term matrix for inspecting and manipulating the terms.
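To make the procedure concrete, the following Python sketch builds a document-term matrix for the sample corpus. It is illustrative only, not the implementation used in this thesis, and it assumes the pre-processed tokens shown later in Table 8 (after stop-word removal and stemming).

```python
from collections import Counter

# Tokens after the pre-processing stages (stop-word removal and
# stemming), hard-coded here for illustration.
docs = {
    "Doc1": ["bmw", "work", "automotive", "industry"],
    "Doc2": ["bosch", "work", "electronic", "industry"],
}

# The set of distinct terms in the corpus forms the matrix columns.
vocabulary = sorted({term for tokens in docs.values() for term in tokens})

# Document-term matrix: one row per document, one column per term.
dtm = {}
for doc_id, tokens in docs.items():
    counts = Counter(tokens)
    dtm[doc_id] = [counts[term] for term in vocabulary]

print(vocabulary)       # ['automotive', 'bmw', 'bosch', 'electronic', 'industry', 'work']
for doc_id, row in dtm.items():
    print(doc_id, row)  # Doc1: [1, 1, 0, 0, 1, 1], Doc2: [0, 0, 1, 1, 1, 1]
```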

Example:

Table 8 shows the document-term matrix for the corpus given in Table 7. It contains the word counts of the two documents: the first column contains the document number, and each of the remaining columns contains the frequency with which the corresponding term appears in that document. In this table, ‘1’ represents a single occurrence of the term in the document and ‘0’ represents its absence; the count would be higher if a term occurred more than once. Table 8 was built after the text pre-processing stages, i.e. stop-word removal and stemming.

Table 8: Sample bag of words vectors using the document-term matrix

Docs | bmw | bosch | work | automotive | electronic | industry
Doc1 |   1 |     0 |    1 |          1 |          0 |        1
Doc2 |   0 |     1 |    1 |          0 |          1 |        1


If we check Table 8, we see that document 1 consists of the terms ‘bmw’, ‘work’, ‘automotive’ and ‘industry’, while document 2 consists of ‘bosch’, ‘work’, ‘electronic’ and ‘industry’. The Bag of words algorithm takes this pre-processed data as input and tokenizes the documents into words, and these words are then used as attributes for the classification algorithms. Finally, the document-term matrix is computed by iterating over every document and counting the term frequencies accordingly. A sample preview of the training documents’ most significant terms with their document frequencies is shown in Figure 19.
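For comparison, the same matrix can be produced with an off-the-shelf library. The sketch below assumes scikit-learn’s CountVectorizer, which is not necessarily the tool used in this thesis; it applies stop-word removal but no stemming, so its columns (e.g. ‘electronics’, ‘working’) differ slightly from Table 8.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "BMW is working on Automotive industry",     # Document 1
    "Bosch is working on Electronics industry",  # Document 2
]

# CountVectorizer tokenizes the documents and counts term
# frequencies in one step; English stop words ("is", "on") are
# removed, but no stemming is applied here.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(dtm.toarray())  # one row per document, one column per term
```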

Figure 19: Screenshot of the top 20 terms with their frequencies

The term ‘energi’ has a document frequency of 724, because it occurs 724 times in multiple documents. There is a total of 24,000 terms obtained after building bag of words approach. After the collection of the most significant terms along with their document frequencies, the next task is the dimensionality reduction, which is the most tedious task of text categorization.