Simple bag of words

What: Using bag of words to categorize text
Why: Build your own chatbot or classify documents
How: Using scikit-learn and pandas

Introduction

Bag of words is a simple classification approach which looks at the occurrence of (key) words in different classes of documents (the bags). A document to be classified is assigned to the class whose bag gives the best match between the words of the document and the words in that bag.
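
To see what such a bag looks like, scikit-learn's CountVectorizer turns each text into a vector of word counts. A minimal sketch with two made-up sentences:

from sklearn.feature_extraction.text import CountVectorizer

texts = ["hello how are you", "goodbye see you"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)
print(vectorizer.vocabulary_)  # maps each word to a column index
print(counts.toarray())        # one row of word counts per document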

scikit-learn is a Python machine learning library with a very nice concept for handling data from preprocessing to model building: pipelines.

pandas is a Python library which helps to store data in table-like objects. It makes handling data within Python much easier.
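
A minimal sketch of such a table-like object; the column names Text and Class are the ones expected by the ModelBuilder below:

import pandas

data = pandas.DataFrame({
    'Text': ['Hello', 'Bye'],
    'Class': ['greeting', 'farewell'],
})
print(data)  # two columns, one row per example document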

The following is inspired by the scikit-learn documentation.

Code

For bag of words, a text has to be tokenized, the words have to be stemmed and a classification model has to be built. nltk is used for the text processing. The SnowballStemmer used here is also able to handle German, as long as the German module is downloaded. If you don’t mind the space, you can download all nltk data with:

sudo python3 -m nltk.downloader all
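
The stemmer reduces inflected word forms to a common stem, so that they count as the same token, for example:

import nltk

stemmer = nltk.stem.SnowballStemmer('english')
print(stemmer.stem('running'))  # prints 'run'
print(stemmer.stem('runs'))     # prints 'run'

The ModelBuilder class ties tokenization, stemming and model selection together:
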
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
import pandas
import logging
import warnings
# Suppress some sklearn warnings
warnings.filterwarnings("ignore", category=FutureWarning)

class ModelBuilder():
    __stemmer=None
    
    def __init__(self, language):
        self.__stemmer=nltk.stem.SnowballStemmer(language)
        logging.basicConfig(format='%(asctime)-15s %(message)s', level=logging.INFO)
        self.__logger=logging.getLogger('ModelBuilder')
    
    # Taken from: https://stackoverflow.com/q/26126442
    def __stem_tokens(self, tokens, stemmer):
        stemmed = []
        for item in tokens:
            stemmed.append(stemmer.stem(item))
        return stemmed

    # Taken from: https://stackoverflow.com/q/26126442
    def __tokenize(self, text):
        tokens = nltk.tokenize.WordPunctTokenizer().tokenize(text)
        stems = self.__stem_tokens(tokens, self.__stemmer)
        return stems
    
    def buildModel(self, data):
        # Taken from: http://scikit-learn.org/stable/auto_examples/model_selection/
        # grid_search_text_feature_extraction.html#
        # sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py
        pipeline = Pipeline([
            # Use the stemming tokenizer defined above
            ('vect', CountVectorizer(tokenizer=self.__tokenize)),
            ('tfidf', TfidfTransformer()),
            ('clf', SGDClassifier()),
        ])

        parameters = {
            'vect__max_df': (0.5, 0.75, 1.0),
            'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
            'clf__alpha': (0.00001, 0.000001),
            'clf__penalty': ('l2', 'elasticnet'),
            'clf__max_iter': (1000,),
            'clf__tol': (1e-3,),
            # Modified huber allows getting probabilities for the classes out. 
            # See predict_proba for details
            'clf__loss':('modified_huber',)
        }

        # find the best parameters for both the feature extraction and the
        # classifier
        grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=False)

        grid_search.fit(data.Text, data.Class)
        self.__logger.info("Best score: %0.3f" % grid_search.best_score_)
        self.__logger.info("Best parameters set:")
        best_parameters = grid_search.best_estimator_.get_params()
        for param_name in sorted(parameters.keys()):
            self.__logger.info("\t%s: %r" % (param_name, best_parameters[param_name]))

        return grid_search.best_estimator_

The code can be tested via the following snippet, which can be embedded as a self-test in the same script in which the ModelBuilder class is defined.

import unittest
 
class TestModelBuilder(unittest.TestCase):
    
    def setUp(self):
        self.__out=ModelBuilder('english')
        
        self.__testdata=pandas.DataFrame(columns=['Text', 'Class'])
        # Several examples per class, so that the cross validation in the
        # grid search can split each class
        self.__testdata.loc[self.__testdata.shape[0]]=["Hello", "greeting"]
        self.__testdata.loc[self.__testdata.shape[0]]=["Hi", "greeting"]
        self.__testdata.loc[self.__testdata.shape[0]]=["How are you", "greeting"]
        self.__testdata.loc[self.__testdata.shape[0]]=["Good morning", "greeting"]
        self.__testdata.loc[self.__testdata.shape[0]]=["Hey there", "greeting"]
        self.__testdata.loc[self.__testdata.shape[0]]=["Bye", "farewell"]
        self.__testdata.loc[self.__testdata.shape[0]]=["Goodbye", "farewell"]
        self.__testdata.loc[self.__testdata.shape[0]]=["See you", "farewell"]
        self.__testdata.loc[self.__testdata.shape[0]]=["See you later", "farewell"]
        self.__testdata.loc[self.__testdata.shape[0]]=["Farewell", "farewell"]
 
    def test_buildModel(self):
        classifier=self.__out.buildModel(self.__testdata)
        # predict returns an array with one class per input document
        self.assertEqual('farewell', classifier.predict(["See you"])[0])
        self.assertEqual('greeting', classifier.predict(["Hello"])[0])
 
suite = unittest.TestLoader().loadTestsFromTestCase(TestModelBuilder)
unittest.TextTestRunner().run(suite)

Instead of 'english' you can also use 'german' as the language, but then you need different test data. Please note that this is a simple example; for a real-world use case you need more categories and more examples per category.

Instead of a single class, the classifier can also output probabilities for all classes, which may help to judge the quality of a classification for data that was not included in the training data.

Usage

Create your training data or read it from a file into a pandas data frame and build the model:

classifier=ModelBuilder(<language>).buildModel(<data>)
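
For example, the data could be read from a CSV file with one text and its class per row (a sketch; the file name training_data.csv and the column names Text and Class are assumptions):

import pandas

data = pandas.read_csv('training_data.csv')
classifier = ModelBuilder('english').buildModel(data)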

Once this is done, use it to classify unknown documents:

documentclass=classifier.predict([<text>])
documentclassProbabilities=classifier.predict_proba([<text>])
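
The probabilities returned by predict_proba are ordered like classifier.classes_, so they can be matched to the class labels. A minimal sketch:

import numpy

probabilities = classifier.predict_proba(["See you"])[0]
for label, probability in zip(classifier.classes_, probabilities):
    print(label, probability)
# The class with the highest probability is the predicted class
print(classifier.classes_[numpy.argmax(probabilities)])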