Simple bag of words

What: Using bag of words to categorize text
Why: Build your own chatbot or classify documents
How: Using scikit-learn and pandas

Introduction

Bag of words is a simple classification approach that looks at the occurrence of (key) words in different classes of documents (the bags). A document to be classified is assigned to the class whose bag best matches the words in the document.

scikit-learn is a Python machine learning library with a very nice concept for handling data from preprocessing to model building: Pipelines.

pandas is a Python library that stores data in table-like objects. It makes handling data within Python much easier.

The following is inspired by the scikit-learn documentation.

Code

For bag of words, a text has to be tokenized, the words have to be stemmed, and a classifier has to be built. nltk is used for text processing. The SnowballStemmer used here can also handle German, as long as the German module is downloaded. If you don't mind the disk space, you can download all nltk data with:
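For example, via nltk's command-line downloader (pass a single package id such as `punkt` instead of `all` if you want to save space; `all` fetches several gigabytes):

```shell
python -m nltk.downloader all
```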

The code can be tested via the following snippet, which can be embedded as a self test in the same script where the ModelBuilder class is defined.
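The original ModelBuilder implementation is not reproduced here, so the following is a minimal sketch of what such a class and its embedded self test might look like. The internals, the naive Bayes classifier, and the example data are assumptions; the tokenizer uses a simple regular expression instead of nltk.word_tokenize so it runs without any downloaded nltk data.

```python
import re

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class ModelBuilder:
    """Builds a bag-of-words text classification pipeline (sketch)."""

    def __init__(self, language="english"):
        # SnowballStemmer also supports "german".
        self.stemmer = SnowballStemmer(language)

    def tokenize(self, text):
        # Split into word tokens and reduce each one to its stem.
        return [self.stemmer.stem(token)
                for token in re.findall(r"\w+", text.lower())]

    def build(self, documents, categories):
        # Pipeline: bag-of-words counts -> naive Bayes classifier.
        model = Pipeline([
            ("vectorizer", CountVectorizer(tokenizer=self.tokenize)),
            ("classifier", MultinomialNB()),
        ])
        model.fit(documents, categories)
        return model


if __name__ == "__main__":
    # Tiny self test with two made-up categories.
    documents = [
        "the hotel room was clean and quiet",
        "friendly staff at the hotel reception",
        "the restaurant served excellent pasta",
        "great food and a fine wine list",
    ]
    categories = ["hotel", "hotel", "restaurant", "restaurant"]
    model = ModelBuilder("english").build(documents, categories)
    assert model.predict(["great pasta in the restaurant"])[0] == "restaurant"
    assert model.predict(["a clean and quiet hotel room"])[0] == "hotel"
    print("self test passed")
```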

Instead of 'english', you can also use 'german' as the language, but then you need different test data. Please note that this is a simple example; for a real-world use case you need more categories and examples.

Instead of a single class, the classifier can also output the probabilities for each class, which may help to judge the quality of a classification, especially for data that was not included in the model's training data.
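A sketch of this, again with a hypothetical pipeline and made-up data (scikit-learn's predict_proba returns one probability per class, ordered like the classes_ attribute):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical two-category training data.
documents = [
    "the hotel room was clean and quiet",
    "friendly staff at the hotel reception",
    "the restaurant served excellent pasta",
    "great food and a fine wine list",
]
categories = ["hotel", "hotel", "restaurant", "restaurant"]

model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
model.fit(documents, categories)

# One probability per class; a near-uniform distribution suggests the
# document fits none of the trained categories well.
probabilities = model.predict_proba(["the reception desk was friendly"])
print(model.classes_)
print(probabilities)
```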

Usage

Create your training data, or read it from a file, into a pandas data frame and build the model:
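For example (a self-contained sketch: the data is made up, and the ModelBuilder described above is replaced by an inline scikit-learn pipeline so the snippet runs on its own):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data; in practice you might use pd.read_csv(...) instead.
data = pd.DataFrame({
    "text": [
        "the hotel room was clean and quiet",
        "friendly staff at the hotel reception",
        "the restaurant served excellent pasta",
        "great food and a fine wine list",
    ],
    "category": ["hotel", "hotel", "restaurant", "restaurant"],
})

# Bag-of-words counts followed by a naive Bayes classifier.
model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
model.fit(data["text"], data["category"])
```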

Once this is done, use it to classify unknown documents:
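Classification is then a single predict call. Continuing the hedged example (a tiny model is rebuilt inline here so the snippet is self-contained):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Rebuild a minimal model with made-up data.
model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
model.fit(
    ["the hotel room was clean", "great pasta and fine wine"],
    ["hotel", "restaurant"],
)

# Classify previously unseen documents.
unknown = ["great pasta and wine", "a clean quiet hotel room"]
print(model.predict(unknown))
```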