What: Using bag of words to categorize text
Why: Build your own chatbot or classify documents
How: Using scikit-learn and pandas
Introduction
Bag of words is a simple classification approach that looks at the occurrence of (key) words in different classes of documents (the bags). A document to be classified is assigned to the class whose bag best matches the words found in the document.
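To make the counting idea concrete, here is a tiny sketch (the example texts are made up) of how documents turn into word-count vectors:

from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents become rows of word counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(["Hello there", "Goodbye, see you"])
print(vectorizer.vocabulary_)  # maps each word to a column index
print(counts.toarray())        # one count vector per document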
scikit-learn is a Python machine learning library with a very nice concept for handling data from preprocessing to model building: pipelines.
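A minimal sketch of the idea (the steps shown are just illustrative, the real pipeline follows below):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Each step hands its output to the next one
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', SGDClassifier()),
])
# pipe.fit(texts, labels) then runs vectorizer and classifier in order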
pandas is a Python library for storing data in table-like objects. It makes handling data within Python much easier.
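In this article the training data lives in a data frame with a Text and a Class column, for example:

import pandas

# One row per training document
data = pandas.DataFrame(columns=['Text', 'Class'])
data.loc[data.shape[0]] = ["Hello", "greeting"]
data.loc[data.shape[0]] = ["Bye", "farewell"]
print(data)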
The following is inspired by the scikit-learn documentation.
Code
For bag of words, a text has to be tokenized, the words have to be stemmed, and a classifier has to be built. nltk is used for the text processing. The SnowballStemmer used here also handles German, as long as the German module is downloaded. If you don’t mind the space, you can download all nltk data with:
sudo python3 -m nltk.downloader all
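To get a feeling for what tokenizing and stemming produce, here is a quick sketch (the sentence is made up and the exact stems can vary between nltk versions):

import nltk

stemmer = nltk.stem.SnowballStemmer('english')
tokens = nltk.tokenize.WordPunctTokenizer().tokenize("Goodbye, see you later!")
print([stemmer.stem(token) for token in tokens])
# prints something like: ['goodby', ',', 'see', 'you', 'later', '!']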
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
import pandas
import logging
import warnings

# Suppress some sklearn warnings
warnings.filterwarnings("ignore", category=FutureWarning)

class ModelBuilder:
    def __init__(self, language):
        self.__stemmer = nltk.stem.SnowballStemmer(language)
        # Set the level to INFO so the results below actually get logged
        logging.basicConfig(format='%(asctime)-15s %(message)s', level=logging.INFO)
        self.__logger = logging.getLogger('ModelBuilder')

    # Taken from: https://stackoverflow.com/q/26126442
    def __stem_tokens(self, tokens, stemmer):
        stemmed = []
        for item in tokens:
            stemmed.append(stemmer.stem(item))
        return stemmed

    # Taken from: https://stackoverflow.com/q/26126442
    def __tokenize(self, text):
        tokens = nltk.tokenize.WordPunctTokenizer().tokenize(text)
        stems = self.__stem_tokens(tokens, self.__stemmer)
        return stems

    def buildModel(self, data):
        # Taken from: http://scikit-learn.org/stable/auto_examples/model_selection/
        # grid_search_text_feature_extraction.html#
        # sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py
        pipeline = Pipeline([
            # Use the stemming tokenizer from above so different word forms
            # end up in the same bag
            ('vect', CountVectorizer(tokenizer=self.__tokenize)),
            ('tfidf', TfidfTransformer()),
            ('clf', SGDClassifier()),
        ])

        parameters = {
            'vect__max_df': (0.5, 0.75, 1.0),
            'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
            'clf__alpha': (0.00001, 0.000001),
            'clf__penalty': ('l2', 'elasticnet'),
            'clf__max_iter': (1000,),
            'clf__tol': (1e-3,),
            # Modified huber allows getting probabilities for the classes out.
            # See predict_proba for details
            'clf__loss': ('modified_huber',),
        }

        # find the best parameters for both the feature extraction and the
        # classifier
        grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=False)
        grid_search.fit(data.Text, data.Class)

        self.__logger.info("Best score: %0.3f" % grid_search.best_score_)
        self.__logger.info("Best parameters set:")
        best_parameters = grid_search.best_estimator_.get_params()
        for param_name in sorted(parameters.keys()):
            self.__logger.info("\t%s: %r" % (param_name, best_parameters[param_name]))

        return grid_search.best_estimator_
The code can be tested via the following snippet, which can be embedded as a self-test in the same script in which the ModelBuilder class is defined.
import unittest

class TestModelBuilder(unittest.TestCase):
    def setUp(self):
        self.__out = ModelBuilder('english')
        self.__testdata = pandas.DataFrame(columns=['Text', 'Class'])
        self.__testdata.loc[self.__testdata.shape[0]] = ["Hello", "greeting"]
        self.__testdata.loc[self.__testdata.shape[0]] = ["Hi", "greeting"]
        self.__testdata.loc[self.__testdata.shape[0]] = ["How are you", "greeting"]
        self.__testdata.loc[self.__testdata.shape[0]] = ["Bye", "farewell"]
        self.__testdata.loc[self.__testdata.shape[0]] = ["Goodbye", "farewell"]
        self.__testdata.loc[self.__testdata.shape[0]] = ["See you", "farewell"]

    def test_buildModel(self):
        classifier = self.__out.buildModel(self.__testdata)
        # predict returns an array with one class per input document
        self.assertEqual('farewell', classifier.predict(["See you"])[0])
        self.assertEqual('greeting', classifier.predict(["Hello"])[0])

suite = unittest.TestLoader().loadTestsFromTestCase(TestModelBuilder)
unittest.TextTestRunner().run(suite)
Instead of 'english', you can also use 'german' as the language, but then you need different test data. Please note that this is a simple example; for a real-world use case you need more categories and many more examples.
Instead of the class itself, the classifier can also output probabilities for each class, which may help with judging the quality of a classification, especially for input that was not covered by the training data.
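A sketch of how to line the probabilities up with their class labels (the input text is made up):

# classes_ holds the labels in the same order as the probability columns
probabilities = classifier.predict_proba(["See you soon"])[0]
for label, probability in zip(classifier.classes_, probabilities):
    print("%s: %.3f" % (label, probability))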
Usage
Create your training data or read it from a file into a pandas data frame, then build the model:
classifier=ModelBuilder(<language>).buildModel(<data>)
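If the data lives in a file, pandas can read it directly; here is a sketch assuming a hypothetical training_data.csv with Text and Class columns:

import pandas

# Hypothetical CSV file with the columns Text and Class
data = pandas.read_csv('training_data.csv')
classifier = ModelBuilder('english').buildModel(data)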
Once this is done, use it to classify unknown documents:
documentclass=classifier.predict([<text>])
documentclassProbabilities=classifier.predict_proba([<text>])
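Building on the probabilities, one way to handle documents the model has never seen is to reject predictions whose best probability stays below a threshold; a sketch (the threshold value is made up and should be tuned):

import numpy

probabilities = classifier.predict_proba(["Some unknown text"])[0]
# Made-up threshold, tune it on held-out data
if numpy.max(probabilities) < 0.5:
    print("No confident match, leaving the document unclassified")
else:
    print("Class:", classifier.classes_[numpy.argmax(probabilities)])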