Simple bag of words

What: Using bag of words to categorize text
Why: Build your own chatbot or classify documents
How: Using scikit-learn and pandas

Introduction

Bag of words is a simple classification approach which looks at the occurrence of (key) words in different classes of documents (the bags). The document to be classified is assigned to the class where the best match is found between the document's words and the words within that class's bag.
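The idea can be sketched in a few lines of plain Python. This is only an illustration of the matching step, not the scikit-learn based code of this post; the example sentences and the class names "weather" and "music" are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def build_bags(examples):
    """Collect one word-count bag per class from (text, label) pairs."""
    bags = {}
    for text, label in examples:
        bags.setdefault(label, Counter()).update(tokenize(text))
    return bags


def classify(text, bags):
    """Return the class whose bag shares the most word occurrences with the text."""
    words = Counter(tokenize(text))
    # Counter & Counter keeps the minimum count of every shared word.
    scores = {label: sum((words & bag).values()) for label, bag in bags.items()}
    return max(scores, key=scores.get)


# Hypothetical training data, two classes with two documents each.
examples = [
    ("what is the weather like today", "weather"),
    ("will it rain tomorrow", "weather"),
    ("play some music", "music"),
    ("play the next song", "music"),
]
bags = build_bags(examples)
print(classify("is it going to rain", bags))  # weather
```

A real classifier weights the words instead of just counting overlaps, but the principle of comparing a document against one bag per class stays the same.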

scikit-learn is a Python machine learning library with a very nice concept for handling data from preprocessing to model building: Pipelines.

pandas is a Python library which helps to store data in table-like objects. It makes the handling of data within Python much easier.

The following is inspired by the scikit-learn documentation.

Code

For bag of words, a text has to be tokenized, the words have to be stemmed and a classifier has to be built. nltk is used for text processing. The SnowballStemmer used here can also handle German, as long as the German module is downloaded. If you don’t mind the space, you can download all nltk data with:

The code can be tested via the following snippet, which can be embedded as a self-test in the same script where the ModelBuilder class is defined.

Instead of English, you can also use 'german' as the language, but you need different test data. Please note that this is a simple example. For a real-world use case you need more categories and examples.

Instead of the class, the classifier can output the probabilities for each class, which may help with judging the quality of a classification for data that was not included in the training data.
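Why probabilities help can be seen in a small sketch. It assumes a multinomial naive Bayes classifier with add-one smoothing, written in plain Python; which classifier the code of this post uses is not stated here, so treat the choice and all names below as assumptions.

```python
import math
from collections import Counter


def train(examples):
    """Count words per class and documents per class from (words, label) pairs."""
    word_counts, doc_counts = {}, Counter()
    for words, label in examples:
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
    return word_counts, doc_counts


def predict_proba(words, word_counts, doc_counts):
    """Return {label: probability} using naive Bayes with add-one smoothing."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for w in words:
            if w in vocabulary:  # words never seen in training are ignored
                score += math.log((counts[w] + 1) / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Normalise the log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: value / norm for label, value in exp_scores.items()}


# Hypothetical, already tokenized training data.
examples = [
    (["rain", "tomorrow"], "weather"),
    (["sunny", "weather", "today"], "weather"),
    (["play", "music"], "music"),
    (["next", "song"], "music"),
]
word_counts, doc_counts = train(examples)
print(predict_proba(["rain", "today"], word_counts, doc_counts))
print(predict_proba(["pizza"], word_counts, doc_counts))
```

For a document made only of unknown words, such as "pizza", both classes end up with the same probability. A flat distribution like this is a hint that the document is not covered by the training data, which a bare class label would hide.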

Usage

Create your test data or read it from a file into a pandas data frame and build the model:

Once this is done, use it to classify unknown documents:

German cities list

What: Extract a list of German cities and their federal states from Wikipedia
Why: Get a list of German cities for text processing
How: Using Beautifulsoup, Requests and Python

Introduction

Wikipedia contains a list of German cities and towns. This list is formatted in HTML and needs to be parsed before it can be processed automatically. Additionally, the federal state is given for each city.

Code

Below is the Python code for extracting the list. The URL and the page-specific search via Beautifulsoup are hard-coded. The Wikipedia page uses a 2-letter code for the federal states, which is mapped to the full state name.
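The mapping step can be sketched as follows. The sketch uses the html.parser module from the standard library instead of Beautifulsoup and Requests, and it parses a small inline sample instead of the live page. The markup of the sample (list items of the form "City (XX)") and the class name CityListParser are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Two-letter codes of some German federal states (excerpt).
STATE_NAMES = {"BY": "Bayern", "HE": "Hessen", "ST": "Sachsen-Anhalt"}


class CityListParser(HTMLParser):
    """Collect 'City (XX)' list items into a {city: state} dictionary."""

    def __init__(self):
        super().__init__()
        self.cities = {}
        self._text = None  # text buffer of the current <li>, None outside of one

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._text = ""

    def handle_data(self, data):
        if self._text is not None:
            self._text += data

    def handle_endtag(self, tag):
        if tag == "li" and self._text is not None:
            match = re.fullmatch(r"\s*(.+?)\s*\((\w{2})\)\s*", self._text)
            if match and match.group(2) in STATE_NAMES:
                self.cities[match.group(1)] = STATE_NAMES[match.group(2)]
            self._text = None


# Hypothetical excerpt, the real page may be structured differently.
sample = """
<ul>
  <li><a href="/wiki/Tegernsee">Tegernsee</a> (BY)</li>
  <li><a href="/wiki/Gernsheim">Gernsheim</a> (HE)</li>
  <li><a href="/wiki/Braunsbedra">Braunsbedra</a> (ST)</li>
</ul>
"""
parser = CityListParser()
parser.feed(sample)
print(parser.cities)
```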

The code can be tested via the following snippet, which can be embedded as a self-test in the same script where the CityList class is defined.

Usage

Use it from within python:

The output will be something like:

{...,
'Vohenstrauß': 'Bayern',
'Neuötting': 'Bayern',
'Eggenfelden': 'Bayern',
'Gernsheim': 'Hessen',
'Braunsbedra': 'Sachsen-Anhalt',
'Tegernsee': 'Bayern',
...}

Debugging in Jupyter notebook

What: Debugging directly in a Jupyter notebook while it is executed
Why: Faster/more secure development
How: Using IPython's built-in debug functionality

Note: This is a modified version of the following blog entry: https://kawahara.ca/how-to-debug-a-jupyter-ipython-notebook/.

Insert debug statement

The following line needs to be inserted at the location in a cell where you want to start debugging:
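A minimal sketch of such a statement, assuming Python 3.7 or newer, is the built-in breakpoint() call. Setups based on IPython commonly use `from IPython.core.debugger import set_trace; set_trace()` at the same place instead. The function running_total is only a made-up example to show the placement.

```python
def running_total(values):
    """Sum the values, pausing in the debugger before each addition."""
    total = 0
    for value in values:
        breakpoint()  # execution stops here and the debug prompt appears
        total += value
    return total
```

Calling running_total([1, 2, 3]) in a cell then stops at the marked line on every loop iteration. Setting the environment variable PYTHONBREAKPOINT to 0 turns all breakpoint() calls into no-ops without touching the code.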

Start debugging

Execute the cell. You will get a debug prompt. It behaves like an IPython shell. The following commands can be used to operate the debugger:

  • q: Quit and stop the program execution
  • c: Continue to the next breakpoint
  • n: Go to the next line

Happy debugging!

Docker on Raspbian

What: Getting Docker running without hassle on a Raspberry Pi 3
Why: Using Docker images on a Raspberry Pi
How: Using the ARM version of Docker and standard apt functionality

This is an extract from the Docker documentation, which worked for me on a Raspberry Pi 3 with Raspbian Jessie.

Install requirements

Prepare installation

Install

Test

Update Eclipse (from Neon to Oxygen)

What: Updating Eclipse without new installation
Why: Being up to date
How: Using update site and Oomph-Settings

Setting up Oomph

The first step is to tell Oomph which version of Eclipse should be used. Select from the menu: Navigate > Open Setup > Installation.

A new tab should open with the installation object. Select it and open the Properties view. Change the product version of Eclipse in the drop-down menu to Oxygen.

Adding Update site for oxygen

The second step involves adding the Oxygen update site. Select from the menu: Window > Preferences and open Install/Update > Available Software Sites. Add a new site with the Oxygen repository (http://download.eclipse.org/releases/oxygen/).

Click Apply and Close.

Update

Update via the standard Eclipse update mechanism. Select from the menu: Help > Check for Updates.

Perform the update as normal and restart. Eclipse should now start as Oxygen.

Links

What: A list of not-now useful links
Why: Limited memory 😉
How: –

Tools

Vagrant

Resize vagrant disks

Virtual Box

Resizing virtual box disk space, the easy way
GParted live image

Install guest additions in Ubuntu

Java

https://developers.redhat.com/blog/2017/03/14/java-inside-docker/

Ember

https://medium.com/@ruslanzavacky/ember-cli-fingerprinting-and-dynamic-assets-797a298d8dc6

Docker

https://developers.redhat.com/blog/2017/03/14/java-inside-docker/

Docker and iptables

Sound recording

http://www.upubuntu.com/2013/05/how-to-record-your-voice-from.html

Machine learning

deeplearnjs

Programming

Ternary numbers

Python 3 program to convert a decimal number to ternary (base 3)

Data

General

Pandas data profiling

Conversion

Convert audio to base64

Linux

Go back to older kernel after update

Add a second hard drive (also working for vms)

Ubuntu not starting to graphical mode, flickering

Hang at boot on old machines

See https://www.youtube.com/watch?v=ZZBTSUbzT0g.

Add acpi=force in /etc/default/grub in line GRUB_CMDLINE_LINUX_DEFAULT and apply settings:

Ubuntu

If you would like to use a USB serial connection (for example to connect an ESP32), make sure the linux-modules-extra package is installed (which is not the case in the default cloud versions of Ubuntu). You can install it via:

See also: https://askubuntu.com/a/1129260.

Windows

Links to user folders

1. Copy folder to new location
2.

Show symbolic links

See also: https://superuser.com/a/496155

VCS

Git

Beautiful log

Use it like:

Development

Joels Test

Hallway test

Python https server

Creating self-signed SSL certificates with OpenSSL

Serial console

https://www.cyberciti.biz/hardware/5-linux-unix-commands-for-connecting-to-the-serial-console/

Client – Server

Get server certificate

See: https://superuser.com/a/641396.

openssl s_client -showcerts -connect server.edu:443 </dev/null 2>/dev/null|openssl x509 -outform PEM >mycertfile.pem
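The same can be done from Python with ssl.get_server_certificate from the standard library, which returns the certificate of the server as PEM text. A small sketch, in which the helper name save_server_certificate and its fetch parameter are made up for illustration (the parameter only exists so that the function can be tried without network access):

```python
import ssl


def save_server_certificate(host, port, path, fetch=ssl.get_server_certificate):
    """Fetch the server's certificate as PEM text and write it to `path`."""
    pem = fetch((host, port))
    with open(path, "w") as handle:
        handle.write(pem)
    return pem


# Example call (needs network access, server.edu is the placeholder used above):
# save_server_certificate("server.edu", 443, "mycertfile.pem")
```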

Systemd

https://wiki.ubuntuusers.de/Howto/systemd_Service_Unit_Beispiel/

https://www.freedesktop.org/software/systemd/man/systemd.service.html

Activate caching for client side on apache2

Enable apache module:

Create .htaccess file in the specific directory (adjust max-age to the number of seconds you want the resource to be cached):

Set different port in apache2

See: https://stackoverflow.com/a/26064554.

Images

Identify image format

For format options see: ImageMagick.

Resize image

IoT

Raspberry and Ubuntu

Install Ubuntu on Raspberry

https://ubuntu.com/download/raspberry-pi

Getting audio to work over hdmi

Add the following line to /boot/firmware/usercfg.txt and reboot:

Increase gpu memory

Add the following line to /boot/firmware/usercfg.txt and reboot:

Mosquittos on the couch

What: Put mosquitto messages to the couch
Why: Using mosquitto broker as relay between your IoT devices and a database backend
How: Use mosquitto, curl and some Linux magic

Requirements

You need couchdb (assuming it runs locally on port 5984 for this example) and mosquitto (also assuming it runs locally for this example). If you don't have them on your system, have a look at my other blog entry. Additionally, you need curl and bash.

Set up a simple publisher

Create a simple script test.sh, which will publish messages periodically to the mosquitto broker under the topic test:

Change the permissions of this script so that you can execute it.

Create a test database

Connect mosquitto and couch via curl

Mosquitto and couchdb can be connected via a simple shell pipe:

Note: You could think about piping mosquitto directly to couch, if your message is already a json string. Something like this:

This will not work, because curl only starts processing its input once it is complete (after the stream from mosquitto is closed). You need the while read line construction as shown above.
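The line-by-line principle can also be sketched in Python with the standard library only. This is an illustration, not the shell pipe of this post. The database URL http://localhost:5984/test, the document layout {"message": ...} and the function names are assumptions.

```python
import json
import urllib.request


def build_request(line, db_url):
    """Wrap one message line in a JSON document and prepare a POST to CouchDB."""
    body = json.dumps({"message": line.rstrip("\n")}).encode("utf-8")
    return urllib.request.Request(
        db_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post(request):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request) as response:
        return response.status


def relay(stream, db_url, send=post):
    """Post every line as soon as it arrives instead of waiting for the stream to end."""
    count = 0
    for line in stream:  # yields line by line while the stream is still open
        if line.strip():
            send(build_request(line, db_url))
            count += 1
    return count
```

Saved as a script that ends with `relay(sys.stdin, "http://localhost:5984/test")` (plus `import sys`), it could be fed by `mosquitto_sub -t test` through a pipe. Each message becomes its own request, which is exactly what the single curl call cannot do.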

Run the test publisher script and verify results

Run the script:

Wait some seconds. Now query the database and you should have some documents there:

The result should look like:

Make a small animation based on existing images

What: Concatenate images to a video
Why: Small animation as logo, simple stop and go movie
How: Using ffmpeg to concat existing images into mp4 file

Requirements

The following uses ffmpeg on Debian Linux. There are also builds available for other platforms. Depending on the platform you may have to install the codec. The following is tested with a default Debian system.

Install ffmpeg with:

Create some images

Create a sequence of images. Let's assume we want to have a spinning wheel of the following kind for this tutorial:

[Images: the individual frames of the spinning wheel]

The important point is the numerical ordering of the image names like img1.png, img2.png, ….

Create the video

Create the video according to the documentation (assuming an image numbering like mentioned above):

The above command works because there are fewer than 10 images. If you have another image numbering (for example 001-999) you have to change the pattern:
– img%02d.png for 01-99
– img%03d.png for 001-999
– …
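The patterns follow the printf convention, which Python's % operator understands as well, so it is easy to check which file names a pattern stands for. A small sketch, where the helper frame_pattern is made up and assumes the img prefix and png extension used in this tutorial:

```python
def frame_pattern(frame_count, prefix="img", extension="png"):
    """Return a printf-style pattern wide enough for `frame_count` zero-padded frames."""
    width = len(str(frame_count))
    number = "%d" if width == 1 else "%0{}d".format(width)
    return "{}{}.{}".format(prefix, number, extension)


print(frame_pattern(8))        # img%d.png
print(frame_pattern(99))       # img%02d.png
print(frame_pattern(999) % 7)  # img007.png
```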

The final result will look like:

Useful ansible roles

What: Using Ansible to set up a development system with CouchDB and Docker
Why: Having a phoenix-like dev setup
How: Using Ansible and some simple roles to provision the system

Requirements

You need a system with Ansible installed. In case you don’t have one at hand, you can use the following Vagrantfile to set it up:

Preparing the playbook

Let's set up a simple playbook. Because software gets installed, become is needed to run the installation as root. Create a file called playbook.yml with the following content (there are more roles in the repository, but these should be enough for the beginning):

The roles

Clone the following git repository and change to the directory usefulansibleroles. Copy the roles-folder next to your playbook file.

Note: The install-couch role will install couchdb via docker (see https://hub.docker.com/r/klaemo/couchdb/) in version 2.0. Docker will be set up to restart couchdb at every boot.

Run playbook

Run the playbook. You can use a hosts file at /etc/ansible/hosts or run it locally:

Test

Connect to the provisioned machine. The following commands should give you correct results:

Using weather forecast data of DWD for Europe

What: Extracting weather forecast data from DWD grib2 files with Java
Why: Using weather forecasts in your (Java) applications
How: Downloading the data from DWD and using NetCDF for parsing

Getting the data

The data (in the following, the 3-day forecast is used) is freely available via FTP from here. You have to register with a valid email address.

There are several different data sets on the server. The interesting one is the ICON model. It contains forecasts with 0.125 and 0.25 degree resolution for wind, temperature, precipitation and more. You find the data under the path /ICON/grib/europe.

There is a list of available content for the ftp server here.

The data is published in 6-hour intervals at 5, 11, 17 and 23 o’clock. Each file is a bzip2-compressed grib2 file.

For this tutorial, download the file /ICON/grib/europe/ICON_GDS_europe_reg_0.250x0.250_T_2M_[yyyyMMddhh].grib2.bz2 (replace the date), which contains the temperature 2 meters above ground, and decompress it.
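Building the file name for a given date can be sketched like this. The sketch assumes that the separator in the resolution is a plain x and that hh is the hour in 24-hour form. The helper name forecast_file_name and the example date are made up, and whether the hour refers to the model run or to the publication time is not covered here.

```python
from datetime import datetime


def forecast_file_name(run_time, resolution="0.250", parameter="T_2M"):
    """Build the bz2 file name for a given time stamp, following the pattern above."""
    stamp = run_time.strftime("%Y%m%d%H")  # yyyyMMddhh, hour in 24-hour form
    return "ICON_GDS_europe_reg_{0}x{0}_{1}_{2}.grib2.bz2".format(
        resolution, parameter, stamp
    )


print(forecast_file_name(datetime(2017, 3, 14, 6)))
# ICON_GDS_europe_reg_0.250x0.250_T_2M_2017031406.grib2.bz2
```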

Parsing the data

You can find the full example here. Clone the repository, change to the weatherforecastdata directory and adapt it to your needs (hard-coded file name, …). After you have finished your changes run:

If you want to build it from scratch (it is simple), create a Maven project and add the following repository and dependency to your pom.xml:

NetCDF is able to read in the grib files. You can create a new instance of the data file via:

Each file contains dimensions, attributes and variables. The variables depend on the dimensions. For example: The temperature value depends on the latitude and longitude as well as on time and height above ground. The data can be retrieved in arrays, with the shape depending on the corresponding dimensions. Latitude, longitude, time and height above ground are 1-dimensional arrays while the temperature depends on all of them and thus is 4-dimensional.

You can get dimensions, attributes and variables via:

Let's concentrate on the variables. For each variable you can get the name, the units and the dimensions it depends on:

For the example above this will give for the temperature:

There is a variable called Temperature_height_above_ground, which depends on time (there are 79 different values for the time dimension), height_above_ground (with just one value, because we look at the temperature measured at 2 m above ground), latitude and longitude (with 301 and 601 different values, respectively).

This is enough information to retrieve the data from the file:

Iterating over the temperature values can now be done by iterating over each of the dimensions:
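The iteration pattern itself is independent of the library. The following Python sketch is not the NetCDF Java API; it only shows how the four dimensions map onto a flat, row-major block of values, with a tiny made-up shape instead of the real (79, 1, 301, 601).

```python
from itertools import product
from math import prod


def iterate(values, shape):
    """Yield ((time, height, lat, lon), value) for a flat, row-major list of values."""
    assert len(values) == prod(shape)
    # product() varies the last index fastest, which matches row-major order.
    for flat_index, indices in enumerate(product(*(range(n) for n in shape))):
        yield indices, values[flat_index]


# Hypothetical data: 2 time steps, 1 height, 2 latitudes, 3 longitudes.
shape = (2, 1, 2, 3)
temperatures = [270.0 + i for i in range(prod(shape))]
for (t, h, lat, lon), value in iterate(temperatures, shape):
    if lat == 0 and lon == 0:
        print(t, h, value)
```

With the real shape the loop would visit prod((79, 1, 301, 601)) values, one per combination of time, height, latitude and longitude.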

Have fun with the amazing DWD source of weather forecast data.
