Category archive: R

Drake, Drip & Data Science

What: Analysing data with a data workflow, fast JVM startup & CSV magic
Why: Building data analysis pipelines for (small) problems, where intermediate steps are automatically documented
How: Use Drake for (data) workflow management, Drip for fast JVM startup and csvkit for CSV magic.

In this post, I will show you how to build a small data analysis pipeline for analysing the references of a Wikipedia article (about data science). The result is the following simple image; all steps are automated and intermediate results are documented automatically.

[result.plot.png: histogram of the publishing years of the article's references]

In the end, you should have four artifacts documenting your work:

  • Drakefile: The workflow file, with which you can regenerate all other artifacts. It is plain text and thus works well with version control systems.
  • data.collect: The HTML content of the Wikipedia article, the source of the analysis
  • data.extract: The publishing years of the references with their number of occurrences
  • result.plot.png: A PNG of the publishing year histogram

Agenda

  1. Install the requirements
  2. Build the pipeline
  3. Run the pipeline

Install the requirements

You need the following tools:

  • Linux (Debian or Ubuntu, command-line tools)
  • Python (for HTML processing and CSV magic)
  • R (for plotting)
  • Java (for Drake)

You can install the dependencies easily with the script below. The steps were tested in a 64-bit Debian (Jessie) VM and should also work on Ubuntu; other distros may need some adaptation.

# Update
sudo apt-get update
 
# Install curl
sudo apt-get install -y curl
 
# Install tools needed to fetch and build Drip
sudo apt-get install -y git build-essential
 
# Install R
sudo apt-get install -y r-base
sudo apt-get install -y r-cran-rcpp
sudo Rscript -e 'install.packages("ggplot2", repos="https://cran.uni-muenster.de/")'
 
# Install Java 8 (from jessie-backports)
sudo sh -c 'echo deb http://ftp.de.debian.org/debian jessie-backports main >> /etc/apt/sources.list'
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk
 
# Install Drip
mkdir -p ~/.lib ~/.bin
git clone https://github.com/flatland/drip.git ~/.lib/drip
cd ~/.lib/drip
make prefix=~/.bin install
 
# Download & set up Drake (a small wrapper script that starts Drake via Drip)
wget https://github.com/Factual/drake/releases/download/1.0.3/drake.jar -O ~/.lib/drake.jar
cat << 'EOF' > ~/.bin/drake
#!/bin/bash
drip -cp ~/.lib/drake.jar drake.core "$@"
EOF
chmod u+x ~/.bin/drake
echo 'export PATH=~/.bin:$PATH' >> ~/.bashrc
 
# Install csvkit
sudo apt-get install -y python-pip
sudo pip install csvkit

Build the pipeline

Drake is controlled by a so-called Drakefile. Let us define three steps for the data processing:

  1. Data collection (HTML from Wikipedia)
  2. Data extraction (extracting the reference texts and their publishing years)
  3. Plotting results (plotting the results to a PNG)

1. Data collection

The first step can be done with standard Linux command-line tools, so we can create the Drakefile with the first step right away:

data.collect <- [-timecheck]
  curl -k https://de.wikipedia.org/wiki/Data_Science > $OUTPUT

Drake (mostly) takes its input from files and writes the output of each step to files again. Thus, the result of each step in a workflow is documented automatically.

This first step downloads the HTML of the data science article from the German Wikipedia and stores it in a file called data.collect (the [-timecheck] prevents the step from being re-run every time drake is started, since it has no input from previous steps). You can already run this workflow with the drake command:

drake

This will generate a file called data.collect containing the HTML of the Wikipedia page.

2. Data extraction

For the data extraction from HTML, Python & BeautifulSoup are used. Extracting the year from each reference can then be done with standard Linux tools (for example grep). Thus, the Python program should read from stdin, extract the reference texts and output plain text to stdout. Create a file called extract.py with the following content:

#!/usr/bin/env python3
 
from bs4 import BeautifulSoup
import sys
 
# Read the HTML from stdin and parse it
html = sys.stdin.read()
soup = BeautifulSoup(html, 'html.parser')
 
# The references are the list items of the ordered list with class "references"
entries = soup.find('ol', {'class': 'references'}).findAll('li')
 
# Print the plain text of each reference
for entry in entries:
    print(entry.getText())

Make the script file executable with:

chmod u+x ./extract.py

You can test the script with the following command (if you ran the workflow from step 1 before):

cat data.collect | ./extract.py

Now, let us extend the Drakefile to use the Python script, search for years with a regex, count the occurrences of each year, reorder the columns and add a header:

data.collect <- [-timecheck]
  curl -k https://de.wikipedia.org/wiki/Data_Science > $OUTPUT

data.extract <- data.collect
  cat $INPUT | ./extract.py | grep -o -P '\b\d{4}\b' | sort | uniq -c | sort -nr | \
    sed 's/^[ ]*//g' | csvcut -c2,1 -d " " | printf 'year, occ'"\n$(cat)" > $OUTPUT

If you run this workflow with:

drake

a new file called data.extract will be created which looks like the following:

year, occ
2015,15
2016,5
2013,5
2014,4
1997,3
2003,2
2002,2
2012,1
1145,1

Please note the wrongly detected date 1145.

You can filter out such implausible dates with csvsql (I really recommend this tool) from the csvkit suite by extending the Drakefile with some simple SQL:

data.collect <- [-timecheck]
  curl -k https://de.wikipedia.org/wiki/Data_Science > $OUTPUT

data.extract <- data.collect
  cat $INPUT | ./extract.py | grep -o -P '\b\d{4}\b' | sort | uniq -c | sort -nr | \
    sed 's/^[ ]*//g' | csvcut -c2,1 -d " " | printf 'year, occ'"\n$(cat)" | \
    csvsql --query 'select * from stdin where year>1900 and year<2100' > $OUTPUT

3. Plotting results

The final step is plotting the results. Let us create an R script which reads from stdin and plots the data to a PNG. Create the file plot.R with the following content:

#!/usr/bin/env Rscript
 
library(ggplot2)
 
# Read the csv data from stdin; the first column is treated as a factor (x axis)
data <- read.table(pipe('cat /dev/stdin'), header=T, sep=",")
data[,1]<-factor(data[,1])
names<-names(data)
 
p<-ggplot(data, aes(x=data[, 1], y=data[, 2]))+geom_bar(stat='identity')+
    theme(axis.text.x = element_text(angle = 90, hjust = 1))+
    xlab(names[1])+ylab(names[2])+theme(text = element_text(size=8))
 
# First argument: output file; optional second argument: plot title
args = commandArgs(trailingOnly=TRUE)
if(length(args)>1){
    t<-paste(strwrap(args[2], width=40), collapse = "\n")
    p<-p+ggtitle(t)
}
ggsave(args[1], plot=p, width = 2*1.618, height = 2)

Make the script file executable with:

chmod u+x ./plot.R

Now extend the Drakefile with the last step: image creation. When Drake is run again, it checks whether the output of a step is already up to date by looking for a file with the step's output name. Thus, the image file name and the step output should match:

data.collect <- [-timecheck]
  curl -k https://de.wikipedia.org/wiki/Data_Science > $OUTPUT

data.extract <- data.collect
  cat $INPUT | ./extract.py | grep -o -P '\b\d{4}\b' | sort | uniq -c | sort -nr | \
    sed 's/^[ ]*//g' | csvcut -c2,1 -d " " | printf 'year, occ'"\n$(cat)" | \
    csvsql --query 'select * from stdin where year>1900 and year<2100' > $OUTPUT

result.plot.png <- data.extract
  cat $INPUT | ./plot.R $OUTPUT \
    "Histogram of reference publishing dates for the Wikipedia data science article"

Again, run this step by executing the drake command.

Run the pipeline

Finally, to see all steps working together, delete the generated artifacts (data.collect, data.extract, result.plot.png) and run drake again:

vagrant@debian-jessie:/tmp$ drake
The following steps will be run, in order:
  1: /tmp/././data.collect <-  [missing output]
  2: /tmp/././data.extract <- /tmp/././data.collect [projected timestamped]
  3: /tmp/././result.plot.png <- /tmp/././data.extract [projected timestamped]
Confirm? [y/n] y
Running 3 steps with concurrence of 1...
 
--- 0. Running (missing output): /tmp/././data.collect <-
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 58980    0 58980    0     0   172k      0 --:--:-- --:--:-- --:--:--  196k
--- 0: /tmp/././data.collect <-  -> done in 0.54s
 
--- 1. Running (missing output): /tmp/././data.extract <- /tmp/././data.collect
--- 1: /tmp/././data.extract <- /tmp/././data.collect -> done in 0.28s
 
--- 2. Running (missing output): /tmp/././result.plot.png <- /tmp/././data.extract
--- 2: /tmp/././result.plot.png <- /tmp/././data.extract -> done in 0.75s
Done (3 steps run).

You should now have one file for each of the three workflow steps, where the file of the last step contains the image shown above.

R: Fast grouping

What: Fast calculation on R data structures
Why: Performance
How: Use data tables instead of data frames

Some days ago I struggled with performance issues in R. The task was the analysis of a larger text corpus (~600 MB of English text). The preprocessing was finally done in Java (which made the difference between unusable (R) and quite fast (Java)). The analysis itself was possible in R after switching to data tables instead of data frames.

The following snippet shows an example of calculating the mean on subgroups of the data:

library(data.table)
 
n<-10000000
test<-rnorm(n)
fac<-sample(1:5, n, replace = T)
 
df<-data.frame(col1=test, fac=fac)
dt<-data.table(col1=test, fac=fac)
 
# Grouped mean with the data frame
f_df<-function(){
  aggregate(col1 ~ fac, df, mean)
}
f_df()
system.time(f_df())
 
# Grouped mean with the data table
f_dt<-function(){
  dt[, mean(col1), by=fac]
}
f_dt()
system.time(f_dt())

The results are:

Data structure   User [s]   System [s]   Elapsed [s]
Data frame       8.13       0.97         9.09
Data table       0.12       0.01         0.14

Essentially, this is a speed-up of roughly a factor of 65 on my machine (Win 8.1, 8 GB RAM, i5).

Where to go next: Read this interesting blog post regarding data tables and performance. Also have a look at the fread function for reading in data, and at other nice features of data tables such as keys or assignment of columns by reference.
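
The snippet below is only a rough sketch of those features (it is not part of the analysis above, and the file name in the fread line is just a placeholder):

library(data.table)
 
# fread: very fast file reading, returns a data.table directly
# dt <- fread("corpus_counts.csv")   # placeholder file name
 
dt <- data.table(col1 = rnorm(1e6), fac = sample(1:5, 1e6, replace = TRUE))
 
# Keys: sort/index the table by a column for fast subsetting and joins
setkey(dt, fac)
dt[.(3)]                  # key-based lookup of group 3
 
# := adds or modifies a column by reference, without copying the table
dt[, col2 := 2 * col1]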

Pythagorean triples: Do it right

What: Minimal lines of code for finding integer-sided right triangles with side lengths below a given threshold
Why: Functional programming paradigm and vector handling in different languages
How: Write minimal examples in Frege, Java, SQL, R, Python and JavaScript. Please contribute!

Last week I went to a talk where Frege was introduced. Frege is a purely functional language based on Haskell. I once looked at Haskell, and the introductory example was the famous Pythagorean triples, which are also mentioned on the Frege page. I asked myself: how can this be done in Java, or R, or SQL?

Here is my list of implementations. Please contribute if you know more or have a better (shorter) version. Line breaks are inserted for better layout. All implementations return something like:

(3, 4, 5)
(4, 3, 5)
(6, 8, 10)
(8, 6, 10)

Frege

This is not tested. I am not sure what Frege makes of the inserted line breaks.

[ (a,b,c) |
  a <- [1..10],
  b <- [1..10],
  c <- [1..10],
  a*a + b*b == c*c ]

Java

Tested.

For the Java version, let's create a model class first. This makes the stream handling easier. It is just some sugar.

static class Triple {
  final int a, b, c;
 
  public Triple(int a, int b, int c) {
    this.a=a;this.b=b;this.c=c;
  }
 
  @Override
  public String toString() {
    return "Triple [a=" + a + ", b=" + b + ", c=" + c + "]";
  }
}

Now, let's write the logic:

  IntStream intStream = IntStream.range(0, 1000);
  intStream.boxed().map(number -> new Triple(
    (number/100)%10+1,
    (number/10)%10+1,
    (number/1)%10+1)).
  filter(triple -> Math.pow(triple.a, 2)+Math.pow(triple.b, 2)==Math.pow(triple.c, 2)).
  forEach(triple -> System.out.println(triple));

SQL (Oracle)

Tested.

SELECT a, b, c FROM
  (SELECT Level AS a FROM Dual CONNECT BY Level <=10),
  (SELECT Level AS b FROM Dual CONNECT BY Level <=10),
  (SELECT Level AS c FROM Dual CONNECT BY Level <=10)
WHERE POWER(a, 2)+POWER(b, 2)=POWER(c, 2)

R

Tested.

df=data.frame(a=1:10, b=1:10, c=1:10)
expanded=expand.grid(df)
subset(expanded, a**2+b**2==c**2)
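
A small variation (not part of the original list): if you do not want the mirrored duplicates such as (3, 4, 5) and (4, 3, 5), one additional condition is enough:

subset(expanded, a <= b & a**2+b**2==c**2)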

Python (3)

Tested.

import itertools
 
triples = [range(1, 11), range(1, 11), range(1, 11)]
valid=filter(
  lambda t: t[0]**2+t[1]**2==t[2]**2, list(itertools.product(*triples)))
print(*valid, sep="\n")

JavaScript

Tested.

Creation of filled arrays: See here.

Integer division: See here.

var numbers=Array.apply(null, {length: 1000}).map(Number.call, Number);
var triples=numbers.map(function(n){
  return {a: ~~(n/100)%10+1, b: ~~(n/10)%10+1, c: ~~(n/1)%10+1}
});
var valid=triples.filter(function(t){
  return Math.pow(t.a,2)+Math.pow(t.b,2)==Math.pow(t.c,2)
});
console.log(valid);

Fly me to the stars: Interactive graphs with Docker, Jupyter and R

What: Documented data analysis on the fly
Why: Keep documentation, implementation and operation in one document; create publication ready reports; explore data interactively
How: Use Docker (containerization), Jupyter (documenting and code execution), R (analysis) and Plotly (interactive plots, also very interesting for web-frontend JavaScript graphs)

You have to install Docker to run this tutorial; all other dependencies are installed automatically inside the container (otherwise you would have to install them manually). Additionally, you need a browser to access Jupyter. The interactive plots work best in Chrome; I had some issues running them with Firefox.

Copy the following content into a file called Dockerfile:

FROM ubuntu:xenial
 
RUN apt-get update
RUN apt-get install -y wget
RUN apt-get install -y bzip2
 
# Install python
RUN wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda3-4.0.0-Linux-x86_64.sh && bash Anaconda3-4.0.0-Linux-x86_64.sh -b -p $HOME/anaconda
 
# Install R
RUN apt-get install -y r-base
RUN apt-get install -y r-recommended
 
# Install requirements for Jupyter
RUN apt-get install -y libzmq3-dev
# See http://askubuntu.com/a/628288
RUN apt-get install -y libcurl4-openssl-dev
RUN apt-get install -y libssl-dev
RUN apt-get install -y inkscape
RUN apt-get install -y pandoc
RUN apt-get install -y texlive-latex-base
RUN apt-get install -y texlive-latex-extra
 
# Prepare R and Jupyter.
ADD installJupyter.R /root/installJupyter.R
RUN export PATH="$HOME/anaconda/bin:$PATH" && R CMD BATCH /root/installJupyter.R
RUN export PATH="$HOME/anaconda/bin:$PATH" && jupyter notebook --generate-config
# Allow all clients to connect. THIS IS NOT SECURE!
RUN sed -i s/.*NotebookApp\.ip.*localhost.*/c.NotebookApp.ip=\'*\'/g ~/.jupyter/jupyter_notebook_config.py
RUN mkdir /root/jupyternotebooks
 
# Set the locale (german)
RUN echo >> /etc/locale.gen de_DE.UTF-8 UTF-8
RUN echo >> /etc/locale.gen de_DE ISO-8859-1
RUN echo >> /etc/locale.gen de_DE@euro ISO-8859-15
RUN locale-gen
RUN echo "Europe/Berlin" | tee /etc/timezone
RUN dpkg-reconfigure --frontend noninteractive tzdata
 
# Taken from: http://jaredmarkell.com/docker-and-locales/ and adapted
#RUN apt-get update && apt-get install -y language-pack-de-base && locale-gen de_DE && echo >> ~.profile export LANG=de_DE.ISO-8859-1
ENV LANG de_DE.iso88591
ENV LANGUAGE de_DE.iso88591
ENV LC_ALL de_DE.iso88591
 
# Start Jupyter on port 8888
EXPOSE 8888
CMD export PATH="$HOME/anaconda/bin:$PATH" && cd /root/jupyternotebooks && jupyter notebook --no-browser

In the same directory, you need the file installJupyter.R with the following content:

install.packages("digest", repos="https://cran.uni-muenster.de/")
install.packages("uuid", repos="https://cran.uni-muenster.de/")
install.packages("jsonlite", repos="https://cran.uni-muenster.de/")
install.packages("base64enc", repos="https://cran.uni-muenster.de/")
install.packages("evaluate", repos="https://cran.uni-muenster.de/")
install.packages("dplyr", repos="https://cran.uni-muenster.de/")
install.packages("stringr", repos="https://cran.uni-muenster.de/")
install.packages("plotly", repos="https://cran.uni-muenster.de/")
install.packages(c("rzmq","repr","IRkernel","IRdisplay"), repos = c("http://irkernel.github.io/", "https://cran.uni-muenster.de/"), type = "source")
IRkernel::installspec()

Now, run the following two docker commands in the same directory where the docker file is located:

docker build -t docker-jupyter .
docker run --rm -v <Path to a folder for the jupyter notebooks>:/root/jupyternotebooks -p 8888:8888 docker-jupyter

You should now have the running Docker instance available on the Docker VM on port 8888:
[Screenshot: the Jupyter notebook overview page]

Via the „Upload“ button, upload the notebook from here: Interactive graphs with Jupyter and R, and open it. Execute all steps via „Cell“ -> „Run all“. Now it should look like this:

[Screenshot: the notebook with an interactive Plotly graph]

Did you notice the interactive plot? You can zoom, export as PNG, get tooltips, … all without programming anything. Cool, isn’t it?

OK. Your customer is impressed but does not like the modern HTML stuff? He wants to print it out and send it away via mail? No problem. Just change the interactive plots to non-interactive ones and export to PDF via „File“ -> „Download as“ -> „PDF via LaTeX“ (that’s why the Dockerfile above contains all the stuff like pandoc, LaTeX, …). You will get a nearly publication-ready report out.
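
For orientation, a notebook cell could look roughly like the following sketch (this is not the exact content of the linked notebook): ggplotly() turns a normal ggplot into an interactive Plotly graph, while printing the plain ggplot object gives the static version that survives the PDF export.

library(ggplot2)
library(plotly)
 
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
 
ggplotly(p)   # interactive version for the browser
p             # static version, safe for "PDF via LaTeX"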

Win a card game against your kids with OCR and statistics

What: OCR & R; Analyze standardized hardcopy forms electronically
Why: Win a card game with a lot of cards to remember (car quartet)
How:

You need:

  • The card game
  • A scanner
  • Gimp
  • A linux machine
  • An hour of free time

1. Set up a virtual machine or use an existing Linux

I used Ubuntu Xenial (64 bit). Maybe you have to adapt the steps a little bit.

  1. Install tesseract
      sudo apt-get update
      sudo apt-get install -y tesseract-ocr tesseract-ocr-deu
  2. Install Java
      sudo apt-get install -y openjdk-8-jdk

2. Scan the cards

Scan all the cards, one image per card. Make sure you always place the cards at the same position in the scanner (for example the upper right corner). All images should have the same resolution and size, so that the same region in the image corresponds to the same region on every card (for example the name field). The cards can look like this (you can guess what game I played ;-)):

You have to tweak the images a little bit to get good OCR results. Here is what I did:

  1. Use Gimp to blur the images (Filter ⇒ Blur ⇒ Blur)

3. Basic image processing with Java & tesseract

The cards I scanned have some defined regions with numerical or text values (see figure above). You can enhance the OCR results dramatically if you know where to look for the text. Create a small class containing the information about each region. This class should also contain a flag indicating whether the region holds text or numerical values.

public class ImageInfo {
  public final String name;
  public final boolean number;
  public final int leftUpX;
  public final int leftUpY;
  // Despite their names, these two values are used as the width and height
  // of the region (see the getSubimage call in step 3 below).
  public final int rightDownX;
  public final int rightDownY;
 
  public ImageInfo(String name, boolean number, int leftUpX, int leftUpY, int rightDownX, int rightDownY) {
    this.name = name;
    this.number = number;
    this.leftUpX = leftUpX;
    this.leftUpY = leftUpY;
    this.rightDownX = rightDownX;
    this.rightDownY = rightDownY;
  }
}

Set up the regions for your images as you like. That is what I used for my cards (coordinates are pixels in the scanned image).

final static List<ImageInfo> infos = Arrays.asList(new ImageInfo[] {
    new ImageInfo("Name", false, 160, 158, 460, 40),
    new ImageInfo("Geschwindigkeit", true, 200, 685, 70, 35),
    new ImageInfo("Hubraum", true, 460, 685, 110, 40),
    new ImageInfo("Gewicht", true, 180, 790, 110, 40),
    new ImageInfo("Zylinder", true, 475, 790, 60, 40),
    new ImageInfo("Leistung", true, 200, 895, 70, 40),
    new ImageInfo("Umdrehung", true, 470, 895, 90, 40)
  });

Now, process each image. For each image I did the following steps:

  1. Read in the image
    BufferedImage image = ImageIO.read(imageFile);
  2. Transform the image to black/white ⇒ better OCR results
    int width = image.getWidth();
    int height = image.getHeight();
    for (int y = 0; y < height; y++) {
      for (int x = 0; x < width; x++) {
        int rgb = image.getRGB(x, y);
        int blue = rgb & 0xff;
        int green = (rgb & 0xff00) >> 8;
        int red = (rgb & 0xff0000) >> 16;
 
        // Bright (near white) pixels become black, everything else white.
        if (red > 210 && blue > 210 && green > 210) {
          image.setRGB(x, y, new Color(0, 0, 0).getRGB());
        }
        else {
          image.setRGB(x, y, new Color(255, 255, 255).getRGB());
        }
      }
    }
  3. Do some basic clipping to extract one image for each region of the card you are interested in, and start tesseract for each of these images.
    for (ImageInfo imageInfo : infos) {
      File outputFile = ...; // the output PNG file for this region; use the same directory for the same card
      // Clip the region from the card and write it to its own PNG.
      ImageIO.write(image.getSubimage(imageInfo.leftUpX, imageInfo.leftUpY, imageInfo.rightDownX, imageInfo.rightDownY), "png", outputFile);
      ProcessBuilder processBuilder;
      // -psm 6 works best for my case of one line of text in each image.
      // Remember: each image contains ONE numerical or text value from the card.
      if(imageInfo.number){
        // Restrict tesseract to digits.
        processBuilder = new ProcessBuilder("tesseract", "-psm", "6", outputFile.toString(),
          path.resolve(outputFileNameBase + "_"+imageInfo.name).toString(), "digits");
      }
      else{
        processBuilder = new ProcessBuilder("tesseract", "-psm", "6", outputFile.toString(),
            path.resolve(outputFileNameBase + "_"+imageInfo.name).toString());
      }
      processBuilder.start();
    }
  4. Collect the results from the text file output of tesseract (or read directly from the output stream of the process); see the sketch below. Maybe there is also a batch mode of tesseract?
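
For illustration, collecting the per-region text files could look roughly like the following R sketch; the directory name and the "<card>_<attribute>.txt" naming scheme are assumptions, not the original code:

# Gather tesseract's per-region text outputs into one long data frame.
files <- list.files("ocr-output", pattern = "\\.txt$", full.names = TRUE)
base  <- sub("\\.txt$", "", basename(files))
card      <- sub("_[^_]+$", "", base)   # part before the last "_"
attribute <- sub("^.*_", "", base)      # part after the last "_"
value <- vapply(files, function(f)
  trimws(paste(readLines(f, warn = FALSE), collapse = " ")), character(1))
results <- data.frame(card, attribute, value, stringsAsFactors = FALSE)
# reshape() or similar can then turn this into the one-row-per-card csv used below.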

4. Analyze the results with R

Let's assume you now have the results in a CSV file like this:

Geschwindigkeit;Gewicht;Hubraum;Leistung;Name;Umdrehung;Zylinder
120;19.000;12.800;440;Volvo FH12-440;2.200;6
...

You can read in the values easily in R with:

# The 5th column contains the names of the cars in my example; it is used for the row names.
cards=read.csv("share/result.csv", sep=";", header=T, row.names=5)

This is read in as a data frame. You can get a first impression of the best values with the summary function:

summary(cards)

Write a simple function to get the values for all categories:

bestCandidates=function(attribute){
  subset(cards, cards[attribute]==max(cards[attribute], na.rm=T))
}

And apply it:

dimensions=names(cards)
lapply(dimensions, bestCandidates)
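
Since the card names are used as row names (row.names=5 above), you can also pull out just the card to remember for a single attribute:

rownames(bestCandidates("Leistung"))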

And voilà: you only have to remember the few cards with the highest values. Next time I will ask for „Zylinder“ in case I have the „Scania R620“.

[[6]]
            Geschwindigkeit Gewicht Hubraum Leistung Umdrehung Zylinder
Scania R620             125      26    15.6      620       1.9        8