German cities list

What: Extract a list of German cities and their federal states from Wikipedia
Why: Get a list of German cities for text processing
How: Using BeautifulSoup, Requests and Python

Introduction

The German Wikipedia contains a list of German cities and towns. The list is formatted as HTML and has to be parsed before it can be processed automatically. For each city, the page also gives the federal state (Bundesland) it belongs to.

Code

Below is the Python code for extracting the list. The URL and the page-specific search via BeautifulSoup are hard-coded. The Wikipedia page uses two-letter abbreviations for the federal states, which the code maps to the full state names.

import requests
from bs4 import BeautifulSoup

class CityList:
    def __init__(self):
        self.__countries={
            'BY':'Bayern',
            'BW':'Baden-Württemberg',
            'NW':'Nordrhein-Westfalen',
            'HE':'Hessen',
            'SN':'Sachsen',
            'NI':'Niedersachsen',
            'RP':'Rheinland-Pfalz',
            'TH':'Thüringen',
            'BB':'Brandenburg',
            'ST':'Sachsen-Anhalt',
            'MV':'Mecklenburg-Vorpommern',
            'SH':'Schleswig-Holstein',
            'SL':'Saarland',
            'HB':'Bremen',
            'BE':'Berlin',
            'HH':'Hamburg'
        }
        
    def retrieveGermanList(self):
        r = requests.get('https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland')
        soup = BeautifulSoup(r.content, "html5lib")
        
        cities={}
        tables=soup.find_all('table')
        for t in tables:
            lis=t.find_all('dd')
            for l in lis:
                # The state code follows the city name in brackets, e.g. "(BY)".
                # Some entries carry an extra remark: "(SN, Landeshauptstadt)".
                countryShort=None
                additional=l.contents[1].split('(')[1].split(')')[0].strip()
                if ',' in additional:
                    countryShort=additional.split(',')[0]
                else:
                    countryShort=additional
                # Map the two-letter code to the full state name.
                cities[l.find('a').contents[0]]=self.__countries[countryShort]
                
        return cities
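
The bracket handling inside the loop can be tried without a network request. The following is only a minimal sketch, assuming the text after the city link looks like " (SN, Landeshauptstadt)" or " (BY)". The helper name extract_state_code is hypothetical and not part of the class above.

```python
def extract_state_code(text):
    """Return the two-letter code from a string like ' (SN, Landeshauptstadt)'."""
    # Take whatever sits between the first '(' and the following ')'.
    inside = text.split('(')[1].split(')')[0].strip()
    # Drop any remark after the first comma, e.g. 'Landeshauptstadt'.
    return inside.split(',')[0].strip()


print(extract_state_code(' (SN, Landeshauptstadt)'))  # SN
print(extract_state_code(' (BY)'))                    # BY
```

A string without brackets raises an IndexError in this sketch, just as it would in the loop above.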

The code can be tested with the following snippet. It can be embedded as a self-test in the same script that defines the CityList class. The test queries the live Wikipedia page, so it needs network access.

import unittest

class TestCityList(unittest.TestCase):
    
    def setUp(self):
        self.__out=CityList()

    def test_retrieveGermanList(self):
        # Fetch the page once and check a few known city/state pairs.
        cities=self.__out.retrieveGermanList()
        self.assertEqual('Sachsen', cities['Dresden'])
        self.assertEqual('Sachsen', cities['Görlitz'])
        self.assertEqual('Bayern', cities['München'])
        self.assertEqual('Hamburg', cities['Hamburg'])

suite = unittest.TestLoader().loadTestsFromTestCase(TestCityList)
unittest.TextTestRunner().run(suite)

Usage

Use it from within Python:

CityList().retrieveGermanList()

The output will be something like:

{...,
'Vohenstrauß': 'Bayern',
'Neuötting': 'Bayern',
'Eggenfelden': 'Bayern',
'Gernsheim': 'Hessen',
'Braunsbedra': 'Sachsen-Anhalt',
'Tegernsee': 'Bayern',
...}
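
For text processing it can be handy to look at the result per state. The following is a minimal sketch, assuming the result is a plain dict that maps city names to state names. The sample data is copied from the output excerpt above rather than fetched live, and group_by_state is a hypothetical helper.

```python
from collections import defaultdict


def group_by_state(cities):
    """Invert a {city: state} dict into {state: [city, ...]} with sorted city lists."""
    grouped = defaultdict(list)
    for city, state in cities.items():
        grouped[state].append(city)
    return {state: sorted(names) for state, names in grouped.items()}


# Sample taken from the output excerpt above, not from a live query.
sample = {
    'Vohenstrauß': 'Bayern',
    'Neuötting': 'Bayern',
    'Eggenfelden': 'Bayern',
    'Gernsheim': 'Hessen',
    'Braunsbedra': 'Sachsen-Anhalt',
    'Tegernsee': 'Bayern',
}

by_state = group_by_state(sample)
print(by_state['Hessen'])  # ['Gernsheim']
```

With the real result, group_by_state(CityList().retrieveGermanList()) would be used in the same way.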