What: Extract a list of German cities and their federal states from Wikipedia
Why: Get a list of German cities for text processing
How: Using BeautifulSoup, Requests and Python
Introduction
Wikipedia contains a list of German cities and towns. The list is published as HTML and has to be parsed before it can be processed automatically. Each city entry also names the federal state it belongs to.
Code
Below is the Python code for extracting the list. The URL and the page-specific BeautifulSoup lookups are hard-coded. The Wikipedia page uses a two-letter abbreviation for each federal state, which the code maps to the full state name.
    import requests
    from bs4 import BeautifulSoup


    class CityList:
        def __init__(self):
            # Two-letter abbreviations used on the Wikipedia page,
            # mapped to the full names of the federal states.
            self.__countries = {
                'BY': 'Bayern',
                'BW': 'Baden-Württemberg',
                'NW': 'Nordrhein-Westfalen',
                'HE': 'Hessen',
                'SN': 'Sachsen',
                'NI': 'Niedersachsen',
                'RP': 'Rheinland-Pfalz',
                'TH': 'Thüringen',
                'BB': 'Brandenburg',
                'ST': 'Sachsen-Anhalt',
                'MV': 'Mecklenburg-Vorpommern',
                'SH': 'Schleswig-Holstein',
                'SL': 'Saarland',
                'HB': 'Bremen',
                'BE': 'Berlin',
                'HH': 'Hamburg'
            }

        def retrieveGermanList(self):
            r = requests.get('https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland')
            # The "html5lib" parser requires the html5lib package to be installed.
            soup = BeautifulSoup(r.content, "html5lib")
            cities = {}
            for t in soup.find_all('table'):
                for l in t.find_all('dd'):
                    # The abbreviation is in brackets after the city name.
                    # Some cities are listed like: SN, Landeshauptstadt
                    additional = l.contents[1].split('(')[1].split(')')[0].strip()
                    # Keep only the part before the first comma, if there is one.
                    countryShort = additional.split(',')[0].strip()
                    # The mapping is an instance attribute, so it needs the self prefix.
                    cities[l.find('a').contents[0]] = self.__countries[countryShort]
            return cities
The code can be tested with the following snippet, which can be embedded as a self-test in the same script that defines the CityList class. Note that the test fetches the live Wikipedia page.
    import unittest


    class TestCityList(unittest.TestCase):
        def setUp(self):
            self.__out = CityList()

        def test_retrieveGermanList(self):
            # Fetch the page once and reuse the result for all assertions.
            cities = self.__out.retrieveGermanList()
            self.assertEqual('Sachsen', cities['Dresden'])
            self.assertEqual('Sachsen', cities['Görlitz'])
            self.assertEqual('Bayern', cities['München'])
            self.assertEqual('Hamburg', cities['Hamburg'])


    suite = unittest.TestLoader().loadTestsFromTestCase(TestCityList)
    unittest.TextTestRunner().run(suite)
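Because the test above depends on the live page, it can help to check the bracket parsing on its own, without any network access. The following is only a sketch: `extract_state_code` is a hypothetical helper that is not part of the CityList class, and it assumes the text after the city link looks like " (SN, Landeshauptstadt)" or " (BY)", as described in the code comments.

```python
def extract_state_code(text):
    """Return the abbreviation inside the first pair of brackets, or None.

    Hypothetical helper for illustration; it mirrors the string handling
    in retrieveGermanList but tolerates text without brackets.
    """
    _, opening, rest = text.partition('(')
    if not opening:
        # No opening bracket found, so there is nothing to extract.
        return None
    inside = rest.partition(')')[0]
    # Entries such as "SN, Landeshauptstadt" carry extra text after a comma.
    return inside.split(',')[0].strip()


print(extract_state_code(' (SN, Landeshauptstadt)'))  # SN
print(extract_state_code(' (BY)'))                    # BY
print(extract_state_code('no brackets here'))         # None
```

Returning None instead of raising an IndexError is a design choice of this sketch; the caller then has to decide whether to skip such entries.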
Usage
Use it from within Python:
CityList().retrieveGermanList()
The output will be something like:
{...,
'Vohenstrauß': 'Bayern',
'Neuötting': 'Bayern',
'Eggenfelden': 'Bayern',
'Gernsheim': 'Hessen',
'Braunsbedra': 'Sachsen-Anhalt',
'Tegernsee': 'Bayern',
...}
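For text processing it is often handy to look up the cities of one state rather than the state of one city. A minimal sketch of inverting the result, assuming the mapping has the shape shown above; `group_by_state` is a hypothetical helper and the sample data is copied from the excerpt, not fetched from Wikipedia.

```python
from collections import defaultdict


def group_by_state(cities):
    """Turn {city: state} into {state: sorted list of cities}."""
    grouped = defaultdict(list)
    for city, state in cities.items():
        grouped[state].append(city)
    return {state: sorted(names) for state, names in grouped.items()}


# Sample taken from the output excerpt above.
sample = {
    'Vohenstrauß': 'Bayern',
    'Neuötting': 'Bayern',
    'Eggenfelden': 'Bayern',
    'Gernsheim': 'Hessen',
    'Braunsbedra': 'Sachsen-Anhalt',
    'Tegernsee': 'Bayern',
}

by_state = group_by_state(sample)
print(by_state['Bayern'])
# ['Eggenfelden', 'Neuötting', 'Tegernsee', 'Vohenstrauß']
```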