Playing with the data locally

It might not be obvious, but the main goal of Parltrack is to liberate the data from the many horrible ways in which it is published by the European Parliament.

Context

This website, where you can browse some of the data, is an editorialized interpretation. It is meant to help the countless underfunded NGOs that engage on a daily basis with law-making and with resourceful lobbyists in Brussels. It is not meant for the broad public: using it requires an understanding of the workings of the legislative process, which is sadly not common knowledge. If you want to know more about how this legislative process works, the European Parliament has a good explainer; EDRi also has a very good guide [PDF].

Don't Scrape, Download Dumps

Anyway, if you wanna start playing with the Parltrack data, do not run your own scrapers; that just puts unnecessary load on the EP webservers. Just download the data from the Parltrack dumps section. So far so good: you have the data, and you might even have looked at the schema to see what to expect in each of the dumps. But how do you start digging deeper into the data?

History

Back when we were rewriting Parltrack, one of our goals was to get rid of MongoDB. We struggled to migrate the data into a schemaful PostgreSQL database and failed, so we gave up on that. But that did not mean we gave up on moving away from MongoDB. As a joke we tried simply loading all the data directly into memory, and it actually worked; that was when the joke turned serious, and our no-db backend was born. This backend is really just a simple server that you connect to via a Unix domain socket to make "queries". We put an IPython console in there for good measure, so we can play with the data directly. This no-db console is the subject of this post.
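
To give a flavour of what talking to a server over a Unix domain socket looks like, here is a minimal generic sketch in Python. The socket path and the idea of sending a raw query string are illustrative assumptions, not Parltrack's actual wire protocol:

import socket

SOCKET_PATH = "/tmp/parltrack.sock"  # hypothetical path, for illustration only

def query(payload: bytes) -> bytes:
    # Open a stream connection to the Unix domain socket and send the query.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(SOCKET_PATH)
        s.sendall(payload)
        s.shutdown(socket.SHUT_WR)  # signal that the request is complete
        # Read the response until the server closes the connection.
        chunks = []
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)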

Warning

To load all the dumps into memory at once (as of this writing, at the beginning of November 2020) you need a bit more than 13 GB of memory. It's OK if you have less, as long as you have enough virtual memory, but in that case things might be much slower. It is also possible to load only the dumps you are interested in; for example, the MEPs dump (without the MEP activities) takes only about 800 MB in memory.
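
If you just want a rough look at a single dump before firing up the console, you can also load it into plain Python yourself. A minimal sketch, assuming you have already downloaded and decompressed ep_meps.json as described below, and that the dump is one big JSON array of records:

import json

# Load one decompressed dump; this reads the whole file into memory at once.
with open("db/ep_meps.json") as f:
    meps = json.load(f)

print(len(meps), "MEP records loaded")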

Setup

So you wanna play with the Parltrack data, and you know some Python. Excellent, here's how to get started:

First you want to have a fresh copy of the Parltrack sources:

git clone https://github.com/parltrack/parltrack.git
cd parltrack

Then you want to create the db directory and download at least one of our dumps:

mkdir db
cd db
wget https://parltrack.org/dumps/ep_meps.json.lz
lzip -d ep_meps.json.lz
cd ..

Note that our dumps are compressed with lzip, which is a better compression format than the usual suspects. You can download it for Windows from here.

Once you have the sources and the data, you also need to install a few of the third-party dependencies required to get our no-db console running (assuming you already have ipython3 installed globally):

virtualenv --system-site-packages env
source env/bin/activate
pip install msgpack BeautifulSoup4 'cachecontrol[filecache]' sh

These are the minimum requirements; you can also install all the rest from the requirements.txt file if you feel like it, but for the console this is not necessary.

We should now be ready to start the console, which takes a bit of time:

python3 db.py

And off you go: write your first query, listing the Twitter addresses of all currently active MEPs:

[m['Twitter'] for m in IDXs["meps_by_activity"]['active'] if 'Twitter' in m]
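
As a quick variation on the same index, using only the fields we just saw, you can count how many of the active MEPs have a Twitter handle on record:

len([m for m in IDXs["meps_by_activity"]['active'] if 'Twitter' in m])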

Next time we'll have a look at where to go from here, and at what this mystical IDXs dictionary is.

Happy Hacking!

"Going from the dreaded mongodb backend to the no-db backend really made our lives easier" retrospects Maurice Moss, HS DevOps Counsellor.