How to Scrape a Parliament? - S02E23



The rewrite of Parltrack to version 2.0 aims to enable better reuse of EU Parliament data and to make more sense of it, for instance by keeping track of changes over time.

In Parltrack v1, information about the activities of Members of the European Parliament (MEPs) was stored in the MEP records and thus also in the MEP dump. This made it unusable for tracking changes with our standard "diffing" techniques.

In Parltrack v1, the activities of the MEPs made up the bulk of the MEP database (407 MB of the initial dump for MEP activities vs. 109 MB of other MEP data, including changelogs).

In v2 we decided to split the MEPs' activities into a separate table (ep_mep_activities) to facilitate better reuse of both datasets - meps and activities. You can check out the "schema" for this database here. Starting in v2, changes to the activities themselves are also logged in the database (changelog).
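To illustrate the idea of the split, here is a minimal sketch in Python. The field names ("activities", "UserID", "mep_id") and the function name are assumptions made for this example and are not taken from the actual Parltrack schema.

```python
def split_mep_record(record, activities_key="activities", id_key="UserID"):
    """Split a combined v1-style record into an MEP part and an activities part.

    Both key names are hypothetical; adjust them to the real dump layout.
    The input record is left untouched.
    """
    # Everything except the activities stays in the MEP record.
    mep = {k: v for k, v in record.items() if k != activities_key}
    # The activities get their own record, linked back by the MEP id.
    activities = {"mep_id": record[id_key], **record.get(activities_key, {})}
    return mep, activities


# Made-up example data, only to show the shape of the result.
combined = {
    "UserID": 1234,
    "Name": {"full": "Jane Example"},
    "activities": {"QP": [{"title": "Question 1"}]},
}
mep, activities = split_mep_record(combined)
print(mep)
print(activities)
```

With the two parts separated, a diff of the MEP record no longer has to wade through the much larger activities data, and each part can be changelogged on its own.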

However, since we never logged them before, we opted not to changelog the first release. Instead we publish two extra datasets for the historical record. The first contains the last result of the scraping before the EP website changed, with all info on MEPs and activities from the JSON source in one big JSON dump. The second archives the very first version of the activities after the change of the EP website. If you intend to use the changelogs to reproduce intermediary states, you might want this primal activities JSON dump as a reference for the very first state, which was preceded only by the empty state.

In other words, when you start tracking changes to something, you start with an empty object; on the first commit you store the difference between the empty object and the first state. In our activities dataset we omit this diff between the empty object and the first state, since the first state is quite huge.

This is a technical detail, but for people who want to reproduce previous versions of the activities it might be an important detail to note.
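For those who want to try this, here is a minimal sketch in Python of replaying a changelog on top of a baseline state. The changelog layout used here (a dict mapping ISO timestamps to lists of changes, each with "type", "path" and "data") is an assumption for illustration; check the real dumps for the exact format before relying on it.

```python
import copy


def apply_change(state, change):
    """Apply one change record to a nested dict/list structure in place."""
    *parents, last = change["path"]
    node = state
    for key in parents:  # walk down to the container holding the target
        node = node[key]
    kind = change["type"]
    if kind == "added":
        if isinstance(node, list):
            node.insert(last, change["data"])
        else:
            node[last] = change["data"]
    elif kind == "deleted":
        del node[last]
    elif kind == "changed":
        # Assumed convention: data is a pair (old value, new value).
        node[last] = change["data"][1]
    else:
        raise ValueError(f"unknown change type: {kind!r}")


def replay(baseline, changelog, upto):
    """Return the state at time `upto`, starting from the baseline dump.

    The baseline plays the role of the omitted first diff: it is the very
    first state, so only the later changes need to be applied.
    """
    state = copy.deepcopy(baseline)  # never modify the baseline itself
    for stamp in sorted(changelog):  # ISO timestamps sort chronologically
        if stamp > upto:
            break
        for change in changelog[stamp]:
            apply_change(state, change)
    return state
```

Called with the primal activities dump of one MEP as `baseline` and that MEP's changelog, `replay` would give the activities as they stood at the chosen timestamp, assuming the changes within each timestamp are stored in the order they must be applied.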

"Tracking and also logging the changes over time of the activities of MEPs opens up a whole new dimension to profiling our elected representatives. We can now glimpse who are the active and the passive MEPs, we can see which MEP prefers which instrument and of course also see the areas of interest which make MEPs do stuff," summarizes Leopold Kaczinsky, HTML fragment scraper extraordinaire.

Feel free to contact the Parltrack dev team if you have further questions about data structure and its potential re-use in your own applications and sites! <3