How to Scrape a Parliament? - S02E01

WTF EP HTML?!

Parltrack 2.0 attempts to make sense of the enormous mess of data contained in the European Parliament (EP) website. By "enormous mess" we mean that it sometimes feels as if effort has been put into ensuring that this parliamentary data is not easily accessible, readable, understandable and reusable, despite a recent update of the EP's website. Some may believe that the institution fails to exercise due diligence in enabling the people of the EU to use and reuse parliamentary data... Instead of attributing anything to malice or to incompetence, let's rather scrape, copy and re-order that mess!

Some of the trickiest data to make sense of is the MEPs' activities data: a sort of feed per member listing which dossiers they are working on as rapporteurs or shadow rapporteurs, their speeches, their oral and written questions, which amendments they have tabled or withdrawn, etc.

This data is essential to collect, archive and organize in order to build effective views of the activities of the parliament, its groups and its members, but also to track any changes, so that influence can be monitored and strict accountability enforced.

Cosmetic Changes and Omissions in the EP site

Before the website update in late 2018, the EP published the activities of MEPs as JSON feeds. These were dumped directly into the Parltrack database, but were neither part of the change tracking nor exposed on the web interface.
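To make "tracking of changes" concrete, here is a minimal sketch of comparing two snapshots of one activity record field by field. The function name `diff_records` and the field names are hypothetical, chosen for illustration; this is not Parltrack's actual change-tracking code.

```python
def diff_records(old, new):
    """Return (field, old_value, new_value) tuples for every field that differs."""
    changes = []
    for key in sorted(set(old) | set(new)):
        before = old.get(key)  # None when the field did not exist before
        after = new.get(key)   # None when the field has been removed
        if before != after:
            changes.append((key, before, after))
    return changes


# Hypothetical snapshots of the same record on two different days.
yesterday = {"type": "REPORT", "title": "Example dossier", "role": "rapporteur"}
today = {"type": "REPORT", "title": "Example dossier", "role": "shadow rapporteur"}

print(diff_records(yesterday, today))
# → [('role', 'rapporteur', 'shadow rapporteur')]
```

Storing such change lists alongside a timestamp is one simple way to keep a history of what was edited and when.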

After the cosmetic changes to the EP website in late 2018, the activities of MEPs are exposed as HTML snippets, which are much less accessible and reusable. Also notable is that some information that was part of the JSON dump is now missing from the HTML source! Complete records were missing: for example, one MEP did not have their rapporteurship for a dossier listed, and some motions were absent from the new site! This did not look like systematic omission; rather, it seemed that the data had been merged manually and some records had been forgotten in the process.
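Omissions like these can be spotted by comparing the records from the old JSON dump with those scraped from the new HTML. The sketch below assumes each record can be identified by a `type` and a `reference` field; both field names, and the sample data, are hypothetical and only serve the illustration.

```python
def record_key(record):
    """Identify a record by its activity type and reference (assumed field names)."""
    return (record["type"], record["reference"])


def missing_records(old_records, new_records):
    """Return the records present in the old dump but absent from the new scrape."""
    seen = {record_key(r) for r in new_records}
    return [r for r in old_records if record_key(r) not in seen]


# Made-up sample data: the motion exists in the old dump only.
old_dump = [
    {"type": "REPORT", "reference": "A-0001", "title": "First example"},
    {"type": "MOTION", "reference": "B-0002", "title": "Second example"},
]
new_scrape = [
    {"type": "REPORT", "reference": "A-0001", "title": "First example"},
]

for record in missing_records(old_dump, new_scrape):
    print(record["type"], record["reference"])
# → MOTION B-0002
```

A list produced this way is also a convenient basis for a report to whoever maintains the source data.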

Parltrack Jones & the Raiders of the Lost Data...

Some of these missing snippets have been reported to the EP and, after an improbable bureaucratic back-and-forth between various parties until our message reached the ["competent service"][comp] (which tells a lot about the other services...), it seems these missing entries have been manually fixed...

The old JSON stream seemed to be a dump of a generalized database table in which certain fields were only filled for certain types of activities. This means that there were a lot of fields in the dump that were empty, and yet explicitly listed. In Parltrack v2 these empty fields are no longer stored, which reduced the dump of the complete MEP table by about 20%.
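Dropping empty fields could look roughly like the sketch below. It assumes that "empty" means `None`, an empty string, an empty list or an empty dict; the helper `prune_empty` and the sample record are hypothetical, and the size saving on this toy record says nothing about the roughly 20% measured on the real dump.

```python
import json

# Values treated as "empty" in this sketch (an assumption, not Parltrack's rule).
EMPTY = (None, "", [], {})


def prune_empty(value):
    """Recursively drop dict entries whose pruned value is None or empty."""
    if isinstance(value, dict):
        pruned = {k: prune_empty(v) for k, v in value.items()}
        return {k: v for k, v in pruned.items() if v not in EMPTY}
    if isinstance(value, list):
        # List items are pruned internally but never removed from the list.
        return [prune_empty(v) for v in value]
    return value


# Made-up record with several explicitly listed but empty fields.
record = {
    "type": "QUESTION",
    "title": "Example written question",
    "rapporteur": None,
    "committee": "",
    "amendments": [],
    "meta": {"source": "", "lang": "en"},
}

slim = prune_empty(record)
print(slim)

before = len(json.dumps(record))
after = len(json.dumps(slim))
print(f"{before} -> {after} characters of JSON")
```

Values such as `0` or `False` are kept on purpose, since they carry information even though they are falsy in Python.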

Let's hope the data format of the Parliament will not change again before we finish upgrading Parltrack to v2.0...

"Parltrack 2.0 recycles the huge mess of inconsistent, polluted data from the European Parliament, turning them into fresh organized information available for everyone to browse, reuse and remix. We wish it enables to collectively monitor the activity of the elected representatives/employees of the EU in order to watch them and hold them accountable." declares Theodor Zoidberg, vice-director in charge of the scraping for Parltrack 2.0.

[comp] Once reached, the "competent service" told us that curl, the venerable Unix data transfer tool that was used to diagnose their missing data and help them fix their error, was a "quote(?) non-standard tool"... sic.