Parltrack blog - Response to EP Inquiry

A few days ago we got an inquiry from Mr Petrov who works for the EP "Directorate-General for Innovation and Technological Support, Data Management, Document Production Unit". We welcome the EP's interest in this topic, and are happy to answer these questions to the best of our knowledge. Please see our answer below:

Subject: Re: Open Data project at the European Parliament

Dear Mr Petrov,

Thank you very much for your inquiry.

On Thu, Aug 20, 2020 at 02:32:40PM +0000, PETROV G. wrote:

A project about Open Data at the European Parliament has recently been launched. As you are one of the users of our currently available public data, I would like to ask you if you could tell me more about your experience with our data:

What is the source of your EP data? (Do you use our open data sets, https://data.europa.eu/euodp/en/data/publisher/ep, which we publish on EU Open Data Portal?)

No, Parltrack does not use data published on the EU Open Data Portal.

For MEPs we scrape the HTML from the europarl website and various extra PDF files published there for the declarations, and more HTML for the parliamentary activities and meetings.
For dossiers we scrape HTML from OEIL.
For committee votes we look at the committee minutes/vote results listed in the dossiers, then we scrape the HTML version of these.
For plenary votes, we look for the dates of plenary votes on dossiers and then construct URLs with these templates:

      http://www.europarl.europa.eu/RegData/seance_pleniere/proces_verbal/%s/votes_nominaux/xml/P%s_PV%s(RCV)_XC.xml
      http://www.europarl.europa.eu/RegData/seance_pleniere/proces_verbal/%s/liste_presence/P%s_PV%s(RCV)_XC.xml

and process these XML documents.
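As a rough illustration of this step, the Python sketch below fills both templates and parses a fetched document. The meaning of the three %s slots is not spelled out in this mail, so the parameter names (a date path, the parliamentary term and a date label) and the example values are assumptions for illustration only. The parser makes no claim about the real XML structure; it only reports the root tag and the number of direct children.

```python
import xml.etree.ElementTree as ET

BASE = "http://www.europarl.europa.eu/RegData/seance_pleniere/proces_verbal/"
VOTES_TEMPLATE = BASE + "%s/votes_nominaux/xml/P%s_PV%s(RCV)_XC.xml"
PRESENCE_TEMPLATE = BASE + "%s/liste_presence/P%s_PV%s(RCV)_XC.xml"


def plenary_urls(date_path, term, date_label):
    """Fill both templates with the same three values.

    ASSUMPTION: the slot meanings (date path, term, date label) are guesses
    made for this sketch; the templates themselves are quoted from the mail.
    """
    values = (date_path, term, date_label)
    return VOTES_TEMPLATE % values, PRESENCE_TEMPLATE % values


def summarize(xml_bytes):
    """Parse an XML document and return (root tag, number of direct children)."""
    root = ET.fromstring(xml_bytes)
    return root.tag, len(root)


# Hypothetical values, only to show how the templates expand.
votes_url, presence_url = plenary_urls("2019/10-23", 9, "(2019)10-23")
print(votes_url)
print(presence_url)
```

Both URLs take the same three values, so a single date and term found on a dossier is enough to derive the vote list and the attendance list together.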

For amendments we directly scrape the PDF documents, e.g.

      http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//NONSGML+COMPARL+PE-609.623+01+DOC+PDF+V0//EN&language=EN

which is quite an error-prone process, because the format itself is difficult to parse. It would be much better if AT4AM (if it is still used) could export the amendments in machine-processable formats.

We rely on the committee document search to find the documents we are interested in. This is a sample template URL for finding amendment documents (AMCO):

      https://www.europarl.europa.eu/committees/en/documents/search?committeeMnemoCode=%s&textualSearchMode=TITLE& \
                 textualSearch=&documentTypeCode=AMCO&reporterPersId=&procedureYear=&procedureNum=&procedureCodeType=& \
                 peNumber=&aNumber=&aNumberYear=&documentDateFrom=&documentDateTo=&meetingDateFrom=&meetingDateTo=& \
                 performSearch=true&term=%s&page=%s&pageSize={}
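The template above mixes %s slots with a {} slot for pageSize. One way to avoid juggling two formatting styles is to build the query string from a dict. The sketch below does that with the same parameters in the same order; the function name and the example values (LIBE, term 9, page 0, page size 50) are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.europarl.europa.eu/committees/en/documents/search"


def amendment_search_url(committee, term, page, page_size):
    """Build the AMCO search URL with the parameters listed in the template."""
    params = {
        "committeeMnemoCode": committee,
        "textualSearchMode": "TITLE",
        "textualSearch": "",
        "documentTypeCode": "AMCO",  # amendment documents
        "reporterPersId": "",
        "procedureYear": "",
        "procedureNum": "",
        "procedureCodeType": "",
        "peNumber": "",
        "aNumber": "",
        "aNumberYear": "",
        "documentDateFrom": "",
        "documentDateTo": "",
        "meetingDateFrom": "",
        "meetingDateTo": "",
        "performSearch": "true",
        "term": term,
        "page": page,
        "pageSize": page_size,
    }
    # urlencode keeps the dict order and also escapes unsafe characters.
    return SEARCH_URL + "?" + urlencode(params)


# Hypothetical values, only to show the resulting URL.
print(amendment_search_url("LIBE", 9, 0, 50))
```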

For committee agendas we also rely on the committee document search form as shown in the template URL above, but we scrape the HTML of pages like this:

      http://www.europarl.europa.eu/sides/getDoc.do?type=COMPARL&reference=LIBE-OJ-20120112-1&language=EN

What data do you use?

This question overlaps with the previous one; please see our answers above.

What do you like about the current set-up?

Uhm. The least worst is the XML data for the plenary votes. But even plenary votes sadly lack the MEP id of the MEPs casting the vote. There seems to be an id in there, but it does not correspond to the id on the europarl website listing the MEPs, and this causes a lot of headaches.

What would you like to see change to make your life easier?

We promise, if all these things below get implemented, we will stop working on Parltrack, because then it will cease having a purpose in this world:

Publish in adherence to "The 8 Principles of Open Government Data" and the 7 additional principles also specified there.

Follow these data publishing practices:

Publish daily updates as separate downloads (so that incremental updates are possible).
Publish daily complete dumps of the data sources listed below.
Publish in JSON (machine readable, lightweight to process). Do not get fancy and do not do LDJSON; plain JSON is more than enough. For texts maybe consider semantically marked-up HTML5 or XML. Akoma Ntoso is a horrible UN standard; if possible, avoid it.
Provide HTTP URLs for download, but also BitTorrent files to conserve bandwidth of the europarl servers.
Prove authenticity by signing all datasets cryptographically (preferably using an ed25519 key).
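To show why separate daily updates next to complete dumps would help consumers, here is a minimal Python sketch of applying one daily update to a full dump. The record layout is entirely hypothetical: it assumes every record is a JSON object with an "id" field and that a removed record is announced with "deleted": true. No such format exists yet.

```python
import json


def apply_daily_update(dump, update):
    """Merge one daily update into a full dump, both lists of records with an id."""
    merged = {record["id"]: record for record in dump}
    for record in update:
        if record.get("deleted"):
            # Hypothetical deletion marker: drop the record if we have it.
            merged.pop(record["id"], None)
        else:
            # New or changed record: the update always wins.
            merged[record["id"]] = record
    return sorted(merged.values(), key=lambda record: record["id"])


# Hypothetical sample data, standing in for a full dump and one daily update.
dump = json.loads('[{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]')
update = json.loads(
    '[{"id": 2, "name": "B2"}, {"id": 3, "name": "C"}, {"id": 1, "deleted": true}]'
)
print(apply_daily_update(dump, update))
```

With plain JSON like this, a consumer who misses a few days can either replay the missed updates in order or simply start again from the latest complete dump.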

Publish at least the following data sets:

All the MEP data currently published under https://www.europarl.europa.eu/meps/en/ - especially:
- membership and roles in parliamentary committees/delegations/groups/conferences
- CVs of the MEP
- main and other parliamentary activities and all related content like questions, interpellations, speeches, etc., clearly linked with dossiers on OEIL if applicable
- declarations (not only in PDF, but in a machine-parsable format!) of good conduct/participation/financial interests
- meetings with representatives of interests
- history of parliamentary service
- assistants - not only names, but also CVs, employment/service contracts and most importantly meetings with representatives of interests!
All the data published on OEIL
All committee agendas as soon as they are available internally, marked with updates
All the roll-call votes in plenary and committees as soon as possible after the vote, plus attendance of committee and plenary sessions.
All amendments in simple semantically marked-up XML

All these datasets should link to OEIL dossiers when applicable, and provide MEP ids as used on the europarl website.

What data sources are missing and you'd like us to publish as open data?

Most importantly committee agendas, amendments, MEP assistants' CVs, work/employment/service contracts and their meetings with representatives of interests (with unambiguous ids from the lobby register).

Also uniform committee RCV results; it seems different committees publish this information in different formats.

A few years ago we also had the gender of the MEPs (in the French version, where the place of birth was given using née/né). Can we please have explicit genders back? Currently we have volunteers updating the gender manually in our database (shout out to them, big thanks X!)

Any other points that you'd like to raise with us?

Please do not change URLs; if you do, provide redirects from the old URLs to the new ones.
If the design of the website changes (and the data is not yet published in a machine-readable format, only available in HTML), please use semantic markup like class and id attributes to make the processing easier.
Do not remove data that was available before (see e.g. the gender issue).
Allow all HTTP POST requests to also be available via GET requests, if these requests do not change state in your DBs, especially in search forms.
Provide ETag values for caching of any content provided via HTTP.
Use standardized formats for dates (ISO 8601), names (of MEPs, groups, committees, etc.) and dossiers everywhere; perhaps also run a spellchecker on names before publishing.
If you publish a list of things, like names of MEPs, mark them up semantically, not only by separating them with commas.
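To illustrate the ETag point: with ETag values, a scraper can ask the server to send a page only if it has changed since the last visit. The Python sketch below uses only the standard library; the function names are hypothetical, and it assumes the server honours If-None-Match and answers 304 Not Modified for unchanged content.

```python
import urllib.error
import urllib.request


def conditional_request(url, etag=None):
    """Build a GET request that lets the server skip an unchanged body."""
    request = urllib.request.Request(url)
    if etag:
        # Send back the ETag stored from the previous download.
        request.add_header("If-None-Match", etag)
    return request


def fetch_if_changed(url, etag=None):
    """Return (body, etag); body is None when the server answers 304."""
    try:
        with urllib.request.urlopen(conditional_request(url, etag)) as response:
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as error:
        if error.code == 304:
            # Not modified: keep the cached copy and the old ETag.
            return None, etag
        raise
```

A scraper would store the returned ETag next to each cached document and pass it in on the next run, so unchanged pages cost one small request instead of a full download.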

Best regards,

Stefan Marsiske

Parltrack

"We have no illusions, these things will not happen overnight (possibly never), but even if just a few of these things improve, that's already a benefit! And to be honest, among the three European legislative institutions the EP is by far the most open and transparent, and by having this dialogue they are distancing themselves even further from the mediocre openness of the Commission and the abysmal and undemocratic secrecy of the Council. We welcome any dialogue with the EP in this regard; maybe one day we can make our dream come true and can stop working on Parltrack," remarks Bartholomew Sharp, Vice-President for Sharpening Sticks, Stones and other Scraping Devices.