The Greens/EFA in the European Parliament generously commissioned parltrack.eu to produce a scraper for the Committee Roll-Call Votes (RCV), which have been mandatory to publish for a couple of years. Parltrack had previously attempted this on its own, but failed due to the diversity that unites these documents. In the last five years things have only gotten worse.
Summary
There are ~1000 committee vote PDFs available. We were unable to extract any kind of data from 24 PDFs.
The data we are looking for in PDFs are the vote results, the dossier ID, the vote type and the vote date.
The parsed PDFs contain ~15000 tables, from which we were able to extract 13975 vote tables, but only 1722 of these (12.3%) have no missing metadata. We were unable to detect at least two fields 4289 times (30%) - these are mostly amendment votes or votes about topics independent from dossiers (e.g. BUDG).
All the committees publish these documents in differently structured PDFs, and there are various formats even within a single committee. With 31 committees, some of them using two to five different formats, only 10-15 PDFs share a similar structure on average.
Also, because these documents are created manually, there are numerous typos and wrongly worded dossier titles, and the documents have a huge variety of vote types.
Due to the lack of standards, it isn't guaranteed that a PDF contains any of the required metadata, which in quite a few cases makes it impossible to detect whether the data is wrong or completely missing. (EMPL still publishes scanned, badly OCRed PDFs.)
There are 110 cases where the vote total displayed in the vote table is less than the number of voters listed.
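A check of this kind could look roughly like the sketch below. It is not the scraper's actual code: the `VoteTable` structure, its field names and the sample voters are hypothetical and only illustrate comparing the printed total against the names parsed from the table.

```python
from dataclasses import dataclass, field

@dataclass
class VoteTable:
    """One parsed result block of a roll-call vote (hypothetical structure)."""
    label: str         # e.g. '+', '-' or '0'
    stated_total: int  # the number printed in the table
    voters: list[str] = field(default_factory=list)  # names parsed from the rows

def total_mismatch(table: VoteTable) -> int:
    """Listed voters minus stated total; 0 means the table is consistent."""
    return len(table.voters) - table.stated_total

# Placeholder data, not taken from any real PDF.
tables = [
    VoteTable("+", 3, ["A", "B", "C"]),
    VoteTable("-", 1, ["D", "E"]),  # stated total smaller than names listed
]

# Keep only the tables whose printed total is lower than the voters found.
suspicious = [t for t in tables if total_mismatch(t) > 0]
```

A positive mismatch does not say which side is wrong: the printed total may be a typo, or the parser may have split one name into two.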
Vote Date variations
Vote dates can appear in any random position. Sometimes they are on the EP web page where all the votes of the committee are listed, sometimes they are in the title of the link, sometimes there is a header above the link containing the date, and sometimes there is a header with the year and a subheader with the day and month. The following link shows a good example of this problem: https://www.europarl.europa.eu/committees/en/econ/meetings/votes
Sometimes dates are written with numbers, sometimes with month names, sometimes separated with dots, with slashes, with dashes, and possibly other eldritch runes.
Sometimes committees have 2 day meetings and votes are given date ranges, like 21-22.12.2023 or even "Tuesday 21st, and Wednesday 22nd March 2020".
Often dates are also on the cover page of the PDF containing the votes, but as often as not there are no dates at all in the PDF itself. Sometimes the date is in the filename of the PDF, in all the variations described above.
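To give an idea of what handling these variations involves, here is a minimal sketch that recognises three of the patterns mentioned above: year-first dates, day-first numeric dates with dot, slash or dash separators, and a day followed by an English month name, each optionally as a day range. The function names and the day-first reading of numeric dates are assumptions made for this illustration; two-digit years and prose such as "Tuesday 21st, and Wednesday 22nd" are deliberately not covered.

```python
import re
from datetime import date

MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Year first: 2023-11-29, 2023.03.01
YEAR_FIRST = re.compile(r"(?<!\d)(\d{4})[./-](\d{1,2})[./-](\d{1,2})(?!\d)")
# Day (or day range) first, assumed European order: 21-22.12.2023, 4/12/2023
NUMERIC = re.compile(r"(?<!\d)(\d{1,2})(?:-(\d{1,2}))?[./-](\d{1,2})[./-](\d{4})(?!\d)")
# Day (or day range), month name, year: 8 October 2019, 21-22 January 2019
NAMED = re.compile(r"(?<!\d)(\d{1,2})(?:-(\d{1,2}))?\s+([A-Za-z]+)\s+(\d{4})(?!\d)")

def _build(year, month, first_day, last_day=None):
    """Build one date per day of the range; return [] for impossible values."""
    last_day = last_day or first_day
    try:
        return [date(year, month, d) for d in range(first_day, last_day + 1)]
    except ValueError:
        return []

def parse_dates(text: str) -> list[date]:
    """Return the dates of the first recognised pattern in text, else []."""
    m = YEAR_FIRST.search(text)
    if m:
        y, mo, d = map(int, m.groups())
        return _build(y, mo, d)
    m = NUMERIC.search(text)
    if m:
        d1, d2, mo, y = m.groups()
        return _build(int(y), int(mo), int(d1), int(d2) if d2 else None)
    m = NAMED.search(text)
    if m and m.group(3).lower() in MONTHS:
        d1, d2, name, y = m.groups()
        return _build(int(y), MONTHS[name.lower()], int(d1), int(d2) if d2 else None)
    return []
```

An empty result is as informative as a match here: a filename like "RCVs.pdf" yields nothing, so the date has to come from the web page or the cover page instead.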
These issues do not make it impossible to guess a date, but they do significantly impact the accuracy of the extracted data and our confidence in its correctness.
Considering that we spent 3 FTE months researching this topic, it seems more economical to exert internal and external pressure on the European Parliament to publish high quality structured data, with unambiguous identifiers linking to committee meetings, amendments, MEPs and dossiers, to use ISO 8601 dates, and to standardize the vote types themselves.
Dossier ID extraction issues
Extracting dossiers is as fuzzy as it can get. Very often the titles of the dossiers are radically shortened, to the point of referring to the dossier only by an acronym (e.g. "ERTMS deployment", "Roll-call: CSRD – 57 Part 2", "Roll-call: FFV Vitoria-Madrid, Spain"). Often only parts of the full dossier title are used, for example
"Introduction of European Social Security Pass"
should be:
"Resolution on the introduction of a European social security pass for improving the digital enforcement of social security rights and fair mobility"
and
"Vocational Education and Training"
should be:
"Resolution on the Council Recommendation on vocational education and training (VET) for sustainable competitiveness, social fairness and resilience"
These can be matched by a human doing searches and look-ups, but automating this is difficult. Sometimes the dossier title is flanked by the name of the rapporteur (again, often mangled by typos). There are also a lot of votes that are not related to dossiers, like votes on oral questions to the Commission - this really depends on the vote type, which is very difficult to match even moderately reliably. In such cases it is impossible for us to distinguish between a dossier that we failed to identify in a mangled title and a vote that is not related to a dossier at all.
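One simple way to approximate the human look-up is to score how many words of the shortened title also occur in a candidate full title. The sketch below does that with the two examples above. It is an illustration only, not the matcher parltrack uses; the function names and the 0.8 threshold are assumptions.

```python
import re

def tokens(title: str) -> set[str]:
    """Lower-case word tokens; punctuation and case differences are ignored."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def coverage(short_title: str, full_title: str) -> float:
    """Fraction of the short title's tokens that also occur in the full title."""
    short = tokens(short_title)
    if not short:
        return 0.0
    return len(short & tokens(full_title)) / len(short)

def best_match(short_title, candidates, threshold=0.8):
    """Return the best-scoring candidate, or None if none reaches threshold."""
    scored = [(coverage(short_title, c), c) for c in candidates]
    score, title = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return title if score >= threshold else None

FULL_TITLES = [
    "Resolution on the introduction of a European social security pass for "
    "improving the digital enforcement of social security rights and fair mobility",
    "Resolution on the Council Recommendation on vocational education and "
    "training (VET) for sustainable competitiveness, social fairness and resilience",
]

pass_match = best_match("Introduction of European Social Security Pass", FULL_TITLES)
vet_match = best_match("Vocational Education and Training", FULL_TITLES)
acronym_match = best_match("ERTMS deployment", FULL_TITLES)  # no shared words
```

Word overlap handles truncated titles but not typos or acronyms: "ERTMS deployment" shares no word with either candidate, so it stays unmatched, and the sketch cannot tell whether that is a missing dossier or a vote without one.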
Correlating the vote PDFs with the committee agendas often helps us narrow down the list of possible dossiers, but that is not a surefire way either. There are committees that publish each vote of a meeting in a separate PDF, which makes it difficult to track whether all the votes from the committee agenda are covered in the PDFs.
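The coverage question can be phrased as a set difference between the dossiers on the agenda and those matched in any PDF of the same meeting. A minimal sketch follows; the function name, the dossier identifiers and the filenames are placeholders.

```python
def uncovered_agenda_items(agenda_dossiers, pdf_dossiers_by_file):
    """Return agenda dossiers for which no vote PDF of the meeting was matched."""
    seen = set()
    for dossiers in pdf_dossiers_by_file.values():
        seen.update(dossiers)
    return sorted(set(agenda_dossiers) - seen)

# Placeholder identifiers and filenames, not real dossier references.
agenda = ["DOSSIER-A", "DOSSIER-B", "DOSSIER-C"]
pdfs = {
    "vote-1.pdf": ["DOSSIER-A"],
    "vote-2.pdf": ["DOSSIER-C"],
}
missing = uncovered_agenda_items(agenda, pdfs)
```

A non-empty result only narrows things down: the PDF may never have been published, or the dossier may be present under a title that could not be matched.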
Statistics
Missing fields - group by committees
Unsure fields - group by committees
Examples
Some of the strongly different document structures
IMCO
https://www.europarl.europa.eu/cmsdata/187620/roll%20call%20votes%20IMCO%208%20October%202019-original.pdf vs https://www.europarl.europa.eu/cmsdata/260598/IMCO%20voting%20session%20-%2012%20December%202022.pdf vs https://www.europarl.europa.eu/cmsdata/215170/FINAL%20IMCO%20Voting%20Session%209%20Nov%202020%20(final%20votes%20with%20correction).pdf
EMPL
By far our favorites: https://www.europarl.europa.eu/cmsdata/277455/RCVs.pdf and https://www.europarl.europa.eu/cmsdata/264219/RCVs_Tuesday%2024%20January%202023.pdf
AGRI
https://www.europarl.europa.eu/cmsdata/277455/RCVs.pdf vs https://www.europarl.europa.eu/cmsdata/274917/Voting%20Session%2019%20September_General%20budget%20of%20the%20European%20Union%20for%20the%20financial%20year%202024.pdf
ENVI
https://www.europarl.europa.eu/cmsdata/278959/Vote%20results_ENVI_LIBE_28%20November.pdf vs https://www.europarl.europa.eu/cmsdata/279178/2023-11-29%20votes%20and%20roll-call%20votes.pdf
INTA
https://www.europarl.europa.eu/cmsdata/280385/FInal%20votes%20by%20Roll-call%20votes%20AFET-INTA%2024%20January.pdf vs https://www.europarl.europa.eu/cmsdata/270652/ROLL%20CALL_25%20May%202023.pdf
CONT
https://www.europarl.europa.eu/cmsdata/246634/CONT%20Votes%2031%20March%202022%20(final%20vote).pdf vs https://www.europarl.europa.eu/cmsdata/267335/Roll%20call%20votes_Discharge%202021_EC_Amendments-22%20March%202023.pdf vs https://www.europarl.europa.eu/cmsdata/215260/Vote%20of%2012%20November%202020.pdf vs https://www.europarl.europa.eu/cmsdata/244953/CONT%20Vote%2010%20February%202022%20(final%20votes).pdf
IMCO
https://www.europarl.europa.eu/cmsdata/231415/IMCO%20Voting%20Session%2017%20March%202021%20(final%20votes).pdf vs https://www.europarl.europa.eu/cmsdata/279298/RCV%204.12.23_for%20publishing.pdf vs https://www.europarl.europa.eu/cmsdata/250728/Final%20Vote.pdf vs https://www.europarl.europa.eu/cmsdata/187620/roll%20call%20votes%20IMCO%208%20October%202019-original.pdf
JURI
https://www.europarl.europa.eu/cmsdata/267680/2023.03.01_RCV_JURI-LIBE_EN.pdf vs https://www.europarl.europa.eu/cmsdata/280315/2024.01.24_RCV_EN.pdf
BUDG
https://www.europarl.europa.eu/cmsdata/278191/Results%20of%20the%20voting%20session.pdf vs https://www.europarl.europa.eu/cmsdata/280329/Results%20of%20the%20voting%20session%20BUDG-CONT.pdf vs https://www.europarl.europa.eu/cmsdata/280265/Results%20of%20votes.pdf
PETI
https://www.europarl.europa.eu/cmsdata/280216/PETI%20OQ%20working%20conditions%20teachers%20in%20EU-RCV.pdf vs https://www.europarl.europa.eu/cmsdata/159885/roll%20call%20votes%2021-22%20January%202019.pdf
Some of the inconsistencies within a single document
Various vote types: https://www.europarl.europa.eu/cmsdata/238003/Results%20of%20roll%20call%20vote%2012%20July.pdf (PECH)
Mixed order of dossier ID and rapporteur: https://www.europarl.europa.eu/cmsdata/279062/Results%20of%20roll%20call%20vote%2029%20November%202023.pdf (PECH)
Rapporteurs sometimes have their group after their name, sometimes their role in the committee: https://www.europarl.europa.eu/cmsdata/279298/RCV%204.12.23_for%20publishing.pdf (IMCO)
Vote type variations
All the vote types below are unique, even those that look the same.
Opinion votes:
Final votes: