Scraping Committee Roll-Call Votes 2024

nope badger

The Greens/EFA in the European Parliament generously commissioned parltrack.eu to produce a scraper for the Committee Roll-Call Votes (RCV), whose publication has been mandatory for a couple of years. Previously parltrack attempted to do so on its own, but failed due to the diversity that unites these documents. In the last five years things have only gotten worse.

Summary

There are ~1000 committee vote PDFs available. We were unable to extract any kind of data from 24 PDFs.

The data we are looking for in PDFs are the vote results, the dossier ID, the vote type and the vote date.

The parsed PDFs contain ~15000 tables, from which we were able to extract 13975 vote tables, but only 1722 of these (12.3%) have no missing metadata. In 4289 cases (30%) we were unable to detect at least two fields; these are mostly amendment votes or votes on topics independent of dossiers (e.g. BUDG).

All the committees publish these documents as differently structured PDFs, and there are various formats even within a single committee. With 31 committees, some of them using 2 to 5 different formats, only 10-15 PDFs share a similar structure on average.

Also, because these documents are created manually, there are numerous typos and wrongly worded dossier titles, and the documents have a huge variety of vote types.

Due to the lack of standards, it isn't guaranteed that a PDF contains any of the required metadata, which in quite a few cases makes it impossible to detect whether the data is wrong or completely missing. (EMPL still publishes scanned, badly OCRed PDFs.)

There are 110 cases where the total displayed in the vote table is less than the number of voters listed.
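A check of this kind could look roughly like the following. This is a minimal sketch, not the scraper's actual code: the `VoteTable` container, its field names and the `undercounted` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VoteTable:
    # Hypothetical container; the field names are illustrative only.
    declared_totals: dict[str, int]  # printed totals, e.g. {"+": 20, "-": 3, "0": 1}
    voters: dict[str, list[str]] = field(default_factory=dict)  # names listed per option


def undercounted(table: VoteTable) -> list[str]:
    """Return the vote options whose printed total is lower than the number of listed voters."""
    return [
        option
        for option, total in table.declared_totals.items()
        if total < len(table.voters.get(option, []))
    ]


table = VoteTable(
    declared_totals={"+": 2, "-": 1},
    voters={"+": ["A", "B", "C"], "-": ["D"]},
)
print(undercounted(table))
```

A table flagged this way cannot be repaired automatically, since there is no telling whether the printed total or the list of names is the wrong one.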

Vote Date variations

Vote dates can appear in any random position: sometimes they are in the EP web page where all the votes of the committee are listed, sometimes they are in the title of the link, sometimes there is a header above the link containing the date, sometimes there is a header with the year and a subheader with the day and month. The following link shows a good example of this problem: https://www.europarl.europa.eu/committees/en/econ/meetings/votes

Sometimes dates are written with numbers, sometimes with month names, sometimes separated with dots, with slashes, with dashes, and possibly other eldritch runes.

Sometimes committees have 2 day meetings and votes are given date ranges, like 21-22.12.2023 or even "Tuesday 21st, and Wednesday 22nd March 2020".

Often dates are also on the cover page of the PDF containing the votes, but as often as not, there are no dates at all in the PDF itself. Sometimes the date is in the filename of the PDF, in all its variations described above.
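To give an idea of what guessing a date involves, here is a minimal sketch that covers only some of the variations described above: numeric dates with dots, slashes or dashes, month names, and simple day ranges. The function name `guess_dates`, the day-first reading of ambiguous numeric dates, and mapping two-digit years to 20xx are assumptions made for illustration. Wordier forms such as "Tuesday 21st, and Wednesday 22nd March 2020", or dates without a year, are deliberately left unhandled.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# 2023-11-29, 2024.01.24
YMD = re.compile(r"(?<!\d)(\d{4})[-./](\d{1,2})[-./](\d{1,2})(?!\d)")
# 29-11-2023
DMY_DASH = re.compile(r"(?<!\d)(\d{1,2})-(\d{1,2})-(\d{4})(?!\d)")
# 21-22.12.2023, 4.12.23, 24/01/2024 (assumed to be day first)
DMY = re.compile(r"(?<!\d)(\d{1,2})(?:\s*-\s*(\d{1,2}))?[./](\d{1,2})[./](\d{4}|\d{2})(?!\d)")
# 12 December 2022, 9 Nov 2020, 21-22 January 2019
DMONY = re.compile(r"(?<!\d)(\d{1,2})(?:\s*-\s*(\d{1,2}))?\s+([A-Za-z]{3,9})\.?\s+(\d{4})(?!\d)")


def _days(first, last):
    """Expand a single day or an inclusive day range into a list of ints."""
    return [int(first)] if last is None else list(range(int(first), int(last) + 1))


def _candidates(text):
    """Yield (year, month, days) guesses, least ambiguous patterns first."""
    for m in YMD.finditer(text):
        yield int(m[1]), int(m[2]), [int(m[3])]
    for m in DMY_DASH.finditer(text):
        yield int(m[3]), int(m[2]), [int(m[1])]
    for m in DMY.finditer(text):
        year = int(m[4])
        yield (year + 2000 if year < 100 else year), int(m[3]), _days(m[1], m[2])
    for m in DMONY.finditer(text):
        month = MONTHS.get(m[3][:3].lower())
        if month:
            yield int(m[4]), month, _days(m[1], m[2])


def guess_dates(text: str) -> list[date]:
    """Return the first plausible date (or run of dates) found in text, else an empty list."""
    for year, month, days in _candidates(text):
        try:
            dates = [date(year, month, day) for day in days]
        except ValueError:  # e.g. month 13 or day 31 in a 30-day month
            continue
        if dates:
            return dates
    return []


print(guess_dates("21-22.12.2023"))
print(guess_dates("RCV 4.12.23_for publishing.pdf"))
```

Even when such a function returns something, the result is only a guess: a string like 4.12.23 parses equally well as 4 December or 12 April.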

These issues do not make it impossible to guess a date, but they do significantly reduce the accuracy of the extracted data and our confidence in its correctness.

After 3 FTE months of research into this topic, it seems more economical to exert internal and external pressure on the European Parliament to publish high-quality structured data with unambiguous identifiers linking to committee meetings, amendments, MEPs and dossiers, to use ISO 8601 dates, and to standardize the vote types themselves.

Dossier ID extraction issues

Extracting dossiers is as fuzzy as it can get. Very often the titles of the dossiers are radically shortened, to the point of referring to the dossier only by an acronym (e.g. "ERTMS deployment", "Roll-call: CSRD – 57 Part 2", "Roll-call: FFV Vitoria-Madrid, Spain"). Often only parts of the full dossier title are used, for example

"Introduction of European Social Security Pass"

should be:

"Resolution on the introduction of a European social security pass for improving the digital enforcement of social security rights and fair mobility"

and

"Vocational Education and Training" 

should be:

"Resolution on the Council Recommendation on vocational education and training (VET) for sustainable competitiveness, social fairness and resilience"

These can be matched by a human by doing searches and look-ups, but automating this is difficult. Sometimes the dossier title is flanked by the name of the rapporteur (again, often mangled by typos). There are also many votes that are not related to dossiers, like votes on oral questions to the Commission. Whether a dossier is involved depends on the vote type, which is itself very difficult to match even moderately reliably. Where no dossier is found, it is impossible for us to distinguish between the case where we were unable to identify a dossier in a mangled title and a vote that is not related to a dossier at all.

Correlating the vote PDFs with the committee agendas often helps us narrow down the list of possible dossiers, but that is not a surefire method either. Some committees publish each vote of a meeting in a separate PDF, which makes it difficult to track whether all the votes from the committee agenda are covered by the PDFs.
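As an illustration of matching a shortened title against a list of candidates taken from an agenda, here is a minimal sketch based on token overlap. The `best_match` helper, the stopword list and the 0.8 threshold are assumptions chosen for the example. It does not tolerate typos, and acronym-only titles such as "ERTMS deployment" still match nothing.

```python
import re

# Small, hand-picked stopword list (an assumption, not exhaustive).
STOPWORDS = {"a", "an", "the", "of", "on", "for", "and", "in", "to"}


def tokens(title: str) -> set[str]:
    """Lowercased words of a title, without stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if w not in STOPWORDS}


def best_match(short_title: str, candidates: list[str], threshold: float = 0.8):
    """Return (candidate, score) for the candidate covering most tokens of short_title, or None."""
    wanted = tokens(short_title)
    if not wanted:
        return None
    scored = [(len(wanted & tokens(c)) / len(wanted), c) for c in candidates]
    score, title = max(scored, default=(0.0, None))
    return (title, score) if score >= threshold else None


agenda = [
    "Resolution on the introduction of a European social security pass for improving "
    "the digital enforcement of social security rights and fair mobility",
    "Resolution on the Council Recommendation on vocational education and training (VET) "
    "for sustainable competitiveness, social fairness and resilience",
]
print(best_match("Introduction of European Social Security Pass", agenda))
print(best_match("ERTMS deployment", agenda))
```

The score is the share of the short title's words found in the candidate, so it only works when the agenda already contains the right dossier. A `None` result still cannot tell an unrecognized title from a vote that has no dossier.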

Statistics

Missing fields - group by committees

COMMITTEE  TOTAL_VOTES  ERROR_VOTE_% TOTAL_ERRORS

EMPL           336      100.00       412
LIBE            52      100.00       53
AFET            27      3.70         1
SEDE             2      100.00       2
TRAN           987      100.00       1104
AGRI            34      64.71        24
INTA            30      30.00        12
ITRE             6      100.00       8
IMCO           339      45.43        166
BUDG            45      100.00       65
CONT          2423      100.00       3623
PECH           832      0.24         2
JURI          1285      100.00       2084
PETI           899      100.00       2437
FEMM           198      100.00       206
DROI             3      0.00         0
ECON          1448      100.00       1560
ENVI          3620      86.38        3476
DEVE          1278      97.10        1995
REGI            87      4.60         4
AFCO            21      57.14        12
CULT            23      0.00         0
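One plausible reading of these columns is that TOTAL_VOTES counts the extracted vote tables, ERROR_VOTE_% is the share of votes with at least one missing field, and TOTAL_ERRORS is the number of missing fields summed over all votes. Under that assumption, such a table could be computed roughly as follows. The record layout and the field names in `REQUIRED` are hypothetical.

```python
from collections import defaultdict

# Hypothetical metadata field names; a field counts as missing when absent or empty.
REQUIRED = ("dossier_id", "vote_type", "date")


def missing_by_committee(votes):
    """Map committee -> (total votes, % of votes with a missing field, total missing fields)."""
    stats = defaultdict(lambda: {"total": 0, "bad": 0, "errors": 0})
    for vote in votes:
        entry = stats[vote["committee"]]
        missing = sum(1 for name in REQUIRED if not vote.get(name))
        entry["total"] += 1
        entry["bad"] += missing > 0  # bool counts as 0 or 1
        entry["errors"] += missing
    return {
        committee: (e["total"], round(100 * e["bad"] / e["total"], 2), e["errors"])
        for committee, e in stats.items()
    }


complete = {"committee": "AFET", "dossier_id": "x", "vote_type": "final", "date": "2024-01-24"}
incomplete = {"committee": "AFET", "dossier_id": None, "vote_type": "final", "date": "2024-01-24"}
print(missing_by_committee([complete] * 26 + [incomplete]))
```

This reading is consistent with rows where TOTAL_ERRORS exceeds TOTAL_VOTES, since a single vote can miss several fields at once.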

Unsure fields - group by committees

COMMITTEE  TOTAL_VOTES  ERROR_VOTE_% TOTAL_ERRORS

EMPL           336      0.60         2
LIBE            52      40.38        21
AFET            27      22.22        10
SEDE             2      0.00         0
TRAN           987      0.61         6
AGRI            34      26.47        9
INTA            30      0.00         0
ITRE             6      0.00         0
IMCO           339      76.40        274
BUDG            45      4.44         2
CONT          2423      0.33         8
PECH           832      93.39        777
JURI          1285      1.71         22
PETI           899      35.26        317
FEMM           198      3.54         8
DROI             3      66.67        3
ECON          1448      22.10        320
ENVI          3620      16.85        679
DEVE          1278      3.60         46
REGI            87      28.74        26
AFCO            21      38.10        8
CULT            23      17.39        6

Examples

Some of the strongly different document structures

IMCO

https://www.europarl.europa.eu/cmsdata/187620/roll%20call%20votes%20IMCO%208%20October%202019-original.pdf vs https://www.europarl.europa.eu/cmsdata/260598/IMCO%20voting%20session%20-%2012%20December%202022.pdf vs https://www.europarl.europa.eu/cmsdata/215170/FINAL%20IMCO%20Voting%20Session%209%20Nov%202020%20(final%20votes%20with%20correction).pdf

EMPL

By far our favorites: https://www.europarl.europa.eu/cmsdata/277455/RCVs.pdf and https://www.europarl.europa.eu/cmsdata/264219/RCVs_Tuesday%2024%20January%202023.pdf 

AGRI

https://www.europarl.europa.eu/cmsdata/277455/RCVs.pdf vs https://www.europarl.europa.eu/cmsdata/274917/Voting%20Session%2019%20September_General%20budget%20of%20the%20European%20Union%20for%20the%20financial%20year%202024.pdf

ENVI

https://www.europarl.europa.eu/cmsdata/278959/Vote%20results_ENVI_LIBE_28%20November.pdf vs https://www.europarl.europa.eu/cmsdata/279178/2023-11-29%20votes%20and%20roll-call%20votes.pdf

INTA

https://www.europarl.europa.eu/cmsdata/280385/FInal%20votes%20by%20Roll-call%20votes%20AFET-INTA%2024%20January.pdf vs https://www.europarl.europa.eu/cmsdata/270652/ROLL%20CALL_25%20May%202023.pdf

CONT

https://www.europarl.europa.eu/cmsdata/246634/CONT%20Votes%2031%20March%202022%20(final%20vote).pdf vs https://www.europarl.europa.eu/cmsdata/267335/Roll%20call%20votes_Discharge%202021_EC_Amendments-22%20March%202023.pdf vs https://www.europarl.europa.eu/cmsdata/215260/Vote%20of%2012%20November%202020.pdf vs https://www.europarl.europa.eu/cmsdata/244953/CONT%20Vote%2010%20February%202022%20(final%20votes).pdf

IMCO

https://www.europarl.europa.eu/cmsdata/231415/IMCO%20Voting%20Session%2017%20March%202021%20(final%20votes).pdf vs https://www.europarl.europa.eu/cmsdata/279298/RCV%204.12.23_for%20publishing.pdf vs https://www.europarl.europa.eu/cmsdata/250728/Final%20Vote.pdf vs https://www.europarl.europa.eu/cmsdata/187620/roll%20call%20votes%20IMCO%208%20October%202019-original.pdf

JURI

https://www.europarl.europa.eu/cmsdata/267680/2023.03.01_RCV_JURI-LIBE_EN.pdf vs https://www.europarl.europa.eu/cmsdata/280315/2024.01.24_RCV_EN.pdf

BUDG

https://www.europarl.europa.eu/cmsdata/278191/Results%20of%20the%20voting%20session.pdf vs https://www.europarl.europa.eu/cmsdata/280329/Results%20of%20the%20voting%20session%20BUDG-CONT.pdf vs https://www.europarl.europa.eu/cmsdata/280265/Results%20of%20votes.pdf

PETI

https://www.europarl.europa.eu/cmsdata/280216/PETI%20OQ%20working%20conditions%20teachers%20in%20EU-RCV.pdf vs https://www.europarl.europa.eu/cmsdata/159885/roll%20call%20votes%2021-22%20January%202019.pdf

Some of the inconsistencies within a single document

Various vote types: https://www.europarl.europa.eu/cmsdata/238003/Results%20of%20roll%20call%20vote%2012%20July.pdf (PECH)

Mixed order of dossier ID and rapporteur: https://www.europarl.europa.eu/cmsdata/279062/Results%20of%20roll%20call%20vote%2029%20November%202023.pdf (PECH)

Rapporteurs sometimes have their group after their name, sometimes their role in the committee: https://www.europarl.europa.eu/cmsdata/279298/RCV%204.12.23_for%20publishing.pdf (IMCO)

Vote type variations

All the vote types below are distinct strings, even those that look the same.

Opinion votes:

'Adoption of a draft opinion'
'Adoption of a draft opinion in letter form'
'Adoption of draft opinion in letter form'
'EU 2022 Budgetary opinion – Final vote'
'Final vote on the adoption of draft opinion'
'Final vote on the draft opinion'
'Final vote on the draft opinion in form of a letter'
'Final vote on the opinion'
'Final voteon theopinion'
'FrameworkofethicalaspectsofArtificial Intelligence, robotics andrelatedtechnologies (2020/2012 (INL))-A. Geese (adoption of draft opinion)'
'Roll-call:– Single vote - Adoption of draft opinion in the form of a letter'
'Roll-call:– Single vote - adoption of draft opinion in the form of a letter'
'Vote on a draft opinion in letter form'
'Vote on draft opinion in letter form'
'Vote on the opinion'
'Vote on the opinion in letter form'
'adoption of draft opinion'
'draft opinion'
'of draft opinion)'
'opinion final'
'to 2027 - Opinion in letter form - Rapporteur - Adam Jarubas (EPP)'

Final votes:

' FINAL VOTE BY ROLL CALL IN COMMITTEE ASKED FOR OPINION'
' FINAL VOTE BY ROLL CALL IN COMMITTEE RESPONSIBLE'
'1. Final vote'
'2020 Discharge EFCA – Final vote'
'Better Law Making 2017 - 2019 – Final vote'
'Blue economy voting list – Final vote'
'CALL: Final vote'
'EU 2022 Budgetary opinion – Final vote'
'EU-Mauritania SPFA Resolution – Final vote'
'Establishing a Recovery and Resilience Facility – Final Vote (Pascal Canfin (Chair))'
'FINAL VOTE BY ROLL CALL IN COMMITTEE OPINION'
'FINAL VOTE BY ROLL CALL IN COMMITTEES RESPONSIBLE'
'Farm to Fork Strategy – Final vote'
'Final vote (rejection)'
'Final vote on the adoption of draft opinion'
'Final vote on the draft opinion'
'Final vote on the draft opinion in form of a letter'
'Final vote on the draft recommendation'
'Final vote on the draft recommendation for second reading'
'Final vote on the draft report'
'Final vote on the opinion'
'Final vote on the recommendation'
'Final vote on the report'
'Final vote on the resolution'
'Final vote- Draft as amended'
'Final voteon theopinion'
'Indo-Pacific Strategy – Final vote'
'Measures for a high common level of cybersecurity (NIS 2) – Final vote'
'Roaming on public mobile communications networks within the Union – Final vote'
'Roll-call: 2021 UN Climate Change Conference in Glasgow, UK (COP26) – Final vote'
'Roll-call: A Pharmaceutical Strategy for Europe – Final vote'
'Roll-call: Final vote'
'Roll-call: Implementation report on on-farm animal welfare – Final vote'
'Roll-call: Sustainable and Smart Mobility Strategy – Final vote'
'Roll-call:– Final vote'
'SCF – Final vote'
'Taxation of energy products - Mato – Final vote'
'draft final'
'final'
'opinion final'
'responsible final'
'·Final vote'
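To show how strings like the ones above might be reduced to a few coarse categories, here is a minimal sketch. The `squash` and `classify` helpers and the two labels are assumptions made for illustration. The idea is to drop everything except letters and digits before looking for keywords, so that lost spaces ('Final voteon theopinion') and stray punctuation no longer matter.

```python
import re


def squash(raw: str) -> str:
    """Lowercase and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", raw.lower())


def classify(raw: str) -> set[str]:
    """Return the coarse labels ('opinion', 'final') whose keyword occurs in the vote type."""
    key = squash(raw)
    labels = set()
    if "opinion" in key:
        labels.add("opinion")
    if "final" in key:
        labels.add("final")
    return labels


for raw in ["Final voteon theopinion", "Roll-call:– Final vote", "of draft opinion)", "·Final vote"]:
    print(repr(raw), sorted(classify(raw)))
```

A vote type can carry both labels, which matches the lists above, where several strings appear under both headings. Keyword matching of this kind says nothing about vote types that use neither word, and it cannot separate the dossier title from the vote type when both share one string.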