Announcing two updated and one new scraper


Today we release major updates to some of our scrapers:

Plenary scraper

The plenary scraper is a new addition to the family of parltrack scrapers. It comes with a whole new PDF-to-text back-end based on pdfplumber. We use this back-end to scrape all the amendments submitted to the plenary vote, or at least those that are published as PDFs. Some amendments are published directly in the HTML page; we scrape these as well. As a little bonbon we also check whether the HTML amendments are actually the same as the PDF amendments (check out the "inconsistent" field in our data dump). It turns out there are a lot of discrepancies. Some go as far as having a sentence in the HTML that is negated by an added "no" in the PDF version. Makes you wonder...
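To illustrate the kind of HTML-versus-PDF check described above, here is a minimal sketch using only the Python standard library. It is not the parltrack implementation: the function names, the normalization rules and the sample sentence are all assumptions made up for this example. The idea is to normalize both texts so that whitespace and typographic differences do not count, and then flag the pair as inconsistent if anything still differs.

```python
import difflib
import re
import unicodedata

# Hypothetical helper names; parltrack's real code may work differently.
_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize(text):
    """Fold typographic variants and collapse whitespace before comparing."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u00ad", "")  # drop soft hyphens left by PDF extraction
    text = text.translate(_QUOTES)     # curly quotes -> straight quotes
    return re.sub(r"\s+", " ", text).strip().lower()


def is_inconsistent(html_text, pdf_text):
    """True if the two versions still differ after normalization."""
    return normalize(html_text) != normalize(pdf_text)


def differing_words(html_text, pdf_text):
    """Return (words only in the HTML version, words only in the PDF version)."""
    a = normalize(html_text).split()
    b = normalize(pdf_text).split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    only_html, only_pdf = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            only_html.extend(a[i1:i2])
        if tag in ("replace", "insert"):
            only_pdf.extend(b[j1:j2])
    return only_html, only_pdf


if __name__ == "__main__":
    # Made-up sentences, not taken from any real amendment.
    html = "Member States shall  apply this rule."
    pdf = "Member States shall not apply\nthis rule."
    print(is_inconsistent(html, pdf))
    print(differing_words(html, pdf))
```

Comparing word by word rather than character by character keeps the report readable: a single added negation shows up as one extra word instead of a scattered character diff.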

One limitation of both amendment scrapers, PDF and HTML, is that often the complete text is published instead of distinct amendments, with bold text marking new or changed text and a special character signalling the location of deleted text. We do not scrape these complete versions with font markup.

By far the most important feature of the amendment scraper is that it also automatically finds the plenary votes on each amendment, so you can see the actual amendment that a vote was about. This is not yet exposed on the web interface, but that might change soon.

Another very cool feature of the plenary scraper: although we gave up on scraping committee roll-call votes, the plenary pages actually contain the results of the committees' final votes. We scrape at least these, which is better than nothing.

Check out the schema of this new data set here

Amendments scraper

As a collateral benefit of the plenary amendments scraper, the committee amendments scraper also got an update. We ditched the old pdftotext back-end and now use the same back-end as the plenary amendment scraper. This change gives us better data quality, reflecting the metadata and the amendments themselves much more accurately. One drawback, though: scraping PDFs slowed down by a factor of 8, but we feel it is totally worth it.
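For readers who have not used pdfplumber, the following is a minimal sketch of page-by-page text extraction with it. This is only an illustration, not the actual parltrack back-end, which does considerably more work to recover metadata and amendment structure. The function names `clean_page_text` and `pdf_to_lines` are hypothetical; `pdfplumber.open`, `pdf.pages` and `page.extract_text()` are part of the pdfplumber API.

```python
import re


def clean_page_text(text):
    """Split extracted page text into stripped, non-empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return [line for line in lines if line]


def pdf_to_lines(path):
    """Extract the text lines of every page of a PDF using pdfplumber."""
    import pdfplumber  # third-party dependency: pip install pdfplumber

    out = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # extract_text() can come back empty for pages without a text layer
            out.extend(clean_page_text(page.extract_text() or ""))
    return out
```

pdfplumber is written in pure Python and works from the position of every character on the page, which is one plausible reason why it is slower than the pdftotext command-line tool while giving finer control over the result.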

Comagenda scraper

Despite our failure to provide scrapers for committee roll-call votes, there is a consolation prize: we found a new data source for the committee agendas, in JSON no less! As a result we have much more comprehensive, reliable, up-to-date and accurate information on committee agendas.

Check out the schema of this new data set here

Thanks

This work was kindly financed by the Greens/EFA group in the European Parliament.