You are probably sitting on hot coals by now, hoping for some juicy
data mining with Python and our no-db console. But first let's have a
short digression showing a super-lightweight way of mining the
data. If you have never heard of the amazing jq tool, it's high time
(here's a nice tutorial). It allows you to work with JSON data. And
Parltrack datasets are in JSON format, a lucky but not random
coincidence! Let's see how you can do the same "query" we did in the
first post in this series:
This example assumes you have downloaded the ep_meps.json.lz dataset
but not decompressed it, hence the first line. To understand the
second line we just quote our https://parltrack.org/dumps page:
Due to most of the dumps being between 400 and 800 megabytes (at the time of writing in mid 2019) they might not be suitable to load all at once since when loaded into RAM they might use significantly more memory. To facilitate a record-by-record stream processing of these dumps, they are formatted in the following way, each line is one record, each prefixed either with:
'[' for the first record, ',' for the other records, ']' on its own for the last line
This means you can read the uncompressed JSON line-by-line, strip of the first character and process the rest of the line as JSON, you can stop processing if after stripping the first character an empty string remains, this means the end of the JSON stream.
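To make the quoted format concrete, here is a minimal Python sketch of the record-by-record reading it describes. The helper name `iter_records` and the sample data are invented for illustration and are not part of Parltrack; the sketch only assumes the layout quoted above (one record per line, a one-character prefix, and a lone `]` at the end).

```python
import io
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per line of a decompressed dump (hypothetical helper)."""
    for line in lines:
        # Drop the line ending and the one-character prefix ('[', ',' or ']').
        payload = line.rstrip("\r\n")[1:]
        if not payload.strip():
            # Nothing left after the prefix: this was the closing ']' line.
            break
        yield json.loads(payload)


# Tiny made-up sample in the same layout; the field names are invented.
sample = io.StringIO(
    '[{"id": 1, "name": "first"}\n'
    ',{"id": 2, "name": "second"}\n'
    ']\n'
)
records = list(iter_records(sample))
print(records)
```

With a real dump you would pass an open file object instead of the in-memory sample, so that only one record is held in memory at a time.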
Thus, in the second line of our example above, we strip the first
character. In the third line we conserve memory: instead of loading
the whole dataset at once, we use the fact that each record is on one
line. We read each line and pass it to jq separately. This way we
use much less memory and process the whole thing much quicker.
The fourth line is where the jq magic happens: we select only the
records of MEPs that are active, and write out their full names
and Twitter ids.
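For readers who would rather stay in Python, the same idea (strip the prefix, parse one line at a time, keep only active MEPs) can be sketched as follows. This is not the jq one-liner itself. The field names `active`, `Name.full` and `Twitter` are assumptions about the record layout, and the sample records are made up, so check them against a real record before relying on this.

```python
import json
from typing import Iterable, Iterator, Tuple


def active_meps_with_twitter(lines: Iterable[str]) -> Iterator[Tuple[str, list]]:
    """Yield (full name, Twitter entries) for each active MEP (hypothetical helper)."""
    for line in lines:
        # Same prefix handling as described on the dumps page.
        payload = line.rstrip("\r\n")[1:]
        if not payload.strip():
            break  # closing ']' line, end of the stream
        mep = json.loads(payload)
        # ASSUMPTION: each record carries a truthy "active" flag for current MEPs.
        if not mep.get("active"):
            continue
        # ASSUMPTION: the name lives under Name.full, Twitter entries under "Twitter".
        full_name = mep.get("Name", {}).get("full", "")
        twitter = mep.get("Twitter", [])
        yield full_name, twitter


# Made-up sample records, not real Parltrack data.
sample_lines = [
    '[{"active": true, "Name": {"full": "Ada Example"}, "Twitter": ["https://twitter.com/ada_example"]}',
    ',{"active": false, "Name": {"full": "Bob Retired"}, "Twitter": []}',
    ',{"active": true, "Name": {"full": "Cy Sample"}}',
    ']',
]
result = list(active_meps_with_twitter(sample_lines))
for name, twitter in result:
    print(name, " ".join(twitter), sep="\t")
```

Using `.get()` with defaults keeps the loop from failing on records that lack a Twitter entry, which is why the third sample record still shows up with an empty list.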
For the lazier copy/pasting readers among you, here is an example that extracts the e-mail addresses of the currently active MEPs from an already lzip-decompressed dataset:
And here is a variation that outputs all the links to the current MEPs' Declarations of Participation; these are PDF files in which MEPs admit to participating in some paid, non-neutral event:
"If you are into magical Unix one-liners, Parltrack dumps and jq are the perfect ingredients for an eternal friendship," romanticizes George W. Hayduke IV, second-level customer supporter.