JQ - a diversion

You are probably sitting on hot coals by now, hoping for some juicy data mining with Python and our no-db console. But first, let's have a short digression showing a super-lightweight way of mining the data. If you have never heard of the amazing jq tool, it's high time (here's a nice tutorial). It lets you work with JSON data, and the Parltrack datasets are in JSON format, a lucky but not random coincidence! Let's see how you can run the same "query" we did in the first post in this series:

lzip -dc ep_meps.json.lz |
     sed 's/.\(.*\)/\1/' |
     while read -r rec; do
          echo "$rec" | jq -c 'select(.active)|.Name.full, .Twitter'
     done

This example assumes you have downloaded the ep_meps.json.lz dataset but not decompressed it, hence the first line. To understand the second line, we just quote our https://parltrack.org/dumps page:

Since most of the dumps are between 400 and 800 megabytes (at the time of writing in mid-2019), they might not be suitable for loading all at once, as they can take up significantly more memory once parsed into RAM. To facilitate record-by-record stream processing of these dumps, they are formatted so that each line is one record, prefixed with:

'[' for the first record, ',' for the other records, and ']' on its own as the last line

This means you can read the uncompressed JSON line by line, strip off the first character, and process the rest of the line as JSON. You can stop processing when, after stripping the first character, an empty string remains: this marks the end of the JSON stream.
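To make the format concrete, here is a tiny toy "dump" (made-up records, not real data) run through the same sed expression, which deletes the first character of every line:

```shell
# a three-line toy dump in the Parltrack format: '[' + first record,
# ',' + each further record, and ']' alone on the last line
printf '[%s\n,%s\n]\n' '{"a":1}' '{"a":2}' |
    sed 's/.\(.*\)/\1/'
# prints the two records, each on its own line, then an empty line
# marking the end of the stream
```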

Thus, in our example above, the second line strips the first character. The third line conserves memory: instead of loading the whole dataset at once, we exploit the fact that each record sits on its own line, so we read the lines one by one and pass each to jq separately. This way we use much less memory and process the whole thing much quicker.

The fourth line is where the jq magic happens: we select only the records of active MEPs and write out their full names and Twitter handles.
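Incidentally, jq happily consumes a whole stream of JSON values on its own, so the shell loop can be collapsed into a single jq invocation. A sketch on two made-up records (the names and handles below are hypothetical, not taken from the dump):

```shell
# two fake records in the dump format; only the active one survives the filter
printf '[%s\n,%s\n]\n' \
    '{"active":true,"Name":{"full":"Jane Doe"},"Twitter":"@jane"}' \
    '{"active":false,"Name":{"full":"John Roe"},"Twitter":"@john"}' |
    sed 's/.\(.*\)/\1/' |
    jq -c 'select(.active)|.Name.full, .Twitter'
# prints "Jane Doe" and "@jane"
```

This spares one jq start-up per record; the per-line loop is still handy when you want shell logic between records.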

For the lazier copy/pasters among you, here is an example that extracts the e-mail addresses of the currently active MEPs from an already lzip-decompressed dataset:

cat db/ep_meps.json |
     sed 's/.\(.*\)/\1/' |
     while read -r rec; do
         echo "$rec" | jq -c 'select(.active)|.Name.full, .Mail'
     done
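jq can also shape the output for you, for example into CSV via the @csv filter. A hedged sketch, assuming .Mail is a list of addresses (as it is in the ep_meps dump); the record below is made up:

```shell
# made-up record; assumes .Mail is a list of addresses
printf '%s\n' '{"active":true,"Name":{"full":"Jane Doe"},"Mail":["jane@example.org","jd@example.org"]}' |
    jq -r 'select(.active) | [.Name.full, (.Mail // [] | join(";"))] | @csv'
# prints: "Jane Doe","jane@example.org;jd@example.org"
```

The `// []` guards against records that lack a Mail field, and -r emits the CSV as raw text instead of a JSON string.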

And here is a variation that outputs all links to the current MEPs' Declarations of Participation; these are PDF files in which MEPs disclose that they took part in some paid, non-neutral event:

cat db/ep_meps.json |
     sed 's/.\(.*\)/\1/' |
     while read -r rec; do
         echo "$rec" | jq -c 'select(.active and .["Declarations of Participation"])|.Name.full, .["Declarations of Participation"][].url'
     done
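Two jq details are doing the work here: the .["…"] form indexes keys that contain spaces, and `and` inside select() treats any present, non-null value as true. A self-contained sketch on a hypothetical record shaped like the dump:

```shell
# hypothetical record; .["..."] indexes keys containing spaces, and the
# select() only passes records that are active AND have declarations
printf '%s\n' '{"active":true,"Name":{"full":"Jane Doe"},"Declarations of Participation":[{"url":"https://example.org/decl.pdf"}]}' |
    jq -c 'select(.active and .["Declarations of Participation"]) | .Name.full, .["Declarations of Participation"][].url'
# prints "Jane Doe" then "https://example.org/decl.pdf"
```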

"If you are into magical unix one-liners, Parltrack dumps and jq are the perfect ingredients for an eternal friendship," romanticizes George W. Hayduke IV, second-level customer supporter.