NO-DB Console - Part IV - Advanced Co-Authorship Mining

The following snippet is just a warm-up; you should be able to figure it out yourself if you followed the snippets from the previous installment of this series (hint: it has nothing to do with co-authorship):

{tuple(m.items()) + (VOTE['url'], VOTE['voteid'])
 for VOTE in [V
              for V in DBS['ep_votes'].values()
              if 'votes' in V]
 for t in ['+','-','0']
 for g in VOTE['votes'].get(t,{'groups':{}})['groups'].values()
 for m in g
 if 'mepid' not in m}

Context: Sadly the EP publishes the plenary votes with the names of the MEPs only, so it is up to Parltrack to figure out which name maps to which UserID. Unfortunately this process is not perfect and there are gaps. The above query lists all the votes and names that we were unable to resolve. Some of them are weird - definitely material to dig into deeper, and maybe to ask the EP some questions about the stranger ones.
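
As a hedged variation on the warm-up query (this is not part of the original console session): if you mainly want to see which unresolved entries recur most often, you can wrap essentially the same comprehension in a Counter - the entries are counted as-is, so no assumptions about their field names are needed:

from collections import Counter
# tally identical unresolved member records so the most frequent gaps surface first
unresolved = Counter(
    tuple(sorted(m.items()))
    for V in DBS['ep_votes'].values() if 'votes' in V
    for t in ['+','-','0']
    for g in V['votes'].get(t,{'groups':{}})['groups'].values()
    for m in g
    if 'mepid' not in m)
unresolved.most_common(20)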

You might notice that this and the following snippets all take some time; they work on the whole dataset and are thus kind of slow.

List Unidentified Amendment Authors

The context of this next snippet is similar to the previous one: amendments also specify authorship fuzzily. But instead of writing a comprehension, we actually write the query out as an almost full function:

from collections import Counter
unk=Counter()
for am in DBS['ep_amendments'].values():
    if len(am.get('meps',[]))!=len(am.get('authors','').split(',')):
        authors = {unws(x.strip().lower()) for x in am.get('authors','').split(',')}
        # first drop all author names that are already resolved to a mepid
        for m in am.get('meps',[]):
            name = DBS['ep_meps'][m]['Name']['full'].lower()
            if name in authors:
                authors.remove(name)
                continue
        # then try to resolve the remaining names ourselves
        for name in list(authors):
            mepid=mepid_by_name(normalize_name(name))
            if mepid:
                authors.remove(name)
                if 'meps' not in am: am['meps']=[]
                am['meps'].append(mepid)
        # whatever is left over is truly unidentified
        for a in authors:
            unk[a]+=1
print('\n'.join(["%s %s" % (cnt, name) for name, cnt in sorted(unk.items(), key=lambda x: x[1])]))
print(sum(unk.values()))

There are a few notable things here. We use a Counter object, which is a convenient way to - you guessed it - count things. We use the parltrack function unws(), which stands for unwhitespace - it removes redundant whitespace from a string. And we use the parltrack function normalize_name() when attempting to look up a MEP in the mepid_by_name index.
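
In case you are wondering what unws() does under the hood: it is roughly equivalent to the following one-liner (a simplification - the real parltrack helper may handle more cases):

def unws(s):
    # collapse runs of whitespace into single spaces and strip both ends
    return ' '.join(s.split())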

The following snippet is not as complex as the previous one, but it is more interesting in a data-mining kind of sense. It aggregates all groups of MEPs that have co-authored an amendment to the '2016/0280(COD)' dossier, and ranks them by the number of amendments submitted by each group.

from collections import Counter
groups = Counter()
for am in IDXs['ams_by_dossier']['2016/0280(COD)']:
    group = tuple(sorted({DBS['ep_meps'][mepid]['Name']['full'] for mepid in am['meps']}))
    groups[group]+=1
sorted(groups.items(),key=lambda x: x[1])

The following query is an improved version of the previous one; it also includes which political group each MEP was associated with at the time of co-authoring the amendment.

from collections import Counter
groups = Counter()
for am in IDXs['ams_by_dossier']['2016/0280(COD)']:
    group = set()
    for mepid in am.get('meps',[]):
        mep = DBS['ep_meps'][mepid]
        group.add((mep['Name']['full'], matchInterval(mep['Groups'], am['date'])['groupid']))
    group = tuple(sorted(group))
    groups[group]+=1
sorted(groups.items(),key=lambda x: x[1])

This one introduces a useful helper function: matchInterval(list, date). It takes a list of objects that each have a start and an end date, and returns the item whose interval contains the date given as the second parameter. It also handles open-ended intervals where the end is set to the year 9999.
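
For illustration, a minimal sketch of what matchInterval() might look like - the real parltrack helper may differ in details; this just shows the idea, assuming all dates involved are ISO-formatted strings (which compare correctly as plain strings):

def matchInterval(items, date):
    # return the entry whose [start, end] interval contains the given date;
    # an end of '9999-12-31T00:00:00' naturally covers any realistic date
    for item in items:
        if item['start'] <= date <= item['end']:
            return item
    return {}  # no match - the later snippets use .get(..., '???') to cope with this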

The following query is a variation of the above, but instead of focusing on only one dossier, it creates an all-time ranking:

from collections import Counter
groups = Counter()
for am in DBS['ep_amendments'].values():
    group = set()
    for mepid in am.get('meps',[]):
        if not mepid: continue
        mep = DBS['ep_meps'][mepid]
        if not mep: continue
        group.add((mep['Name']['full'], matchInterval(mep['Groups'], am['date']).get('groupid','???')))
    group = tuple(sorted(group))
    groups[group]+=1
sorted(groups.items(),key=lambda x: x[1])

The next variant adds the MEP's country and a weight to each MEP based on the number of days they have been in office at the time of running this query - a kind of seniority weight:

from collections import Counter
from datetime import datetime
groups = Counter()
for am in DBS['ep_amendments'].values():
    group = set()
    for mepid in am.get('meps',[]):
        if not mepid: continue
        mep = DBS['ep_meps'][mepid]
        if not mep: continue
        group.add((mep['Name']['full'],
                   matchInterval(mep['Groups'], am['date']).get('groupid','???'),
                   matchInterval(mep['Constituencies'], am['date']).get('country','???'),
                   sum(((datetime.now() if c['end'] == '9999-12-31T00:00:00'
                               else datetime.strptime(c['end'], u"%Y-%m-%dT%H:%M:%S"))
                        -datetime.strptime(c['start'], u"%Y-%m-%dT%H:%M:%S")).days
                       for c in mep['Constituencies'])))
    group = tuple(sorted(group))
    groups[group]+=1
sorted(groups.items(),key=lambda x: x[1])[-150:]

Our next-to-last - and quite complex - snippet turns the whole perspective around, and gives us a ranking of all the MEPs that MEP Axel Voss has co-authored amendments with, by the number of shared amendments:

from collections import Counter
from datetime import datetime
meps = Counter()
for am in DBS['ep_amendments'].values():
    if 96761 not in am.get('meps',[]): continue
    for mepid in am.get('meps',[]):
        if not mepid: continue
        mep = DBS['ep_meps'][mepid]
        if not mep: continue
        meps[(mep['Name']['full'],
             matchInterval(mep['Groups'], am['date']).get('groupid','???'),
             matchInterval(mep['Constituencies'], am['date']).get('country','???'),
             sum(((datetime.now() if c['end'] == '9999-12-31T00:00:00'
                               else datetime.strptime(c['end'], u"%Y-%m-%dT%H:%M:%S"))
                   -datetime.strptime(c['start'], u"%Y-%m-%dT%H:%M:%S")).days
                  for c in mep['Constituencies']))] += 1
sorted(meps.items(),key=lambda x: x[1])
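
The 96761 in the filter above is Axel Voss' UserID. To run the same query for a different MEP, you can resolve the id with the same helpers used in the amendment-author snippet further up (hedged - this assumes the lowercased name resolves cleanly):

# resolve a MEP's UserID by name - normalize_name() and mepid_by_name() are the
# parltrack helpers used earlier; swap in any other name and re-run the loop
mepid = mepid_by_name(normalize_name('axel voss'))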

This last one drops the Axel Voss filter from the query and creates an all-time top 30 of amendment authors, weighted by their time in office:

from collections import Counter
from datetime import datetime
# get weighted stats on amendment authors
def getmmd(mepid, am): #get mep metadata
    mep = DBS['ep_meps'][mepid]
    if not mep: return
    return (mep['Name']['full'],
             matchInterval(mep['Groups'], am['date']).get('groupid','???'),
             matchInterval(mep['Constituencies'], am['date']).get('country','???'),
             sum(((datetime.now() if c['end'] == '9999-12-31T00:00:00'
                         else datetime.strptime(c['end'], u"%Y-%m-%dT%H:%M:%S"))
                  -datetime.strptime(c['start'], u"%Y-%m-%dT%H:%M:%S")).days
                 for c in mep['Constituencies']))

meps = Counter()
for am in DBS['ep_amendments'].values():
    for mepid in am.get('meps',[]):
        if not mepid: continue
        tmp = getmmd(mepid, am)
        if not tmp: continue
        name, group, country, days = tmp
        meps[(name,group,country,days)] += 1
stats={}
for mep, cnt in meps.items():
    days=mep[3]
    stats[mep] = (cnt/float(days), cnt, days)
sorted(stats.items(),key=lambda x: x[1])[-30:]

And this concludes our little series on using the parltrack no-db console to dig up trivia and other facts of questionable utility from the parltrack data.

"What a ride! There is so many nuggets in this dataset, it really helps uncovering obscure details of the european wurst-maschinery." wraps up this series prof. Uriah Xavier Deinhof, deputy-elect for explosive strip-mining.