Speed up nested dictionary reading/parsing

This post is also available in: en

Let's use the next dummy data:

nuke = {'124765108': {'cusipAliasRecord': [{'code': 8888,
    'identifier': '0003dc0049f84c22',
    'orgid': 8888,
    'purpose': 'REUQID'},
   {'code': 42, 'identifier': '124765108', 'orgid': 42, 'purpose': 'CUSIP'},
   {'code': 70, 'identifier': 'CA1247651088', 'orgid': 70, 'purpose': 'ISIN'},
   {'code': 28613,
    'identifier': 'CA1247651088',
    'orgid': 28613,
    'purpose': 'CHL'},
   {'code': 284, 'identifier': 'CAE', 'orgid': 284, 'purpose': 'CONRIC'}]},
 '302081104': {'cusipAliasRecord': [{'code': 8888,
    'identifier': '0003dd04eddc32d9',
    'orgid': 8888,
    'purpose': 'REUQID'},
   {'code': 42, 'identifier': '302081104', 'orgid': 42, 'purpose': 'CUSIP'},
   {'code': 284, 'identifier': 'EXLS.O', 'orgid': 284, 'purpose': 'CONRIC'},
   {'code': 70, 'identifier': 'US3020811044', 'orgid': 70, 'purpose': 'ISIN'},
   {'code': 28613,
    'identifier': 'US3020811044',
    'orgid': 28613,
    'purpose': 'CHL'},
   {'code': 28774,
    'identifier': 'US3020811044',
    'orgid': 28774,
    'purpose': 'SIMC'}]}}

This is a nested dictionary, that he wanted to transform to a plain table, one row per entry as follows:

     code        identifier  orgid purpose      cusip
0    8888  0003dc0049f84c22   8888  REUQID  124765108
1      42         124765108     42   CUSIP  124765108
2      70      CA1247651088     70    ISIN  124765108
3   28613      CA1247651088  28613     CHL  124765108
4     284               CAE    284  CONRIC  124765108
5    8888  0003dd04eddc32d9   8888  REUQID  302081104
6      42         302081104     42   CUSIP  302081104
7     284            EXLS.O    284  CONRIC  302081104
8      70      US3020811044     70    ISIN  302081104
9   28613      US3020811044  28613     CHL  302081104
10  28774      US3020811044  28774    SIMC  302081104

The initial solution wasn't fast at all:

def secmasunnest(data,dataset,datarecord):
    df=pd.DataFrame()
    for x in range(0,len(data)):
        cusip = data[dataset].index.values[x]
        try:
            a = pd.DataFrame(data[dataset][x][datarecord])
            a.insert(loc=0, column="Cusip", value=cusip)
        except Exception:
            a = pd.DataFrame()
        df = df.append(a)
    df.insert(loc=0, column="Fecha_Insert", value=str(fecha))
    return df

This function was call on every iteration. Nothing bad per se, except that calling pd.DataFrame on every iteration it is not fast at all! (plus some extra complications).

Using a list and dict comprehension that can be later converted to a pandas DataFrame it should be orders of magnitude faster.

So using a couple of lines (literally), this can be solved in a very fast way:

import pandas as pd

all = [ {**v, **{'cusip': k}} for k in nuke for v in nuke[k]['cusipAliasRecord'] ]
pd.DataFrame(all)

And so, you only call Pandas once. The ** works in the same way as **kwargs for passing named function arguments, but in this case can be used inside a dictionary (from Python 3.5 and on). In this case, it allows to concatenate dictionaries, so v contains all the data from a row (as dict) for code, identifier, orgid and purpose, directly extracted from the original dictionary, and then we append to those records the cusip value of k. Very pytonic way to do it!

Speed up nested dictionary reading/parsing

Published

Category

Tags

Contacto