This post is also available in: es
Let's use the next dummy data:
nuke = {'124765108': {'cusipAliasRecord': [{'code': 8888,
'identifier': '0003dc0049f84c22',
'orgid': 8888,
'purpose': 'REUQID'},
{'code': 42, 'identifier': '124765108', 'orgid': 42, 'purpose': 'CUSIP'},
{'code': 70, 'identifier': 'CA1247651088', 'orgid': 70, 'purpose': 'ISIN'},
{'code': 28613,
'identifier': 'CA1247651088',
'orgid': 28613,
'purpose': 'CHL'},
{'code': 284, 'identifier': 'CAE', 'orgid': 284, 'purpose': 'CONRIC'}]},
'302081104': {'cusipAliasRecord': [{'code': 8888,
'identifier': '0003dd04eddc32d9',
'orgid': 8888,
'purpose': 'REUQID'},
{'code': 42, 'identifier': '302081104', 'orgid': 42, 'purpose': 'CUSIP'},
{'code': 284, 'identifier': 'EXLS.O', 'orgid': 284, 'purpose': 'CONRIC'},
{'code': 70, 'identifier': 'US3020811044', 'orgid': 70, 'purpose': 'ISIN'},
{'code': 28613,
'identifier': 'US3020811044',
'orgid': 28613,
'purpose': 'CHL'},
{'code': 28774,
'identifier': 'US3020811044',
'orgid': 28774,
'purpose': 'SIMC'}]}}
This is a nested dictionary, that he wanted to transform to a plain table, one row per entry as follows:
code identifier orgid purpose cusip
0 8888 0003dc0049f84c22 8888 REUQID 124765108
1 42 124765108 42 CUSIP 124765108
2 70 CA1247651088 70 ISIN 124765108
3 28613 CA1247651088 28613 CHL 124765108
4 284 CAE 284 CONRIC 124765108
5 8888 0003dd04eddc32d9 8888 REUQID 302081104
6 42 302081104 42 CUSIP 302081104
7 284 EXLS.O 284 CONRIC 302081104
8 70 US3020811044 70 ISIN 302081104
9 28613 US3020811044 28613 CHL 302081104
10 28774 US3020811044 28774 SIMC 302081104
The initial solution wasn't fast at all:
def secmasunnest(data,dataset,datarecord):
df=pd.DataFrame()
for x in range(0,len(data)):
cusip = data[dataset].index.values[x]
try:
a = pd.DataFrame(data[dataset][x][datarecord])
a.insert(loc=0, column="Cusip", value=cusip)
except Exception:
a = pd.DataFrame()
df = df.append(a)
df.insert(loc=0, column="Fecha_Insert", value=str(fecha))
return df
This function was call on every iteration. Nothing bad per se, except that calling pd.DataFrame
on every iteration it is not fast at all! (plus some extra complications).
Using a list and dict comprehension that can be later converted to a pandas DataFrame it should be orders of magnitude faster.
So using a couple of lines (literally), this can be solved in a very fast way:
import pandas as pd
all = [ {**v, **{'cusip': k}} for k in nuke for v in nuke[k]['cusipAliasRecord'] ]
pd.DataFrame(all)
And so, you only call Pandas
once. The **
works in the same way as **kwargs
for passing named function arguments, but in this case can be used inside a dictionary (from Python 3.5 and on). In this case, it allows to concatenate dictionaries, so v
contains all the data from a row (as dict) for code
, identifier
, orgid
and purpose
, directly extracted from the original dictionary, and then we append to those records the cusip
value of k
. Very pytonic way to do it!