I have a medium-sized file (~300 MB) containing a list of individuals (~300k) and the actions they performed. I'm trying to apply an operation to each individual using groupby and the parallelized version of apply described here. It looks like this:
    import multiprocessing

    import pandas as pd
    from joblib import Parallel, delayed

    def apply_parallel(df_grouped, func):
        # Run func on every group in parallel, then concatenate the results
        ret_lst = Parallel(n_jobs=multiprocessing.cpu_count())(
            delayed(func)(group) for name, group in df_grouped
        )
        return pd.concat(ret_lst)

    df = pd.read_csv(src)
    patients_table_raw = apply_parallel(df.groupby('id'), f)
Unfortunately, this consumes a hell of a lot of memory. I think it is related to the fact that this simple command:
list_groups = list(df.groupby('id'))
consumes several GB of memory! How should I proceed? My initial thought is to iterate over the groupby in small 'stacks' so that it doesn't consume too much memory (but I haven't found a way to do that without casting it to a list).
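For what it's worth, a GroupBy object can be iterated lazily, so the 'stacks' idea does not need list(df.groupby('id')). Below is a minimal sketch, assuming a toy DataFrame shaped like the one in this question. The names iter_group_batches and describe are hypothetical helpers made up for illustration, not part of pandas.

```python
from itertools import islice

import pandas as pd


def iter_group_batches(grouped, batch_size):
    """Yield lists of at most batch_size (key, group) pairs (hypothetical helper)."""
    it = iter(grouped)  # iterating a GroupBy yields (key, sub-DataFrame) pairs lazily
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def describe(key, group):
    """Build one output row for one id (hypothetical per-group function)."""
    text = "".join(f"{t}{a}" for t, a in zip(group["timestamp"], group["action"]))
    return {"id": key, "description": text}


# Toy data standing in for the real CSV
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "timestamp": [0, 10, 20, 0, 15],
    "action": ["a", "b", "c", "b", "c"],
})

parts = []
for batch in iter_group_batches(df.groupby("id"), batch_size=1):
    # Only batch_size groups are held as separate objects at any one time
    parts.append(pd.DataFrame([describe(key, group) for key, group in batch]))

result = pd.concat(parts, ignore_index=True)
```

Each batch could also be handed to a worker pool instead of being processed inline. Whether that actually lowers peak memory depends on how the groups are passed to the workers.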
More detailed context
I have a simple CSV dataset in the following fashion:
|-------------------------|
| id | timestamp | action |
|-------------------------|
| 1  | 0         | a      |
| 1  | 10        | b      |
| 1  | 20        | c      |
| 2  | 0         | b      |
| 2  | 15        | c      |
...
What I'm trying to do is create a different table that contains a description of the sequence of actions/timestamps of the individuals, along with their IDs. This will let me retrieve the individuals:
|------------------|
| id | description |
|------------------|
| 1  | 0a10b20c    |
| 2  | 0b15c       |
...
In order to do so, and to follow a Pythonic way, my idea was to load the first table into a pandas DataFrame, group by id, and apply a function to the grouping that returns a row of the table I want for each group (each id). However, I have lots of individuals in my dataset (around 1 million), and the groupby operation was extremely expensive (without explicit garbage collection, as I mentioned in my own answer). Also, parallelizing the groupby resulted in significant memory use, because apparently a lot of things get duplicated.
Therefore, the more detailed question is: how can I use groupby (and thereby make the data processing faster than if I implemented a loop of my own) without incurring a huge memory overhead?
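One possible way to keep memory bounded is to never load the whole file at once: read the CSV in chunks, build partial descriptions per chunk, and join the partials per id at the end. This is only a sketch, assuming the rows of each id already appear in chronological order in the file. The inline csv_text and the tiny chunksize are stand-ins for the real file and a realistic chunk size.

```python
import io

import pandas as pd

# Stand-in for the real file; in practice pass the CSV path to read_csv
csv_text = "id,timestamp,action\n1,0,a\n1,10,b\n1,20,c\n2,0,b\n2,15,c\n"

partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # One token per row, e.g. "10b"
    tokens = chunk["timestamp"].astype(str) + chunk["action"]
    # Partial description per id within this chunk only
    partials.append(tokens.groupby(chunk["id"]).agg("".join))

# An id may span several chunks, so join its partial strings in file order
combined = pd.concat(partials).groupby(level=0).agg("".join)
result = combined.rename("description").reset_index()
```

Only one chunk plus the (much smaller) partial descriptions are in memory at any time, at the cost of a second, cheap groupby over the partials.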
Try this (without parallelization):
In [87]: df
Out[87]:
   id  timestamp action
0   1          0      a
1   1         10      b
2   1         20      c
3   2          0      b
4   2         15      c

In [88]: df.set_index('id').astype(str).sum(axis=1).groupby(level=0).sum().to_frame('description').reset_index()
Out[88]:
   id description
0   1    0a10b20c
1   2       0b15c
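A variant of the same idea, in case it is useful: build the per-row token explicitly and join the tokens per id, which avoids relying on sum over string columns. This is a sketch on the same toy data, not a benchmark. The variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "timestamp": [0, 10, 20, 0, 15],
    "action": ["a", "b", "c", "b", "c"],
})

# One string per row: timestamp followed by action, e.g. "10b"
tokens = df["timestamp"].astype(str) + df["action"]

# Concatenate the tokens of each id in row order
description = (
    tokens.groupby(df["id"])
    .agg("".join)
    .rename("description")
    .reset_index()
)
```

Like the one-liner above, this works on whole columns rather than calling a Python function per group on a sub-DataFrame, which is usually where the per-group overhead comes from.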