python - Pandas parallel groupby consumes tons of memory


I have a medium-sized file (~300 MB) containing a list of individuals (~300k) and the actions they performed. I'm trying to apply an operation to each individual using groupby and the parallelized version of apply described here. It looks like this:

import pandas as pd
import multiprocessing
from joblib import Parallel, delayed

def apply_parallel(df_grouped, func):
    ret_lst = Parallel(n_jobs=multiprocessing.cpu_count())(
        delayed(func)(group) for name, group in df_grouped)
    return pd.concat(ret_lst)

df = pd.read_csv(src)  # src is the path to the ~300 MB CSV
patients_table_raw = apply_parallel(df.groupby('id'), f)
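(src and f are not shown in the post; f is whatever per-group function produces one row of the target table. Purely as a hypothetical placeholder, it might look something like this:)

def f(group):
    # hypothetical stand-in for the real per-group function:
    # return a single-row DataFrame summarizing this id's group
    return pd.DataFrame({'id': [group['id'].iloc[0]],
                         'n_actions': [len(group)]})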

But unfortunately this consumes a hell of a lot of memory. I think it's related to the fact that even a simple command:

list_groups = list(df.groupby('id')) 

consumes several GB of memory! How should I proceed? My initial thought was to iterate over the groupby in small 'stacks', so as not to consume so much memory (but I didn't find a way to do that without casting it to a list).
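One way to get that batched iteration without the list() cast is to pull from the groupby's iterator with itertools.islice; a minimal sketch (the helper and batch_size are mine, not from the post):

from itertools import islice

def iter_group_batches(grouped, batch_size=1000):
    # yield lists of (name, group) pairs, at most batch_size per list,
    # consuming the groupby lazily instead of materializing it all at once
    it = iter(grouped)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        yield batch

results = []
for batch in iter_group_batches(df.groupby('id')):
    # only this batch's group copies are alive at any one time
    results.extend(f(group) for name, group in batch)

Note also that joblib's Parallel accepts a pre_dispatch argument (default '2*n_jobs') that bounds how many tasks are queued ahead of the workers when it is fed a generator, which should help cap memory in the parallel version too.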

More detailed context

I have a simple CSV dataset in the following fashion:

|-------------------------|
| id | timestamp | action |
|-------------------------|
| 1  | 0         | a      |
| 1  | 10        | b      |
| 1  | 20        | c      |
| 2  | 0         | b      |
| 2  | 15        | c      |
          ...

What I'm trying to do is create a different table that contains a description of the sequence of actions/timestamps of the individuals, together with their ids. It would let me retrieve individuals like this:

|------------------|
| id | description |
|------------------|
| 1  | 0a10b20c    |
| 2  | 0b15c       |
          ...

In order to do so, and to follow the pythonic way, my idea was to load the first table into a pandas DataFrame, group it by id, and apply a function to the grouping that returns, for each group (each id), one row of the table I want. However, I have lots of individuals in the dataset (around 1 million), and the groupby operation is extremely expensive (without explicit garbage collection, as mentioned in my own answer). Also, parallelizing the groupby implied significant memory use, because apparently things get duplicated.
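For concreteness, the single-process version of that idea might look like the sketch below (describe is a hypothetical name; the concatenation rule follows the tables above):

def describe(group):
    # build the description string, e.g. '0a10b20c', in timestamp order
    group = group.sort_values('timestamp')
    return ''.join(str(t) + a for t, a in zip(group['timestamp'], group['action']))

patients_table = (df.groupby('id')
                    .apply(describe)
                    .to_frame('description')
                    .reset_index())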

Therefore, the more detailed question is: how can I use groupby (and thereby make the data processing faster than if I implemented a loop of my own) without incurring a huge memory overhead?

Try this (without parallelization):

In [87]: df
Out[87]:
   id  timestamp action
0   1          0      a
1   1         10      b
2   1         20      c
3   2          0      b
4   2         15      c

In [88]: df.set_index('id').astype(str).sum(axis=1).groupby(level=0).sum().to_frame('description').reset_index()
Out[88]:
   id description
0   1    0a10b20c
1   2       0b15c
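A variant of the same idea, if the intermediate astype(str) frame is a concern, concatenates the two columns once and joins per group (just a sketch against the same df):

out = ((df['timestamp'].astype(str) + df['action'])
       .groupby(df['id'])
       .agg(''.join)            # string concatenation per id
       .to_frame('description')
       .reset_index())          # same result as Out[88] above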
