python - Parallelize pandas apply


I'm new to pandas and want to parallelize a row-wise apply operation. So far I have found "Parallelize apply after pandas groupby"; however, that only seems to work for grouped data frames.

My use case is different: I have a list of holidays, and for the current row/date I want to find the number of days before and after this day to the next holiday.

This is the function I call via apply:

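    import numpy as np

    def get_nearest_holiday(x, pivot):
        # x: iterable of holiday dates; pivot: the date of the current row
        nearestholiday = min(x, key=lambda d: abs(d - pivot))
        difference = abs(nearestholiday - pivot)
        # convert the timedelta to a number of days
        return difference / np.timedelta64(1, 'D')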

How can I speed it up?

Edit

I experimented a bit with Python's pools, but it was neither nice code, nor did I get my computed results.

I think going down the route of trying stuff in parallel is probably overcomplicating this. I haven't tried this approach on a large sample, so your mileage may vary, but it should give you an idea...

Let's just start with some dates...

    import pandas as pd

    dates = pd.to_datetime(['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03'])

We'll use some holiday data from pandas.tseries.holiday - note that in effect we want a DatetimeIndex...

    from pandas.tseries.holiday import USFederalHolidayCalendar

    holiday_calendar = USFederalHolidayCalendar()
    holidays = holiday_calendar.holidays('2016-01-01')

This gives us:

    DatetimeIndex(['2016-01-01', '2016-01-18', '2016-02-15', '2016-05-30',
                   '2016-07-04', '2016-09-05', '2016-10-10', '2016-11-11',
                   '2016-11-24', '2016-12-26',
                   ...
                   '2030-01-01', '2030-01-21', '2030-02-18', '2030-05-27',
                   '2030-07-04', '2030-09-02', '2030-10-14', '2030-11-11',
                   '2030-11-28', '2030-12-25'],
                  dtype='datetime64[ns]', length=150, freq=None)

Now we find the indices of the next nearest holiday for each of the original dates using searchsorted:

    indices = holidays.searchsorted(dates)
    # array([1, 6, 9, 3])
    next_nearest = holidays[indices]
    # DatetimeIndex(['2016-01-18', '2016-10-10', '2016-12-26', '2016-05-30'], dtype='datetime64[ns]', freq=None)
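As a small aside on how searchsorted picks an index: with the default side='left' it returns the position of the first holiday on or after the date, so a date that is itself a holiday maps to that holiday. The sketch below uses a short, hand-written holiday list (an assumption for illustration, not the calendar above) to show the difference between the two sides.

```python
import pandas as pd

# Hypothetical, hand-written (sorted) holiday list, used only for illustration.
holidays = pd.DatetimeIndex(['2016-01-01', '2016-01-18', '2016-02-15'])
# The second date falls exactly on a holiday.
dates = pd.DatetimeIndex(['2016-01-10', '2016-01-18'])

# side='left' (default): index of the first holiday >= date.
left = holidays.searchsorted(dates, side='left')

# side='right': index of the first holiday > date.
right = holidays.searchsorted(dates, side='right')

print(list(left), list(right))
```

Which side is appropriate depends on whether a date that falls on a holiday should count as zero days away from the next holiday.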

Then take the difference between the next holidays and the original dates:

    next_nearest_diff = pd.to_timedelta(next_nearest.values - dates.values).days
    # array([15, 31, 14, 88])

You'll need to be careful about the indices so they don't wrap around, and for the previous holiday, do the calculation with indices - 1, but it should act as (I hope) a relatively good base.
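The two caveats (indices running off either end of the holiday list, and using indices - 1 for the previous holiday) could be handled roughly as follows. This is only a sketch under stated assumptions: the function name days_to_holidays, the column names, and the short hand-written holiday list are all hypothetical, and dates with no earlier or later holiday are assumed to get NaN rather than an error.

```python
import numpy as np
import pandas as pd

def days_to_holidays(dates, holidays):
    """Days since the previous holiday and until the next one, per date.

    Hypothetical helper: a date that is itself a holiday counts as
    0 days to the next holiday (searchsorted uses side='left').
    """
    dates = pd.DatetimeIndex(dates)
    holidays = pd.DatetimeIndex(holidays).sort_values()

    idx = holidays.searchsorted(dates)   # first holiday >= date
    has_next = idx < len(holidays)       # False if date is after the last holiday
    has_prev = idx > 0                   # False if date is before the first holiday

    # Clip so that indexing never runs off either end of the holiday list.
    last = len(holidays) - 1
    next_idx = np.clip(idx, 0, last)
    prev_idx = np.clip(idx - 1, 0, last)

    to_next = np.asarray((holidays[next_idx] - dates).days, dtype=float)
    since_prev = np.asarray((dates - holidays[prev_idx]).days, dtype=float)

    # Mask the positions where no such holiday exists.
    to_next[~has_next] = np.nan
    since_prev[~has_prev] = np.nan

    return pd.DataFrame(
        {'days_since_prev': since_prev, 'days_to_next': to_next},
        index=dates,
    )

# Hand-written holiday list, for illustration only.
example_holidays = ['2016-01-01', '2016-01-18', '2016-02-15']
example_dates = ['2015-12-25', '2016-01-03', '2016-03-01']
result = days_to_holidays(example_dates, example_holidays)
print(result)
```

The distance to the nearest holiday in either direction, which is what the original get_nearest_holiday computes, would then be the row-wise minimum of the two columns, for example result.min(axis=1), which skips NaN by default.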

