I have a huge gzip file (several GB) of tab-delimited text that I want to parse into a pandas dataframe.
If the contents of the file were plain text, one would simply use .split(), e.g.

    file_text = """abc 123 cat 456 dog 678 bird 111 fish ...
                   moon 1969 revolution 1789 war 1927 reformation 1517 maxwell ..."""
    data = [line.split() for line in file_text.split('\n')]
and then put the data into a pandas dataframe using

    import pandas as pd
    df = pd.DataFrame(data)
However, this isn't a text document: it's a tab-delimited file inside a gzip archive, holding several GB of data. What is the most efficient way to parse this data into a dataframe? Should I still be using .split()?
I guess the first step would be to use

    import gzip

    with gzip.open(filename, 'rt') as f:  # text mode, so f.read() returns a str
        file_content = f.read()
and then call .split() on file_content, but loading several GB into a single variable and splitting it afterwards seems inefficient. Is it possible to do this in "chunks"?
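(For reference, gzip.open returns a file object that can be iterated line by line, so one could at least avoid building the single giant string. A rough sketch of that idea, not necessarily the most efficient route:)

    import gzip

    data = []
    with gzip.open(filename, 'rt') as f:  # decompresses and decodes on the fly
        for line in f:                    # reads one line at a time
            data.append(line.rstrip('\n').split('\t'))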
read_csv() supports gzipped files, so you can do the following:

    for chunk in pd.read_csv('/path/to/file.csv.gz', sep='\s*', chunksize=10**5):
        # process the chunk, which is itself a regular dataframe
If you are sure that you have a TSV (tab-separated file), you can use sep='\t' instead.
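As a concrete illustration of the chunked pattern (the dropna() step is just a placeholder for whatever per-chunk processing you need): pandas infers gzip compression from the .gz extension by default, and a literal sep='\t' lets it use the fast C parsing engine, whereas a regex separator like '\s*' falls back to the slower Python engine.

    import pandas as pd

    pieces = []
    for chunk in pd.read_csv('/path/to/file.csv.gz', sep='\t', chunksize=10**5):
        # placeholder processing: drop incomplete rows in each chunk
        pieces.append(chunk.dropna())

    # combine the processed chunks into a single dataframe
    df = pd.concat(pieces, ignore_index=True)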