I have a huge gzip file (several GB) of tab-delimited text that I want to parse into a pandas DataFrame.
If the contents of the file were plain text, I would use .split(), e.g.

```python
file_text = """abc   123   cat   456   dog   678   bird   111   fish   ...
moon   1969    revolution    1789   war   1927   reformation    1517    maxwell   ..."""
data = [line.split() for line in file_text.split('\n')]
```

and put the data into a pandas DataFrame using
```python
import pandas as pd
df = pd.DataFrame(data)
```

However, this isn't a text document; it's a tab-delimited file inside a gzip archive, holding several GB of data. What is the most efficient way to parse this data into a DataFrame, without using .split()?
I guess a first step would be to use
```python
import gzip

with gzip.open(filename, 'r') as f:
    file_content = f.read()
```

and then use .split() on file_content, but saving several GB into a single variable and then splitting it would be inefficient. Is it possible to do this in "chunks"?
read_csv() supports gzipped files, so you can do the following:
```python
for chunk in pd.read_csv('/path/to/file.csv.gz', sep=r'\s+', chunksize=10**5):
    # process each chunk, which is itself a DataFrame
    ...
```

If you are sure you have a TSV (tab-separated file), you can use sep='\t' instead.
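If the goal is a single DataFrame at the end rather than per-chunk processing, the chunks can be concatenated. A minimal sketch, assuming the parsed data fits in memory once loaded and the file has no header row (the path here is a placeholder):

```python
import pandas as pd

# Placeholder path; pandas infers gzip compression from the
# .gz extension, so no explicit decompression step is needed.
chunks = pd.read_csv('/path/to/file.tsv.gz', sep='\t', header=None,
                     chunksize=10**5)

# Combine the chunks into one DataFrame. If the parsed data is too
# large for memory, aggregate or write out each chunk inside a loop
# instead of concatenating.
df = pd.concat(chunks, ignore_index=True)
```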