python - Analyzing huge dataset from 80GB file with only 32GB of RAM
I have a huge text file (around 80 GB). The file contains numbers (integers and floats) and has 20 columns. I have to analyze each column. By analyze I mean I have to do basic calculations on each column: finding the mean, plotting histograms, checking whether a condition is satisfied or not, etc. I am reading the file as follows:
    import numpy as np

    with open(filename) as original_file:
        all_rows = [[float(digit) for digit in line.split()] for line in original_file]
    all_rows = np.asarray(all_rows)
After that I do the analysis on specific columns. I use a server/workstation with a 'good' configuration (32 GB of RAM) to execute the program. The problem is that I am not able to finish the job: I waited a whole day for the program to finish, but it was still running after one day and I had to kill it manually. I know the script is correct and error-free, because I have tried the same script on smaller files (around 1 GB) and it worked nicely.
My initial guess is that I have a memory problem. Is there a way I can run such a job? A different method or some other approach?
I tried splitting the file into smaller files and analyzing them individually in a loop, as follows:
pre_name = "split_file" k in range(11): #there 10 files 8g each filename = pre_name+str(k).zfill(3) #my files in form "split_file000, split_file001 ..." open(filename) original_file: all_rows = [[float(digit) digit in line.split()] line in original_file] all_rows = np.asarray(all_rows) #some analysis here plt.hist(all_rows[:,8],100) #plotting histogram 9th column all_rows = none
I have tested the above code on a bunch of smaller files and it works fine. But again I run into the same problem when I use it on the big files. Any suggestions? Is there another, cleaner way to do this?
For such lengthy operations (when the data doesn't fit in memory), it might be useful to use a library like dask (http://dask.pydata.org/en/latest/), in particular dask.dataframe.read_csv, to read the data and then perform the operations as you would in the pandas library (another useful package worth mentioning).
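A minimal sketch of that approach, assuming the file is whitespace-delimited with no header row; the filename data.txt, the column names c0..c19, and the explicit histogram range are illustrative assumptions rather than details from the question:

    import dask.array as da
    import dask.dataframe as dd
    import matplotlib.pyplot as plt

    # Lazily open the big file in blocks; nothing is loaded into RAM yet.
    df = dd.read_csv(
        "data.txt",                            # assumed filename
        sep=r"\s+",                            # whitespace-separated numbers
        header=None,
        names=[f"c{i}" for i in range(20)],    # 20 columns, as in the question
    )

    # Per-column means, computed out of core, block by block.
    print(df.mean().compute())

    # Histogram of the 9th column (index 8) without materializing the column:
    col = df["c8"].to_dask_array(lengths=True)
    lo, hi = col.min().compute(), col.max().compute()
    counts, edges = da.histogram(col, bins=100, range=(lo, hi))
    plt.stairs(counts.compute(), edges)        # matplotlib >= 3.4; plt.bar works too
    plt.show()

Because dask keeps only a few blocks in memory at a time, the same script should scale from a 1 GB test file to the full 80 GB file on a 32 GB machine.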