Windows disk usage issues with Python


I am executing the Python code that follows.

I am running it on a folder ("articles") that has a couple of hundred subfolders and 240,226 files in all.

I am timing the execution. At first the times were pretty stable, but they went non-linear after 100,000 files. Now the times (I am timing at 10,000-file intervals) can go non-linear after 30,000 files or so (or not).

I have Task Manager open and can correlate the slow-downs with 99% disk usage by python.exe. I have tried gc.collect(), dels etc., and turned off Windows indexing. I have restarted Windows and emptied the trash (I have a few hundred GB free). Nothing helps; if anything, the disk usage seems to be getting more erratic.

Sorry for the long post, and thanks for any help.

import glob
import os

def get_filenames():
    dirs = []
    nxml_files = []
    # Collect every subfolder name under "articles/".
    for (dirpath, dirnames, filenames) in os.walk("articles/"):
        dirs.extend(dirnames)

    # List the .nxml files in each subfolder.
    for dir in dirs:
        path = "articles" + "\\" + dir
        nxml_files.extend(glob.glob(path + "/*.nxml"))

    return nxml_files

def extract_text_from_files(nxml_files):
    for nxml_file in nxml_files:
        fast_parse(nxml_file)

def fast_parse(infile):
    file = open(infile, "r")
    filetext = file.read()
    tag_breaks = filetext.split('><')
    # Keep only the pieces that start a <p> element.
    paragraphs = [tag_break.strip('p>').strip('</') for tag_break in tag_breaks if tag_break.startswith('p>')]

def run_files():
    nxml_files = get_filenames()
    extract_text_from_files(nxml_files)

if __name__ == "__main__":
    run_files()

There are a few things that could be optimized.

First, if you open files, close them as well. A with open(...) as name: block does that easily. By the way, in Python 2 file is a bad choice for a variable name, because it is the name of a built-in function.
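A minimal sketch of the first point, assuming the same parsing logic as fast_parse in the question. The handle name nxml is my choice, and returning the paragraphs is an addition so that the result can be used by the caller.

```python
def fast_parse(infile):
    # The with block closes the file as soon as the block is left,
    # even if read() raises an exception.
    with open(infile, "r") as nxml:
        filetext = nxml.read()
    tag_breaks = filetext.split('><')
    # Same paragraph extraction as in the question.
    return [tag_break.strip('p>').strip('</')
            for tag_break in tag_breaks
            if tag_break.startswith('p>')]
```

With this version no more than one file handle is open at any time, no matter how many files are processed.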

You can remove one disk read by doing string comparisons instead of using glob.
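As a sketch of the string-comparison idea: os.walk already hands back the file names of each directory, so the .nxml files can be picked out with str.endswith instead of asking glob to list the directory a second time. The helper name nxml_paths_in is hypothetical, not something from the question.

```python
import os

def nxml_paths_in(dirpath, filenames):
    # Plain string test on names that os.walk has already read;
    # no additional directory listing is needed.
    return [os.path.join(dirpath, name)
            for name in filenames
            if name.endswith(".nxml")]
```

It would be called once per directory, for example as nxml_paths_in(dirpath, filenames) inside a for dirpath, dirnames, filenames in os.walk("articles") loop.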

And last but not least: os.walk spits out its results cleverly, one directory at a time, so don't buffer them in a list; process everything inside one loop. That will save a lot of memory.
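One possible shape for such a single loop, as a sketch only. The generator name iter_nxml_files and the root parameter are hypothetical, and the parsing step is left as a placeholder for whatever fast_parse does in the question.

```python
import os

def iter_nxml_files(root):
    # Yield one path at a time instead of building a list of all paths.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".nxml"):
                yield os.path.join(dirpath, name)

def run_files(root="articles"):
    count = 0
    for path in iter_nxml_files(root):
        with open(path, "r") as nxml:
            filetext = nxml.read()
        # ... parse filetext here, as fast_parse does in the question ...
        count += 1
    return count  # number of files processed
```

Because the paths are produced lazily, only the current directory listing and the current file's text are held in memory at any moment.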

That is all I can advise from the code alone. For more details on what is causing the I/O, you should use profiling. See https://docs.python.org/2/library/profile.html for details.
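A small sketch of how the profiler could be wrapped around one call, written for Python 3 (the linked page describes the same cProfile and pstats modules for Python 2). The helper name profile_call is hypothetical.

```python
import cProfile
import io
import pstats

def profile_call(func, *args, **kwargs):
    # Run func under the profiler and keep its return value.
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    # Render the ten entries with the highest cumulative time as text.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()
```

Calling result, report = profile_call(run_files) and printing report shows which calls the time accumulates in, for instance whether it is spent in open and read or in the string handling.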

