Windows disk usage issues with Python
I am executing the Python code below.
I am running it on a folder ("articles") that has a couple hundred subfolders and 240,226 files in all.
I am timing the execution. At first the times were pretty stable, but they went non-linear after 100,000 files. The times (I am timing at 10,000-file intervals) can also go non-linear after 30,000 files (or not).
I have Task Manager open and can correlate the slow-downs with 99% disk usage by python.exe. I have done gc.collect(), dels, etc., and turned off Windows indexing. I have restarted Windows and emptied the trash (I have a few hundred GBs free). Nothing helps; the disk usage seems to be getting more erratic, if anything.
Sorry for the long post - any help is appreciated.
    import glob
    import os

    dirs = []
    nxml_files = []

    def get_filenames():
        # collect every subdirectory name under articles/
        for (dirpath, dirnames, filenames) in os.walk("articles/"):
            dirs.extend(dirnames)
        # glob the .nxml files in each subdirectory
        for dir in dirs:
            path = "articles" + "\\" + dir
            nxml_files.extend(glob.glob(path + "/*.nxml"))
        return nxml_files

    def extract_text_from_files(nxml_files):
        for nxml_file in nxml_files:
            fast_parse(nxml_file)

    def fast_parse(infile):
        file = open(infile, "r")
        filetext = file.read()
        tag_breaks = filetext.split('><')
        paragraphs = [tag_break.strip('p>').strip('</') for tag_break in tag_breaks if tag_break.startswith('p>')]

    def run_files():
        nxml_files = get_filenames()
        extract_text_from_files(nxml_files)

    if __name__ == "__main__":
        run_files()
There are a couple of things that could be optimized.
First, you open files but never close them. Use a with open(...) as name: block and the file gets closed for you. By the way, in Python 2, file is a bad choice of variable name, since it is a built-in function's name.
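For example, a minimal sketch of fast_parse using a context manager (returning the paragraphs is my addition; the original discards them):

    def fast_parse(infile):
        # the context manager closes the file even if parsing raises
        with open(infile, "r") as f:
            filetext = f.read()
        tag_breaks = filetext.split('><')
        return [tag_break.strip('p>').strip('</')
                for tag_break in tag_breaks if tag_break.startswith('p>')]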
You can also remove one disk read per directory by doing string comparisons on the filenames that os.walk already gives you, instead of calling glob.
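A sketch of get_filenames without glob (still building the list, just to isolate this one change):

    import os

    def get_filenames():
        nxml_files = []
        # os.walk already lists every file name, so a plain suffix test
        # avoids the extra directory scan that glob() performs
        for dirpath, dirnames, filenames in os.walk("articles/"):
            for name in filenames:
                if name.endswith(".nxml"):
                    nxml_files.append(os.path.join(dirpath, name))
        return nxml_files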
And last but not least: os.walk yields its results lazily, so don't buffer them in a list; process everything inside one loop. That saves a lot of memory.
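Combining both points, run_files can consume os.walk directly and parse each file as it is found, without ever building the 240,000-entry list (a sketch, reusing the fast_parse shown above):

    import os

    def run_files():
        # process each file as os.walk yields it; no intermediate list
        for dirpath, dirnames, filenames in os.walk("articles/"):
            for name in filenames:
                if name.endswith(".nxml"):
                    paragraphs = fast_parse(os.path.join(dirpath, name))
                    # ... do something with paragraphs here ...

    if __name__ == "__main__":
        run_files()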
That is all I can advise from the code alone. For more detail on what is actually causing the I/O, you should use profiling. See https://docs.python.org/2/library/profile.html for details.
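For example, the whole run can be profiled with the standard-library cProfile module (sorting by cumulative time is just one useful view; run_files is the existing entry point from the question):

    import cProfile

    # profile the existing entry point and sort by cumulative time
    cProfile.run("run_files()", sort="cumulative")

The same can be done from the command line with python -m cProfile -s cumulative yourscript.py, where yourscript.py is a placeholder for your file name.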