Removing elements from a 300 MB XML file in Python / ElementTree
I'm trying to parse a 300 MB XML file with ElementTree, based on the advice in "Can Python xml ElementTree parse large xml file?":
    from xml.etree import ElementTree as ET

    for event, elem in ET.iterparse('C:\...path...\desc2015.xml'):
        if elem.tag == 'DescriptorRecord':
            for e in elem._children:
                if str(e.tag) in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
                    e.clear()
                    elem.remove(e)
                    print 'removed %s' % e
giving...
    removed <Element 'HistoryNote' at 0x557cc7f0>
    removed <Element 'DateCreated' at 0x557fa990>
    removed <Element 'HistoryNote' at 0x55809af0>
    removed <Element 'DateCreated' at 0x5580f5d0>
However, it just keeps going, the file isn't getting any smaller, and on inspection the elements are still there. I have tried using either e.clear() or elem.remove(e) on its own, with the same results. Regards
Update
Error output, as mentioned in my first comment on @alexanderlukanin13's answer:
    Traceback (most recent call last):
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 1570, in trace_dispatch
    Traceback (most recent call last):
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 2278, in <module>
        globals = debugger.run(setup['file'], None, None)
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 1704, in run
        pydev_imports.execfile(file, globals, locals)  # execute script
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\runfiles.py", line 234, in <module>
        main()
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\runfiles.py", line 78, in main
        return pydev_runfiles.main(configuration)  # note: still doesn't return proper value.
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 835, in main
        pydevtestrunner(configuration).run_tests()
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 762, in run_tests
        file_and_modules_and_module_name = self.find_modules_from_files(files)
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 517, in find_modules_from_files
        mod = self.__get_module_from_str(import_str, print_exception, pyfile)
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 476, in __get_module_from_str
        buf_err = pydevd_io.startredirect(keep_original_redirection=True, std='stderr')
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd_io.py", line 72, in startredirect
        import sys
    MemoryError
The main problem in your script is that you don't save the altered XML back to disk. You need to store a reference to the root element and then call ElementTree.write:
    from xml.etree import ElementTree as ET

    context = ET.iterparse('input.xml')
    root = None
    for event, elem in context:
        if elem.tag == 'DescriptorRecord':
            # Iterate over a copy of the children. Don't use _children, it's a
            # private field; getchildren() was removed in Python 3.9, so use list(elem).
            for e in list(elem):
                if e.tag in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
                    elem.remove(e)  # you need remove(), not clear()
        root = elem  # the last element reported is assumed to be the root
    with open('output.xml', 'wb') as file:
        ET.ElementTree(root).write(file, encoding='utf-8', xml_declaration=True)
Note: here I use an awkward (and unsafe) way to get the root element: I assume it's the last element in the iterparse output. If anyone knows a better way, please tell.
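One possible alternative for getting the root, sketched here rather than taken from the answer: ask iterparse for 'start' events as well, in which case the first event it reports is the start of the root element. The helper name strip_children and the sample document (including the Keep tag) are made up for illustration; the tag names to remove are the ones from the question.

```python
import io
from xml.etree import ElementTree as ET

# Tag names from the question; adjust them for your own file.
UNWANTED = {'DateCreated', 'Year', 'Month', 'TreeNumber',
            'HistoryNote', 'PreviousIndexing'}


def strip_children(source, target, record_tag='DescriptorRecord', unwanted=UNWANTED):
    """Hypothetical helper: drop unwanted children of each record, then write the tree."""
    context = ET.iterparse(source, events=('start', 'end'))
    # With 'start' events requested, the first event is the start of the root element.
    _, root = next(context)
    for event, elem in context:
        # Only act on 'end' events, when the record's children are fully parsed.
        if event == 'end' and elem.tag == record_tag:
            for child in list(elem):  # iterate over a copy while removing
                if child.tag in unwanted:
                    elem.remove(child)
    ET.ElementTree(root).write(target, encoding='utf-8', xml_declaration=True)


# Small in-memory demonstration with made-up sample data.
sample = b"""<DescriptorRecordSet>
  <DescriptorRecord>
    <Keep>kept</Keep>
    <DateCreated><Year>1999</Year></DateCreated>
    <HistoryNote>note</HistoryNote>
  </DescriptorRecord>
</DescriptorRecordSet>"""
out = io.BytesIO()
strip_children(io.BytesIO(sample), out)
result = out.getvalue().decode('utf-8')
print(result)
```

For a real file you would pass file names (or files opened in binary mode) instead of the BytesIO objects. Note that this still builds the whole tree in memory before writing, just like the code above, so it does not by itself reduce memory use.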