Removing Elements From 300MB XML In Python / ElementTree


I'm trying to parse a 300 MB XML file with ElementTree, based on the advice in "Can Python xml ElementTree parse a large XML file?"

    from xml.etree import ElementTree as ET

    for event, elem in ET.iterparse('c:\...path...\desc2015.xml'):
        if elem.tag == 'DescriptorRecord':
            for e in elem._children:
                if str(e.tag) in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
                    e.clear()
                    elem.remove(e)
                    print 'removed %s' % e

giving:

    removed <Element 'HistoryNote' at 0x557cc7f0>
    removed <Element 'DateCreated' at 0x557fa990>
    removed <Element 'HistoryNote' at 0x55809af0>
    removed <Element 'DateCreated' at 0x5580f5d0>

However, it keeps going, the file isn't getting any smaller, and on inspection the elements are still there. I have tried using either e.clear() or elem.remove(e) on its own, with the same results. Regards

Update

Error output, as mentioned in my first comment on @alexanderlukanin13's answer:

    Traceback (most recent call last):
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 1570, in trace_dispatch
    Traceback (most recent call last):
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 2278, in <module>
        globals = debugger.run(setup['file'], None, None)
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 1704, in run
        pydev_imports.execfile(file, globals, locals)  # execute script
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\runfiles.py", line 234, in <module>
        main()
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\runfiles.py", line 78, in main
        return pydev_runfiles.main(configuration)  # note: still doesn't return proper value.
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 835, in main
        pydevtestrunner(configuration).run_tests()
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 762, in run_tests
        file_and_modules_and_module_name = self.find_modules_from_files(files)
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 517, in find_modules_from_files
        mod = self.__get_module_from_str(import_str, print_exception, pyfile)
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 476, in __get_module_from_str
        buf_err = pydevd_io.startredirect(keep_original_redirection=True, std='stderr')
      File "c:\users\eddie\downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd_io.py", line 72, in startredirect
        import sys
    MemoryError

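The main problem is that your script doesn't save the altered XML to disk. iterparse only builds and modifies the tree in memory, so you need to store a reference to the root element and then call ElementTree.write: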

    from xml.etree import ElementTree as ET

    context = ET.iterparse('input.xml')
    root = None
    for event, elem in context:
        if elem.tag == 'DescriptorRecord':
            # Iterate over a copy of the child list. Don't use _children, it's a
            # private field; getchildren() was removed in Python 3.9.
            for e in list(elem):
                if e.tag in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
                    elem.remove(e)  # you need remove(), not clear()
        root = elem  # the last 'end' event belongs to the root element

    with open('output.xml', 'wb') as file:
        ET.ElementTree(root).write(file, encoding='utf-8', xml_declaration=True)
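The copy made with list() matters because removing children from an element while iterating over that same element can skip siblings. The following is a minimal sketch of the effect; the sample document and the helper names (remove_while_iterating, remove_from_copy, child_tags) are made up for illustration and are not part of the original script.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two unwanted siblings next to each other.
SAMPLE = '<Record><DateCreated/><HistoryNote/><Name/></Record>'
UNWANTED = {'DateCreated', 'HistoryNote'}


def remove_while_iterating(parent, unwanted):
    """Remove children while iterating over the live element (fragile)."""
    for child in parent:
        if child.tag in unwanted:
            parent.remove(child)


def remove_from_copy(parent, unwanted):
    """Remove children while iterating over a snapshot of the child list."""
    for child in list(parent):
        if child.tag in unwanted:
            parent.remove(child)


def child_tags(parent):
    """Return the tags of the direct children, in document order."""
    return [child.tag for child in parent]


live = ET.fromstring(SAMPLE)
remove_while_iterating(live, UNWANTED)

copied = ET.fromstring(SAMPLE)
remove_from_copy(copied, UNWANTED)

print(child_tags(live))    # may still contain an unwanted tag
print(child_tags(copied))  # only the wanted children remain
```

With adjacent unwanted siblings, the first variant is likely to leave one of them behind, because each removal shifts the remaining children down by one position.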

Note: here I use an awkward (and unsafe) way to get the root element: I assume it's the last element in the iterparse output. If anybody knows a better way, please tell.
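One possible alternative, offered as a sketch rather than a definitive fix: ask iterparse for 'start' events as well, and take the element from the very first event, which is the start of the root element. The function name strip_records, its parameters, and the in-memory sample document below are assumptions for illustration; the tag names are those from the code above, in the casing the real file is assumed to use.

```python
import io
import xml.etree.ElementTree as ET

UNWANTED = {'DateCreated', 'Year', 'Month', 'TreeNumber',
            'HistoryNote', 'PreviousIndexing'}


def strip_records(source, target, record_tag='DescriptorRecord',
                  unwanted=UNWANTED):
    """Remove unwanted direct children of each record and write the result.

    source and target may be file names or binary file objects.
    """
    context = ET.iterparse(source, events=('start', 'end'))
    # The first event is the 'start' of the root element, so the root is
    # known up front instead of being inferred from the last event.
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag == record_tag:
            for child in list(elem):  # iterate over a copy while removing
                if child.tag in unwanted:
                    elem.remove(child)
    ET.ElementTree(root).write(target, encoding='utf-8', xml_declaration=True)
    return root


# Tiny made-up document standing in for the real 300 MB file.
sample = io.BytesIO(
    b'<Set><DescriptorRecord>'
    b'<DateCreated><Year>2015</Year></DateCreated>'
    b'<Name>x</Name><HistoryNote>h</HistoryNote>'
    b'</DescriptorRecord></Set>'
)
result = io.BytesIO()
strip_records(sample, result)
print(result.getvalue().decode('utf-8'))
```

This still keeps the whole tree in memory until it is written, so it addresses the root lookup only, not the MemoryError shown in the update.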

