java - \b in Facebook message and XML serialization -


I'm coding a crawler that retrieves Facebook posts and serializes them to XML.

My problem is the following: I've found messages that contain special characters (such as \b), which end up in the serialized XML as character references.

If I try to open that XML with the Java DOM parser, I get an error because the parser is not capable of parsing that character.

How can I solve it?

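Data examples: http://pastebin.com/3xek5qbv

The error given by the parser when I load the resulting XML is the following (the message is in Spanish and is cut off; "la referencia de caracteres" means "the character reference"):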

    [Fatal Error] out.xml:7:59: la referencia de caracteres "&#
    org.xml.sax.SAXParseException; systemId: file:/z:/programas/workspace%20eclipse/workspace/test/out.xml; lineNumber: 7; columnNumber: 59; la referencia de caracteres "&#
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
        at test.loadbadxml(test.java:43)
        at test.<init>(test.java:32)
        at test.main(test.java:139)
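The failure can probably be reproduced without Facebook at all. The sketch below assumes that the truncated reference in the error is the one for \b, i.e. &#8;, which the message above does not show in full. The class name BackspaceReferenceDemo and the helper parses are made up for this illustration. It feeds two tiny documents to the same JAXP DOM parser: one with plain text, and one containing a character reference to U+0008, which XML 1.0 does not allow.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

final class BackspaceReferenceDemo {

    /** Returns true if the XML string parses, false if the parser rejects it. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXParseException e) {
            // Same exception type as in the stack trace above.
            return false;
        } catch (Exception e) {
            // Anything else (I/O, parser configuration) is unexpected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Plain text content is accepted.
        System.out.println(parses("<post_message>ok</post_message>"));      // true
        // A reference to U+0008 (backspace) is rejected as malformed.
        System.out.println(parses("<post_message>bad&#8;</post_message>")); // false
    }
}
```

If this matches what happens with out.xml, the problem is in what gets written, not in the loading code.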

About the source code, I have 3 related pieces of code:

First one: obtaining the "malformed" data (with \b) from the Facebook JSON:

    // post is an object that contains a "post"
    // URL_BASE_GRAPH and TOKEN are constants that contain the strings
    // necessary to create the URL for the Facebook Graph API
    // idPost is the ID of the post I'm retrieving
    String urlStr = URL_BASE_GRAPH + idPost + "?access_token=" + TOKEN;
    URL url = new URL(urlStr);
    ObjectMapper om = new ObjectMapper();
    JsonNode root = om.readValue(url.openStream(), JsonNode.class);
    ...
    JsonNode message = root.get("message");
    if (message != null) {
        post.setMessage(message.asText());
    }

Second one: writing the data to XML:

    // outFile is the file to be written
    File file = new File(outFile);
    DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docFactory.newDocumentBuilder();

    // root elements
    Document doc = docBuilder.newDocument();
    Element rootElement = doc.createElement("groups");
    doc.appendChild(rootElement);

    ....

    if (post.getMessage() != null) {
        Element messagePost = doc.createElement("post_message");
        // I've tried this: messagePost.appendChild(doc.createTextNode(StringEscapeUtils.escapeXml(post.getMessage())));
        messagePost.appendChild(doc.createTextNode(post.getMessage()));
        postEl.appendChild(messagePost);
    }

    ....

    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer transformer = transformerFactory.newTransformer();
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
    DOMSource source = new DOMSource(doc);
    StreamResult result = new StreamResult(file);
    transformer.transform(source, result);
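One possible way around this at write time is to drop the characters that XML 1.0 does not allow before the text node is created. This is only a sketch and not something from the code above: the class XmlTextSanitizer and its methods are hypothetical names, and silently removing characters is an assumption about what is acceptable for this data. The allowed ranges are those of the Char production in the XML 1.0 specification: #x9, #xA, #xD, #x20 to #xD7FF, #xE000 to #xFFFD, and #x10000 to #x10FFFF.

```java
final class XmlTextSanitizer {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point outside the XML 1.0 Char range removed. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Work on code points so surrogate pairs are kept together.
            int cp = text.codePointAt(i);
            if (isValidXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \b is U+0008, which falls outside the allowed ranges.
        System.out.println(stripInvalidXmlChars("hello\bworld")); // helloworld
    }
}
```

In the writing code this would presumably be used as doc.createTextNode(XmlTextSanitizer.stripInvalidXmlChars(post.getMessage())). If losing the character matters, replacing it with a space or a visible placeholder instead of dropping it would be a small change to the loop.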

Third one: loading the (malformed) XML again:

    File fXmlFile = new File(f);
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(fXmlFile);
    doc.getDocumentElement().normalize();
    ....
    Node pstNode = postNode.item(j);
    if (pstNode.getNodeType() == Node.ELEMENT_NODE) {
        Element pstElement = (Element) pstNode;
        String pstMessage = null;
        if (pstElement.getElementsByTagName("post_message").item(0) != null)
            pstMessage = pstElement.getElementsByTagName("post_message").item(0).getTextContent();
    }

Any thoughts?

Thanks!

Scraping Facebook is against its automated data collection terms. Besides that, there's an API for that.

