Java - \b in Facebook message and XML serialization
I'm coding a crawler that retrieves Facebook posts and serializes them to XML.
My problem is the following: I've found messages with special characters (such as \b), and when such a message is written to XML the character is serialized as a character reference.
If I try to open this XML with the Java DOM parser, I obtain an error because it is not capable of parsing this character.
How can I solve it?
Data examples: http://pastebin.com/3xek5qbv
The error given by the parser when I load the resulting XML is the following (the message is in Spanish: "la referencia de caracteres" means "the character reference", and the text is cut off after "&#"):
    [Fatal Error] out.xml:7:59: la referencia de caracteres "&#
    org.xml.sax.SAXParseException; systemId: file:/z:/programas/workspace%20eclipse/workspace/test/out.xml; lineNumber: 7; columnNumber: 59; la referencia de caracteres "&#
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
        at test.loadbadxml(test.java:43)
        at test.(test.java:32)
        at test.main(test.java:139)
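The failure can probably be reproduced without Facebook at all. The sketch below assumes the truncated reference in the error is `&#8;` (the numeric form of \b, U+0008), which is not confirmed by the cut-off message; `BadCharRefDemo` and `parses` are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical helper: reports whether the JAXP DOM parser accepts a document.
final class BadCharRefDemo {
    private BadCharRefDemo() {}

    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // SAXParseException (a SAXException) is thrown for documents that are not well-formed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Under that assumption, `BadCharRefDemo.parses("<post_message>a&#8;b</post_message>")` should return false, because XML 1.0 does not allow U+0008 even when written as a character reference, while the same document without the reference parses normally.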
About the source code, I have 3 related pieces of code:
First one: obtaining the "malformed (with \b)" data from the Facebook JSON:
    // The post object contains the "post"
    // URL_BASE_GRAPH and TOKEN are constants containing the strings needed to build the Facebook Graph API URL
    // idPost is the id of the post I'm retrieving
    String urlStr = URL_BASE_GRAPH + idPost + "?access_token=" + TOKEN;
    URL url = new URL(urlStr);
    ObjectMapper om = new ObjectMapper();
    JsonNode root = om.readValue(url.openStream(), JsonNode.class);
    ...
    JsonNode message = root.get("message");
    if (message != null) {
        post.setMessage(message.asText());
    }
Second one: writing the data to XML:
    // outFile is the file to be written
    File file = new File(outFile);
    DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docFactory.newDocumentBuilder();

    // root elements
    Document doc = docBuilder.newDocument();
    Element rootElement = doc.createElement("groups");
    doc.appendChild(rootElement);
    ....
    if (post.getMessage() != null) {
        Element messagePost = doc.createElement("post_message");
        // I've tried this: messagePost.appendChild(doc.createTextNode(StringEscapeUtils.escapeXml(post.getMessage())));
        messagePost.appendChild(doc.createTextNode(post.getMessage()));
        postEl.appendChild(messagePost);
    }
    ....
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer transformer = transformerFactory.newTransformer();
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
    DOMSource source = new DOMSource(doc);
    StreamResult result = new StreamResult(file);
    transformer.transform(source, result);
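One possible workaround, sketched here under the assumption that simply dropping the offending characters is acceptable for this data: filter the message down to the characters XML 1.0 permits before handing it to `createTextNode`. Escaping does not help, because U+0008 is not a legal XML 1.0 character in any form. `XmlSanitizer` and `stripInvalidXmlChars` are hypothetical names, not part of any library.

```java
// Hypothetical helper: removes code points that XML 1.0 does not allow.
final class XmlSanitizer {
    private XmlSanitizer() {}

    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String stripInvalidXmlChars(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length());
        // Iterate by code point so surrogate pairs (e.g. emoji) are kept intact
        // and unpaired surrogates are dropped.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With such a helper, the text node would be created as `doc.createTextNode(XmlSanitizer.stripInvalidXmlChars(post.getMessage()))`. If the control characters must survive a round trip, stripping is not enough and some other encoding of the message (for example Base64) would be needed instead.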
Third one: loading the malformed XML again:
    File fXmlFile = new File(f);
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(fXmlFile);
    doc.getDocumentElement().normalize();
    ....
    Node pstNode = postNode.item(j);
    if (pstNode.getNodeType() == Node.ELEMENT_NODE) {
        Element pstElement = (Element) pstNode;
        String pstMessage = null;
        if (pstElement.getElementsByTagName("post_message").item(0) != null)
            pstMessage = pstElement.getElementsByTagName("post_message").item(0).getTextContent();
Any thoughts?
Thanks!
Scraping Facebook is against its Automated Data Collection Terms. Besides that, there's an API for that.