Lucene: efficient way to get offsets of a set of terms in documents
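Suppose I have indexed a set of documents, and I am given a set of terms that are known to have been generated by the indexing process. I would like to get every occurrence of each of these terms, i.e., which document it appears in and at which offsets. I have done this using two PostingsEnum objects per term: one that lets me iterate through the set of documents the term appears in, and then, within each document, one obtained from the document's term vector, which contains the offset information of the term in that document.
But this is not efficient, because there is a loop inside a loop (for each term, for each matching document, a term vector lookup), and it can get quite slow. The code is below. Any suggestions on whether this can be done in a better way?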
The field schema:
<field name="terms" type="token_ngram" indexed="true" stored="false" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>
Code:
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

IndexReader indexReader = ...; // init index reader
Set<String> termSet = ....; // set containing e.g. 10000 terms

for (String term : termSet) {
    // Get a PostingsEnum used to iterate through the docs containing the term.
    // This "postings" does not have valid offset information (see comment below).
    PostingsEnum postings =
        MultiFields.getTermDocsEnum(indexReader, "terms", new BytesRef(term.getBytes()));

    /* I tried:
     *   PostingsEnum postings =
     *       MultiFields.getTermDocsEnum(indexReader, "terms",
     *           new BytesRef(term.getBytes()), PostingsEnum.OFFSETS);
     * but the resulting "postings" object does not contain valid offset info (always -1).
     */

    // Now go through each document.
    int docId = postings.nextDoc();
    while (docId != PostingsEnum.NO_MORE_DOCS) {
        // Get the term vector for that document.
        TermsEnum it = indexReader.getTermVector(docId, ngramInfoFieldname).iterator();
        // Find the term of interest.
        it.seekExact(new BytesRef(term.getBytes()));
        // Get its posting info. This one does contain offset info.
        PostingsEnum postingsInDoc = it.postings(null, PostingsEnum.OFFSETS);

        // From line A to line B below: if I replace "postingsInDoc" with "postings",
        // the methods "postings.startOffset()" and "postings.endOffset()" always return -1.
        postingsInDoc.nextDoc(); // line A
        int totalFreq = postingsInDoc.freq();
        for (int i = 0; i < totalFreq; i++) {
            postingsInDoc.nextPosition();
            System.out.println(postingsInDoc.startOffset() + ", " + postingsInDoc.endOffset());
        } // line B

        docId = postings.nextDoc();
    }
}
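One detail that may be related to the -1 offsets, noted here as an assumption that has not been verified against this index: in Solr, termOffsets="true" only stores offsets in the term vectors, while offsets in the postings lists are controlled by a separate field property, storeOffsetsWithPositions. A sketch of the same field with that property added (the index would need to be rebuilt for it to take effect):

```xml
<!-- Assumption: same field as above, with offsets also written to the postings lists -->
<field name="terms" type="token_ngram" indexed="true" stored="false" multiValued="false"
       termVectors="true" termPositions="true" termOffsets="true"
       storeOffsetsWithPositions="true"/>
```

If that applies here, requesting PostingsEnum.OFFSETS on the outer "postings" might return valid offsets directly, which would make the per-document term vector lookup unnecessary.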