Lucene: efficient way to get offsets of a set of terms in documents

February 15, 2012

Suppose I have indexed a set of documents, and I am given a set of terms that are known to have been generated by the indexing process. I would like to get the occurrences of each of these terms, i.e., the document and the offsets. I have done this using, for each term, one PostingsEnum that lets me iterate through the set of documents the term appears in; then, within each document, a second PostingsEnum obtained from the document's term vector, which contains the offset information of the term in that document.

But this is not very efficient, because there is a loop inside a loop and it can get quite slow. The code is below. Any suggestions on whether this can be done in a better way?

The field schema:

<field name="terms" type="token_ngram" indexed="true" stored="false" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>

Code:

    IndexReader indexReader = ...; // init index reader
    Set<String> termSet = ...;     // set containing e.g. 10000 terms

    for (String term : termSet) {
        // Get a PostingsEnum used to iterate through the docs containing the term.
        // This "postings" does not have valid offset information (see comment below).
        PostingsEnum postings =
                MultiFields.getTermDocsEnum(indexReader, "terms", new BytesRef(term.getBytes()));
        /* I also tried:
         * PostingsEnum postings =
         *       MultiFields.getTermDocsEnum(indexReader, "terms", new BytesRef(term.getBytes()), PostingsEnum.OFFSETS);
         * but the resulting "postings" object does not contain valid offset info (always -1).
         */

        // Now go through each document.
        int docId = postings.nextDoc();
        while (docId != PostingsEnum.NO_MORE_DOCS) {
            // Get the term vector for that document.
            TermsEnum it = indexReader.getTermVector(docId, ngramInfoFieldName).iterator();
            // Find the term of interest.
            it.seekExact(new BytesRef(term.getBytes()));
            // Get its posting info. This one does contain offset info.
            PostingsEnum postingsInDoc = it.postings(null, PostingsEnum.OFFSETS);

            // From line A to line B below: if I replace "postingsInDoc" with "postings",
            // then "postings.startOffset()" and "endOffset()" always return -1.
            postingsInDoc.nextDoc(); // line A
            int totalFreq = postingsInDoc.freq();
            for (int i = 0; i < totalFreq; i++) {
                postingsInDoc.nextPosition();
                System.out.println(postingsInDoc.startOffset() + ", " + postingsInDoc.endOffset());
            } // line B

            docId = postings.nextDoc();
        }
    }
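To make the cost of this access pattern concrete, here is a toy model that uses plain Java collections instead of Lucene. Everything in it (the class `OffsetLookupSketch`, its maps, and the `loadVector` counter) is a hypothetical stand-in, not a Lucene API, and it assumes that loading a document's term vector is the expensive step. It compares the per-term loop from the code above with one possible rearrangement: collect the wanted terms per document first, then load each document's vector only once. Whether this helps on a real index has not been measured here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class OffsetLookupSketch {
    // Toy stand-in for the postings: term -> ids of the docs containing it.
    final Map<String, List<Integer>> postings = new HashMap<>();
    // Toy stand-in for term vectors: docId -> (term -> {start, end} offset pairs).
    final Map<Integer, Map<String, List<int[]>>> vectors = new HashMap<>();
    // Counts simulated term-vector loads, the step assumed to be expensive.
    int vectorLoads = 0;

    // Records one occurrence of a term in a document.
    void add(int docId, String term, int start, int end) {
        List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
        if (!docs.contains(docId)) {
            docs.add(docId);
        }
        vectors.computeIfAbsent(docId, d -> new HashMap<>())
               .computeIfAbsent(term, t -> new ArrayList<>())
               .add(new int[] {start, end});
    }

    Map<String, List<int[]>> loadVector(int docId) {
        vectorLoads++;
        return vectors.get(docId);
    }

    // Pattern from the question: one vector load per (term, doc) pair.
    int countOffsetsPerTerm(Set<String> terms) {
        int found = 0;
        for (String term : terms) {
            for (int docId : postings.getOrDefault(term, Collections.<Integer>emptyList())) {
                found += loadVector(docId).get(term).size();
            }
        }
        return found;
    }

    // Rearranged: group the wanted terms by doc, then load each vector once.
    int countOffsetsPerDoc(Set<String> terms) {
        Map<Integer, List<String>> termsByDoc = new TreeMap<>();
        for (String term : terms) {
            for (int docId : postings.getOrDefault(term, Collections.<Integer>emptyList())) {
                termsByDoc.computeIfAbsent(docId, d -> new ArrayList<>()).add(term);
            }
        }
        int found = 0;
        for (Map.Entry<Integer, List<String>> entry : termsByDoc.entrySet()) {
            Map<String, List<int[]>> vector = loadVector(entry.getKey());
            for (String term : entry.getValue()) {
                found += vector.get(term).size();
            }
        }
        return found;
    }

    public static void main(String[] args) {
        OffsetLookupSketch idx = new OffsetLookupSketch();
        idx.add(0, "ab", 0, 2);
        idx.add(0, "bc", 1, 3);
        idx.add(1, "ab", 4, 6);
        idx.add(1, "ab", 9, 11);
        Set<String> terms = new HashSet<>(Arrays.asList("ab", "bc"));

        int perTerm = idx.countOffsetsPerTerm(terms);
        int perTermLoads = idx.vectorLoads;
        idx.vectorLoads = 0;
        int perDoc = idx.countOffsetsPerDoc(terms);
        int perDocLoads = idx.vectorLoads;

        System.out.println(perTerm + " offsets, " + perTermLoads + " vector loads (per term)");
        System.out.println(perDoc + " offsets, " + perDocLoads + " vector loads (per doc)");
    }
}
```

On this toy data both variants find the same 4 offsets, but the per-term loop loads a vector 3 times while the per-doc grouping loads one only 2 times. The gap grows with the number of query terms that share a document, at the price of holding the doc-to-terms map in memory.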
