python - Ways to search and tag a MongoDB database of academic papers
Apologies for the vague nature of this question. I'm not quite sure where to start, so I thought I'd ask here for guidance.
As an exercise, I've downloaded several academic papers and stored them as plain text in a MongoDB database.
I'd like to write a search feature (using Python, R, whatever) so that when I enter some text, it returns the most relevant articles. Clearly, "relevant" is hard -- that's what Google got right.
However, I'm not looking for perfect, just something that works. A few thoughts I had were:
1) Simple MongoDB full-text search
2) Implement Lucene search
3) Tag them (unsure how, though) and return them sorted by number of tags?
Is there a solution someone has used that's out of the box and works well? I can optimize the search feature later -- for now I just want all the pieces to move together...
Thanks!
Is there a solution someone has used that's out of the box and works well?
It depends on how you define "well", but in simple terms, I'd say no. There is no single, accurate definition of "fairly well". A lot of challenges intrinsic to this particular problem arise when one is trying to implement a good search algorithm. These challenges lie in:
- diversity of user needs: users in different fields have different intentions and, as a result, different expectations of the search result page;
- diversity of natural languages, if you are trying to implement multi-language search (German has a lot of noun compounds, Russian has enormous inflection variability, etc.).
There are algorithms that are proven to work better than others, though, and you can start from those. TF*IDF and BM25 are two popular ones.
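To make the idea of a ranking function concrete, here is a minimal sketch of BM25 scoring in plain Python. It assumes a naive regex tokenizer, the commonly used default parameters k1=1.5 and b=0.75, and a smoothed IDF variant; the function names `tokenize` and `bm25_rank` are made up for this illustration and do not come from any library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (deliberately naive)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return (doc_index, score) pairs sorted by descending BM25 score."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = tokenize(query)

    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF: the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    papers = [
        "Annealing of steel alloys under oxygen",
        "Deep learning for protein folding",
        "Oxygen transport in annealed steel",
    ]
    for doc_id, score in bm25_rank("oxygen steel", papers):
        print(doc_id, round(score, 3))
```

Documents that share no term with the query get a score of 0.0 and sort last. Real engines such as Lucene apply the same kind of formula on top of an inverted index instead of scanning every document.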
I can optimize the search feature later -- for now I just want all the pieces to move together...
MongoDB or RDBMS full-text indexing support is enough for a proof-of-concept, but if you need to optimize search performance, you will need an inverted index (Solr/Lucene). With Solr/Lucene you get the ability to manage:
- how words are stemmed (this is important for solving understemming/overstemming problems);
- what a word is: is "supercomputer" one word? What about "stackoverflow" or "outofboundsexception"?
- synonyms and word expansion (should "o2" be found for an "oxygen" query?);
- how the search is performed: which words can be ignored during the search, which ones are required to be found, and which ones are required to be found near each other (think of a search phrase such as "not annealed" or "without expansion").
This is what comes to mind first.
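As a rough illustration of the points in the list above, here is a toy positional inverted index in Python. It is only a sketch under simplifying assumptions: the stopword set and synonym map are tiny and invented for the example, stemming is left out entirely, and all names (`analyze`, `build_index`, `search_all`, `search_phrase`) are hypothetical. Lucene and Solr do all of this in a far more configurable and efficient way.

```python
import re
from collections import defaultdict

# Illustrative only. "not" and "without" are deliberately NOT stopwords,
# because they change the meaning of phrases such as "not annealed".
STOPWORDS = {"a", "an", "the", "of", "in", "for", "and"}
SYNONYMS = {"o2": "oxygen"}  # map variants onto one canonical term


def analyze(text):
    """Tokenize, drop stopwords, and normalize synonyms (no stemming here)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [SYNONYMS.get(t, t) for t in tokens if t not in STOPWORDS]


def build_index(docs):
    """Map term -> {doc_id: [positions]} so phrase queries are possible."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(analyze(text)):
            index[term][doc_id].append(pos)
    return index


def search_all(index, query):
    """Return ids of documents that contain every query term."""
    terms = analyze(query)
    if not terms:
        return set()
    postings = [set(index[t]) if t in index else set() for t in terms]
    return set.intersection(*postings)


def search_phrase(index, phrase):
    """Return ids of documents where the terms occur next to each other, in order."""
    terms = analyze(phrase)
    hits = set()
    for doc_id in search_all(index, phrase):
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[t][doc_id] for k, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


if __name__ == "__main__":
    docs = [
        "The sample was not annealed before testing",
        "Annealed steel was not tested",
        "O2 levels in the chamber",
    ]
    idx = build_index(docs)
    print(search_all(idx, "annealed not"))
    print(search_phrase(idx, "not annealed"))
    print(search_all(idx, "oxygen"))
```

Positions are counted after stopword removal, so a phrase match here means "adjacent once ignored words are dropped", which is one of the design decisions a real analyzer lets you control.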
So if you are planning to work these things out, I would recommend Lucene as a framework, or Solr/Elasticsearch as a search system if you need to build a proof-of-concept fast. If not, MongoDB or an RDBMS will work well.
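For the MongoDB route, a proof-of-concept could look roughly like the sketch below. It assumes the papers are stored with their body in a field called `text` and that `collection` is a PyMongo `Collection`; the field name, the database and collection names in the usage comment, and the helper names `ensure_text_index` and `search_papers` are all assumptions made for this example.

```python
def ensure_text_index(collection, field="text"):
    """Create a MongoDB text index on the given field (a no-op if it already exists)."""
    # "text" is the index type MongoDB uses for full-text search.
    return collection.create_index([(field, "text")])


def search_papers(collection, query, limit=10):
    """Return the best-matching documents, ordered by MongoDB's text score."""
    cursor = (
        collection.find(
            {"$text": {"$search": query}},
            {"score": {"$meta": "textScore"}},  # expose the relevance score
        )
        .sort([("score", {"$meta": "textScore"})])
        .limit(limit)
    )
    return list(cursor)


# Usage sketch (needs a running MongoDB server; names below are hypothetical):
#
#   from pymongo import MongoClient
#   papers = MongoClient()["library"]["papers"]
#   ensure_text_index(papers)
#   for doc in search_papers(papers, "oxygen annealing"):
#       print(doc["score"], doc["_id"])
```

MongoDB's text index handles stopwords and basic stemming for the supported languages, which is usually enough to get all the pieces moving before deciding whether a dedicated search engine is needed.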