python - Ways to search and tag a MongoDB database of academic papers


Apologies for the vague nature of this question. I'm not quite sure where to start, so I thought I'd ask here for guidance.

As an exercise, I've downloaded several academic papers and stored them as plain text in a MongoDB database.

I'd like to write a search feature (using Python, R, whatever) so that when I enter some text, it returns the most relevant articles. Clearly, relevance is hard; that's what Google got right.

However, I'm not looking for perfect, just something. A few thoughts I had were:

1) Simple MongoDB full-text search

2) Implement Lucene search

3) Tag them (unsure how, though) and return them sorted by number of matching tags?

Is there a solution that someone has used that's out of the box and works fairly well? I can always optimize the search feature later; for now I just want all the pieces to move together...

Thanks!

Is there a solution that someone has used that's out of the box and works fairly well?

It depends on how you define "well", but in simple terms, I'd say no. There is no single, accurate definition of "fairly well". A lot of challenges intrinsic to the particular problem arise when one tries to implement a good search algorithm. These challenges lie in:

  • diversity of user needs. Users in different fields have different intentions and, as a result, different expectations of the search result page;
  • diversity of natural languages, if you are trying to implement multi-language search (German has a lot of noun compounds, Russian has enormous inflectional variability, etc.);

There are algorithms that are proven to work better than others, though, so they are a good place to start from. TF*IDF and BM25 are the two most popular.
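To make the BM25 idea concrete, here is a minimal sketch in plain Python. The tokenizer, the sample titles, and the parameter defaults (`k1=1.5`, `b=0.75`) are assumptions chosen for illustration, and the IDF uses the common "plus one inside the log" variant so that scores stay non-negative. It is not how any particular search engine implements ranking.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a deliberately naive analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return (doc_index, score) pairs sorted from most to least relevant."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))

    scores = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is normalized by length via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical paper titles standing in for the stored documents.
papers = [
    "Annealing schedules for steel alloys",
    "Oxygen diffusion in annealed steel",
    "A survey of inverted index compression",
]
ranking = bm25_rank("annealed steel", papers)
```

Here the second title ranks first because it matches both query terms, and the third scores zero because it matches none. Note that "annealing" does not match "annealed" with this tokenizer, which is exactly the kind of stemming decision a search engine lets you control.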

I can always optimize the search feature later; for now I just want all the pieces to move together...

MongoDB, or any RDBMS with full-text indexing support, is good enough for a proof of concept, but if you need to optimize search performance, you will need an inverted index (Solr/Lucene). With Solr/Lucene you get the ability to manage:

  • how words are stemmed (this is important for solving understemming/overstemming problems);
  • what a word is. Is "supercomputer" one word? What about "stackoverflow" or "outofboundsexception"?
  • synonyms and word expansion (should "o2" be found by an "oxygen" query?)
  • how the search is performed: which words can be ignored during the search, which ones are required to be found, and which ones are required to be found near each other (think of a search phrase such as "not annealed" or "without expansion").

This is just what comes to mind first.
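The analysis choices listed above can be illustrated with a toy positional inverted index. Everything in this sketch is an assumption made for illustration: the synonym map, the stopword set, the crude suffix-stripping `stem` function, and the sample sentences. Lucene and Solr expose far richer, configurable versions of each step.

```python
import re
from collections import defaultdict

# Hypothetical analysis settings; a real deployment would tune these.
SYNONYMS = {"o2": "oxygen"}
STOPWORDS = {"the", "of", "in", "a"}  # "not" is kept on purpose


def stem(token):
    """Crude suffix stripping, illustration only (prone to over/understemming)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, drop stopwords, map synonyms, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(SYNONYMS.get(t, t)) for t in tokens if t not in STOPWORDS]


def build_index(docs):
    """Map term -> doc_id -> list of positions in the analyzed token stream."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(analyze(text)):
            index[term][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return ids of documents containing the analyzed terms at adjacent positions."""
    terms = analyze(phrase)
    if not terms or any(t not in index for t in terms):
        return set()
    candidates = set.intersection(*(set(index[t]) for t in terms))
    hits = set()
    for doc_id in candidates:
        starts = index[terms[0]][doc_id]
        if any(
            all(p + i in index[t][doc_id] for i, t in enumerate(terms))
            for p in starts
        ):
            hits.add(doc_id)
    return hits


docs = [
    "The sample was not annealed before testing",
    "Annealed samples did not show O2 uptake",
    "Oxygen expansion in annealing furnaces",
]
index = build_index(docs)
not_annealed = phrase_search(index, "not annealed")
oxygen = phrase_search(index, "oxygen")
```

With these settings, `not_annealed` contains only the first document, because the second contains both words but not next to each other, and `oxygen` matches the second document through the "o2" synonym. Had "not" been treated as a stopword, the phrase query would have collapsed to "annealed" and lost its meaning.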

So if you are planning to work these things out, I'd recommend Lucene as a framework, or Solr/Elasticsearch as a search system if you need to build a proof of concept fast. If you are not, MongoDB or an RDBMS will work well.
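For the MongoDB route, a proof of concept could look roughly like the sketch below. It assumes the papers live in a collection whose documents have `title` and `body` fields (hypothetical names), and that `collection` is a pymongo collection object. The query uses MongoDB's `$text` operator and sorts by the `textScore` metadata, which requires a text index on the searched field.

```python
def ensure_text_index(collection, field="body"):
    """Create a text index on `field` ("text" is the index type MongoDB expects)."""
    return collection.create_index([(field, "text")])


def text_search_args(terms):
    """Build the filter, projection, and sort spec for a relevance-ranked search."""
    query = {"$text": {"$search": terms}}
    projection = {"title": 1, "score": {"$meta": "textScore"}}
    sort = [("score", {"$meta": "textScore"})]
    return query, projection, sort


def search_papers(collection, terms, limit=10):
    """Return the `limit` best matches, highest text score first."""
    query, projection, sort = text_search_args(terms)
    return list(collection.find(query, projection).sort(sort).limit(limit))


# Usage against a running server (database and collection names are assumptions):
#
#   from pymongo import MongoClient
#   papers = MongoClient()["library"]["papers"]
#   ensure_text_index(papers)
#   for hit in search_papers(papers, "annealed steel"):
#       print(hit["score"], hit["title"])
```

This keeps all ranking decisions inside MongoDB, which is convenient for getting the pieces moving but gives little control over stemming, synonyms, or phrase handling.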

