python - Fastest way to match substring from large dict
I have a string (usually fewer than 300 symbols long), e.g. 'aabbccdcabcbbacdaaa'.
There is a Python dictionary whose keys are strings in a similar format, e.g. 'bcccd'; key length varies from 10 to 100 symbols. The dictionary has half a million items.
I need to match my initial string to a dictionary value, or find out that there are no proper values in the dictionary. Matching condition: the dictionary key should occur somewhere within the string (strict matching).
What is the best way, in terms of computational speed, to do it? I feel there should be some tricky way to hash my initial string and the dictionary keys so that clever substring-search methods (like Rabin-Karp or Knuth-Morris-Pratt) can be applied. Or is a suffix-tree-like structure a good solution?
def search(string, dict_search):
    # If these 2 lines are too expensive, calculate them once and pass them in as arguments
    max_key = max(len(x) for x in dict_search)
    min_key = min(len(x) for x in dict_search)
    # Try every substring whose length lies between the shortest and longest key
    return set(
        string[x:x+i]
        for i in range(min_key, max_key+1)
        for x in range(len(string)-i+1)
        if string[x:x+i] in dict_search
    )

Running:
>>> search('aabbccdcabcbbacdaaa', {'aaa', 'acd', 'adb', 'bccd', 'cbbb', 'abc'})
{'aaa', 'abc', 'acd', 'bccd'}
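The comment in the function suggests computing the key lengths once and passing them in. A minimal sketch of that idea follows, assuming the lookup table is a real dict and that the associated values are wanted rather than just the matched keys. The names key_lengths, search_values and lengths are made up for illustration and are not part of the original answer. It also only tries lengths that actually occur among the keys, which may save work when the key lengths are sparse.

```python
def key_lengths(dict_search):
    """Return the sorted distinct key lengths; compute this once per dictionary."""
    return sorted({len(key) for key in dict_search})


def search_values(string, dict_search, lengths=None):
    """Map every key that occurs inside `string` to its dictionary value."""
    if lengths is None:
        # Fall back to computing the lengths on every call (the slow path).
        lengths = key_lengths(dict_search)
    found = {}
    for size in lengths:
        if size > len(string):
            break  # lengths is sorted, so longer keys cannot fit either
        for start in range(len(string) - size + 1):
            piece = string[start:start + size]
            if piece in dict_search:  # O(1) average-case hash lookup
                found[piece] = dict_search[piece]
    return found


# Hypothetical data: the keys from the example above with arbitrary values.
table = {'aaa': 1, 'acd': 2, 'adb': 3, 'bccd': 4, 'cbbb': 5, 'abc': 6}
lengths = key_lengths(table)  # reuse this across many input strings
matches = search_values('aabbccdcabcbbacdaaa', table, lengths)
```

An empty result means no key occurs in the string. With a 300-symbol string and key lengths between 10 and 100, this performs at most about 300 slice-and-lookup steps per distinct key length, independent of how many items the dictionary holds.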