Unicode safe find using Boost and standard C++


Consider the following snippet:

    #include <boost/locale.hpp>
    #include <iostream>
    #include <string>

    namespace bl = boost::locale;
    static bl::generator gen;
    static auto loc = gen("en_US.UTF-8");
    // Pre-C++20 only: from C++20 on, u8"" literals are char8_t and need a conversion.
    std::string foo8 = u8"föo";
    std::string deco = bl::normalize(foo8, bl::norm_nfd, loc);
    std::string comp = bl::normalize(foo8, bl::norm_nfc, loc);
    std::cout << "decomposed: " << deco.find("o")
              << ", composed: " << comp.find("o") << "\n";

This gives: "decomposed: 1, composed: 3".

Now, the correct answer depends on your collation, but in most cases the latter is what you want: the first location of o, not the first part of a decomposed ö. In this example you can normalize the string to NFC to ensure the desired result, but that won't work in cases where the grapheme cluster can't be composed.
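The two offsets follow directly from the UTF-8 byte layout of each normal form. As a minimal sketch that needs neither Boost nor ICU, the bytes can be written out by hand. The helper names foo_nfd, foo_nfc and first_o_offset are made up for this illustration.

```cpp
#include <cstddef>
#include <string>

// "föo" in NFD: 'f', 'o', U+0308 COMBINING DIAERESIS (0xCC 0x88), 'o' -> 5 bytes.
inline std::string foo_nfd() { return "fo\xCC\x88o"; }

// "föo" in NFC: 'f', U+00F6 (0xC3 0xB6), 'o' -> 4 bytes.
inline std::string foo_nfc() { return "f\xC3\xB6o"; }

// Byte offset of the first 'o' byte, which is all std::string::find looks at.
inline std::size_t first_o_offset(const std::string& s) {
    return s.find('o');
}
```

In the decomposed form the base letter of ö is an ordinary 'o' byte at offset 1, so a byte-wise find stops there. In the composed form ö occupies two non-ASCII bytes and the standalone o sits at offset 3.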

Further, x.find("ö") will have implementation-defined behavior, as there are no guarantees about how ö is encoded in the search string.

I can implement a Unicode safe find function by implementing the algorithm in UAX #29, or by normalizing the search strings, but I'm wondering if there is a way of doing this using the C++ standard library and Boost (perhaps by combining locale and the string algorithms). So far I haven't found a solution.

Does anyone have a definitive answer? I'm aware that I could use ICU, and that boost::locale is just a C++ friendly wrapper around the ICU library (at least if you want full Unicode support).

> Further, x.find("ö") will have implementation-defined behavior, as there are no guarantees about how ö is encoded in the search string.

Sadly, there isn't much you can do about this. The client of your API will have to ensure that they call it using the u8 prefix and that the argument is normalized. One could write a find function that normalizes its input prior to searching, but there's no way to mitigate the ambiguity in encoding.
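To illustrate why the caller and the function have to agree on a normal form, here is a small sketch with both encodings of ö spelled out as bytes. The names contains_bytes, o_umlaut_nfc and o_umlaut_nfd are hypothetical and exist only for this example.

```cpp
#include <string>

// Plain byte-wise substring test; this is all std::string::find does.
inline bool contains_bytes(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}

// The same user-perceived character in its two canonical encodings.
inline std::string o_umlaut_nfc() { return "\xC3\xB6"; }   // U+00F6
inline std::string o_umlaut_nfd() { return "o\xCC\x88"; }  // U+006F U+0308
```

A needle in one form is never found in a haystack stored in the other form, even though both render identically. Normalizing both sides to the same form before calling find removes that mismatch, but not the question of which bytes a literal like "ö" contains in the first place.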

> I can implement a Unicode safe find function by implementing the algorithm in UAX #29

There's no need to implement it yourself, since it is already implemented by Boost.Locale's segment_index.
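Purely to illustrate what a cluster-aware find has to check, here is a dependency-free sketch. It is not UAX #29: it assumes valid UTF-8 and treats only the Combining Diacritical Marks block (U+0300 to U+036F) as extending the preceding character, whereas real segmentation (for example Boost.Locale's boundary::character rule) covers far more cases. The names combining_mark_at and find_whole are invented for this example.

```cpp
#include <cstddef>
#include <string>

// True if a UTF-8 sequence for U+0300..U+036F starts at s[i].
// That range encodes as 0xCC 0x80..0xBF and 0xCD 0x80..0xAF.
inline bool combining_mark_at(const std::string& s, std::size_t i) {
    if (i + 1 >= s.size()) return false;
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
    if (b0 == 0xCC) return b1 >= 0x80 && b1 <= 0xBF;
    if (b0 == 0xCD) return b1 >= 0x80 && b1 <= 0xAF;
    return false;
}

// Byte-wise find that skips matches which start on a combining mark
// or are immediately followed by one (i.e. matches that split a cluster).
// Returns a byte offset, or std::string::npos if there is no such match.
inline std::size_t find_whole(const std::string& hay, const std::string& needle) {
    if (needle.empty()) return std::string::npos;
    std::size_t pos = hay.find(needle);
    while (pos != std::string::npos) {
        const std::size_t end = pos + needle.size();
        if (!combining_mark_at(hay, pos) && !combining_mark_at(hay, end)) {
            return pos;
        }
        pos = hay.find(needle, pos + 1);
    }
    return std::string::npos;
}
```

With the decomposed "föo" from the question, find_whole skips the o at byte 1 because a combining diaeresis follows it, and reports the standalone o at byte 4. The needle still has to be in the same normal form as the haystack.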

> I'm wondering if there is a way of doing this using the C++ standard library and Boost (perhaps by combining locale and the string algorithms). So far I haven't found a solution.

The standard library is borderline useless here, and as far as I know Boost.Locale doesn't have any string search facilities. ICU's string search functionality uses the notion of canonical equivalence, and that's probably your best bet.

