Unicode-safe find using Boost and standard C++
Consider the following snippet:
    namespace bl = boost::locale;
    static bl::generator gen;
    static auto loc = gen("en_US.UTF-8");

    std::string foo8 = u8"föo";
    // NFD splits ö into 'o' + U+0308; NFC keeps it as the single code point U+00F6.
    std::string deco = bl::normalize(foo8, bl::norm_nfd, loc);
    std::string comp = bl::normalize(foo8, bl::norm_nfc, loc);
    std::cout << "decomposed: " << deco.find("o")
              << ", composed: " << comp.find("o") << "\n";
This gives: "decomposed: 1, composed: 3".
Now, the correct answer depends on the collation factor, but in most cases the latter is what you want: the first location of o, not the first part of a decomposed ö. In this example you can normalize the string to NFC to ensure the desired result, but this won't work in cases where the grapheme cluster can't be composed.
Further, x.find("ö") will have implementation-defined behavior, since there are no guarantees about how ö will be encoded in the search string.
I can implement a Unicode-safe find function by implementing the algorithm in UAX #29, or by normalizing the search strings, but I'm wondering if there is a way of doing this using the C++ standard library and/or Boost (perhaps by combining locale and string algorithm). I haven't found a solution.
Does anyone have a definitive answer? I'm aware that I can use ICU, and that boost::locale is just a C++-friendly wrapper around the ICU library (at least if you want full Unicode support).
> Further, x.find("ö") will have implementation-defined behavior, since there are no guarantees about how ö will be encoded in the search string.
Sadly, there isn't much you can do about this. The client of your API will have to ensure that they call it using the u8 prefix and that the argument is normalized. One could write a find function that normalizes its input prior to searching, but there's no way to mitigate the ambiguity in encoding.
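To make the normalization point concrete, here is a minimal sketch using only the standard library. The UTF-8 bytes are spelled out by hand so that the source file's encoding plays no part, and every name in it (contains_bytes, kFooNfc and so on) is hypothetical, chosen for illustration.

```cpp
#include <string>

// Hypothetical helper: a plain byte-wise search, which is all that
// std::string::find performs.
inline bool contains_bytes(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}

// "föo" with ö as the single code point U+00F6 (NFC): bytes C3 B6.
const std::string kFooNfc = "f\xC3\xB6" "o";
// "föo" with ö as 'o' followed by U+0308 (NFD): bytes 6F CC 88.
const std::string kFooNfd = "fo\xCC\x88" "o";

// The same user-perceived character "ö" in both normalization forms.
const std::string kOUmlautNfc = "\xC3\xB6";
const std::string kOUmlautNfd = "o\xCC\x88";
```

Searching kFooNfc for kOUmlautNfd (or kFooNfd for kOUmlautNfc) finds nothing, even though both sides represent the same text. The search only succeeds when haystack and needle are in the same normalization form.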
> I can implement a Unicode-safe find function by implementing the algorithm in UAX #29
There's no need to implement it yourself, since it's already implemented in Boost.Locale's segment_index.
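As a rough illustration of what a boundary-aware search has to check, here is a sketch using only the standard library. It is not a replacement for segment_index and not a UAX #29 implementation. It assumes, as a deliberate simplification, that only code points in the Combining Diacritical Marks block (U+0300 to U+036F) extend a cluster, and that the needle starts with a base character. The function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Simplified assumption: only U+0300..U+036F extend a grapheme cluster.
// In UTF-8 these are the two-byte sequences CC 80..CC BF (U+0300..U+033F)
// and CD 80..CD AF (U+0340..U+036F).
inline bool starts_with_combining_mark(const std::string& s, std::size_t pos) {
    if (pos + 1 >= s.size()) return false;
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    const unsigned char trail = static_cast<unsigned char>(s[pos + 1]);
    if (lead == 0xCC) return trail >= 0x80 && trail <= 0xBF;
    if (lead == 0xCD) return trail >= 0x80 && trail <= 0xAF;
    return false;
}

// Returns the byte offset of the first match of `needle` that is not
// immediately followed by a combining mark, or npos if there is none.
inline std::size_t find_whole_cluster(const std::string& haystack,
                                      const std::string& needle) {
    if (needle.empty()) return std::string::npos;
    std::size_t pos = haystack.find(needle);
    while (pos != std::string::npos) {
        // Reject matches that would split a base character from its mark.
        if (!starts_with_combining_mark(haystack, pos + needle.size())) return pos;
        pos = haystack.find(needle, pos + 1);
    }
    return std::string::npos;
}
```

With the decomposed "föo" (bytes 66 6F CC 88 6F), searching for "o" skips the o at byte 1, because U+0308 follows it, and returns byte offset 4. A real implementation should take its cluster boundaries from Boost.Locale or ICU rather than from a hard-coded range.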
> I'm wondering if there is a way of doing this using the C++ standard library and/or Boost (perhaps by combining locale and string algorithm). I haven't found a solution.
The standard library is borderline useless here, and as far as I know Boost.Locale doesn't have string search facilities. ICU's string search functionality uses the notion of canonical equivalence, and that's your best bet.