Regex in Python with Unicode (Japanese) character issue
I want to remove part of the string below (the first run of Japanese characters, 加護亜依), which is stored in the string oldstring:
[dmsm-8433] 加護亜依 kago ai – 加護亜依 vs. friday
I'm using the following regex within Python:
    p=re.compile(ur"( [\w]+) (?=[a-zA-Z ]+–)", re.UNICODE)
    newstring=p.sub("", oldstring)
When I output newstring, nothing has been removed.
You can use the following snippet to solve the issue:
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import re

    str = u'[dmsm-8433] 加護亜依 kago ai – 加護亜依 vs. friday'
    # Run of Japanese characters plus a space, only when a romanized name and an en dash follow
    regex = u'[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+ (?=[a-zA-Z ]+–)'
    p = re.compile(regex, re.U)
    match = p.sub("", str)
    # Python 2: encode explicitly before printing
    print match.encode("utf-8")
See the ideone demo.
Besides the # -*- coding: utf-8 -*- declaration, I have added @nhahtdh's character class to detect Japanese symbols.
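To make the character class easier to read, here is a minimal sketch that builds it from its individual ranges. It assumes Python 3 (the snippet above targets Python 2), and the block names in the comments, as well as the names JAPANESE_RANGES, japanese_class, old_string and new_string, are my own labels rather than anything from the original answer.

```python
import re

# Ranges taken from the character class above; the block names are my labels.
JAPANESE_RANGES = [
    ("\u3000", "\u303f"),  # CJK Symbols and Punctuation
    ("\u3040", "\u309f"),  # Hiragana
    ("\u30a0", "\u30ff"),  # Katakana
    ("\uff00", "\uff9f"),  # Halfwidth and Fullwidth Forms (partial)
    ("\u4e00", "\u9faf"),  # CJK Unified Ideographs (kanji)
    ("\u3400", "\u4dbf"),  # CJK Unified Ideographs Extension A
]

# Join the ranges into a single bracket expression: [lo-hi lo-hi ...]
japanese_class = "[" + "".join(f"{lo}-{hi}" for lo, hi in JAPANESE_RANGES) + "]"

# One or more Japanese characters plus one space, matched only when
# followed by a romanized name and an en dash (the lookahead consumes nothing).
pattern = re.compile(japanese_class + r"+ (?=[a-zA-Z ]+–)")

old_string = "[dmsm-8433] 加護亜依 kago ai – 加護亜依 vs. friday"
new_string = pattern.sub("", old_string)
print(new_string)
```

Only the first 加護亜依 is removed, because the second one is followed by "vs." rather than by letters and an en dash, so the lookahead fails there.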
Note that match needs to be encoded as a UTF-8 string "manually", since Python 2 needs to be "reminded" that it is working with Unicode all the time.
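For comparison, here is a small sketch of what that encode step does, assuming Python 3, where str is already a Unicode string and print normally needs no explicit encode(). The variable names text, encoded and decoded are just for illustration.

```python
text = "kago ai – 加護亜依"

# encode() turns the Unicode string into UTF-8 bytes,
# which is what Python 2's print needed to be handed explicitly.
encoded = text.encode("utf-8")

# decode() reverses the step and gives back the original string.
decoded = encoded.decode("utf-8")

print(type(text).__name__, type(encoded).__name__)
print(len(text), len(encoded))  # characters vs. bytes
print(decoded == text)
```

The byte count is larger than the character count because the en dash and each Japanese character take more than one byte in UTF-8.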