Python regex with Unicode (Japanese) character issue


I want to remove part of the string below (the first 加護亜依, shown in bold in the original post), stored in the string oldstring:

[dmsm-8433] 加護亜依 kago ai – 加護亜依 vs. friday

I'm using the following regex in Python:

    p = re.compile(ur"( [\w]+) (?=[a-zA-Z ]+–)", re.UNICODE)
    newstring = p.sub("", oldstring)

When I output newstring, nothing has been removed.

You can use the following snippet to solve the issue:

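    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import re

    str = u'[dmsm-8433] 加護亜依 kago ai – 加護亜依 vs. friday'
    # One or more Japanese characters and a space, but only when followed by Latin letters/spaces and an en dash
    regex = u'[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+ (?=[A-Za-z ]+–)'
    p = re.compile(regex, re.U)
    match = p.sub("", str)
    # Encode explicitly so Python 2 prints the Unicode result as UTF-8 bytes
    print match.encode("UTF-8")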

See the ideone demo.

Besides the # -*- coding: utf-8 -*- declaration, I have added @nhahtdh's character class to detect Japanese symbols.
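To make that character class easier to read, here is a minimal sketch that splits it into its Unicode ranges and lists the runs it matches in the sample string. It assumes Python 3 (where every str is Unicode), and the names JAPANESE_CLASS and find_japanese_runs are made up for illustration; they are not part of the original answer.

```python
import re

# Same ranges as the character class in the answer, one range per line.
JAPANESE_CLASS = (
    "["
    "\u3000-\u303f"  # CJK symbols and punctuation
    "\u3040-\u309f"  # hiragana
    "\u30a0-\u30ff"  # katakana
    "\uff00-\uff9f"  # full-width ASCII forms and half-width katakana
    "\u4e00-\u9faf"  # common CJK unified ideographs (kanji)
    "\u3400-\u4dbf"  # CJK unified ideographs, extension A
    "]"
)

japanese_run = re.compile(JAPANESE_CLASS + "+")


def find_japanese_runs(text):
    """Return every maximal run of characters covered by the class."""
    return japanese_run.findall(text)


runs = find_japanese_runs("[dmsm-8433] 加護亜依 kago ai – 加護亜依 vs. friday")
print(runs)
```

Both occurrences of 加護亜依 are found here; in the answer's regex it is the lookahead (?=[A-Za-z ]+–) that restricts the substitution to the first one.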

Note that match needs to be encoded as a UTF-8 string "manually", since Python 2 needs to be "reminded" that we are working with Unicode all the time.
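For comparison, here is a sketch of the same substitution assuming Python 3, which the original answer does not cover. There str is already Unicode and re.U is the default for str patterns, so the u prefix, the flag and the final encode call should all be unnecessary. The variable names are illustrative.

```python
import re

old_string = "[dmsm-8433] 加護亜依 kago ai – 加護亜依 vs. friday"

# Japanese run plus one space, only when followed by Latin letters/spaces and an en dash.
pattern = re.compile(
    "[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+ "
    "(?=[A-Za-z ]+–)"
)

# Only the first 加護亜依 is removed; the second is followed by "vs.", so the lookahead fails.
new_string = pattern.sub("", old_string)
print(new_string)
```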

