Python regex with Unicode (Japanese) character issue

August 15, 2012

I want to remove part of the string below, the first 加護亜依, which is stored in the variable oldstring:

[dmsm-8433] 加護亜依 kago ai – 加護亜依 vs. friday

I'm using the following regex in Python:

p = re.compile(ur"( [\w]+) (?=[a-zA-Z ]+–)", re.UNICODE)
newstring = p.sub("", oldstring)

When I print newstring, nothing has been removed.

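You can use the following snippet to solve the issue:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re

# Unicode literal, so the regex sees characters rather than raw bytes
text = u'[dmsm-8433] 加護亜依 kago ai – 加護亜依 vs. friday'
# A run of Japanese characters and one space, only when Latin text and "–" follow
regex = u'[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+ (?=[a-zA-Z ]+–)'
p = re.compile(regex, re.U)
match = p.sub("", text)
# Python 2: encode explicitly before printing
print match.encode("utf-8")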

See the ideone demo.

Besides the # -*- coding: utf-8 -*- declaration, I have added @nhahtdh's character class to detect Japanese symbols.
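The character class can be read one range at a time. The sketch below rebuilds it from named pieces. It assumes Python 3, where string literals are Unicode by default, so the coding declaration and the u prefix are not needed. The names JAPANESE_RANGES, JAPANESE_CLASS, PATTERN and strip_japanese_before_dash are made up for illustration, and the block names in the comments are my own labels, not part of the original answer.

```python
import re

# The ranges from the character class in the snippet, one per line.
JAPANESE_RANGES = [
    ("\u3000", "\u303f"),  # CJK Symbols and Punctuation
    ("\u3040", "\u309f"),  # Hiragana
    ("\u30a0", "\u30ff"),  # Katakana
    ("\uff00", "\uff9f"),  # fullwidth ASCII forms and halfwidth katakana
    ("\u4e00", "\u9faf"),  # CJK Unified Ideographs (most of the block)
    ("\u3400", "\u4dbf"),  # CJK Unified Ideographs Extension A
]

# Join the pairs into a single "[...]" class.
JAPANESE_CLASS = "[" + "".join(lo + "-" + hi for lo, hi in JAPANESE_RANGES) + "]"

# A run of Japanese characters plus one space, matched only when
# Latin letters/spaces and an en dash (U+2013) follow.
PATTERN = re.compile(JAPANESE_CLASS + "+ (?=[a-zA-Z ]+\u2013)")


def strip_japanese_before_dash(text):
    """Hypothetical helper: remove the Japanese run that precedes 'latin text –'."""
    return PATTERN.sub("", text)


if __name__ == "__main__":
    sample = "[dmsm-8433] 加護亜依 kago ai \u2013 加護亜依 vs. friday"
    print(strip_japanese_before_dash(sample))
```

The second 加護亜依 is left alone because the lookahead requires an en dash after the Latin text, and "vs." is followed by a period instead.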

Note that match needs to be encoded to a UTF-8 string "manually", since Python 2 needs to be "reminded" that it is working with Unicode all the time.
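To make the encoding step concrete, here is a small sketch. It is written for Python 3 as an assumption (the snippet in the answer is Python 2), and it uses a shortened sample string and only the ideograph range for brevity. It shows that the result of the substitution is a Unicode string and that .encode("utf-8") turns it into bytes.

```python
import re

# Shortened sample; "\u2013" is the en dash used in the original string.
text = "加護亜依 kago ai \u2013 friday"

# Only the CJK ideograph range is used here, to keep the example short.
result = re.sub("[\u4e00-\u9faf]+ (?=[a-zA-Z ]+\u2013)", "", text)

encoded = result.encode("utf-8")   # Unicode string -> UTF-8 bytes
decoded = encoded.decode("utf-8")  # UTF-8 bytes -> Unicode string

print(type(result).__name__, type(encoded).__name__)
print(decoded == result)
```

In Python 3, print() normally accepts the Unicode string directly, so the explicit encode matters mostly when writing bytes to a file or socket.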
