Python unicode.splitlines() triggers at non-EOL character

- February 15, 2010

I am trying the following in Python 2.7:

    >>> s = u"some\u2028text"
    >>> s
    u'some\u2028text'
    >>> l = s.splitlines(True)
    >>> l
    [u'some\u2028', u'text']

\u2028 is the left-to-right embedding character, not \r or \n, so the line should not be split. Is this a bug, or am I misunderstanding something?

\u2028 is LINE SEPARATOR; LEFT-TO-RIGHT EMBEDDING is \u202a:

    >>> import unicodedata
    >>> unicodedata.name(u'\u2028')
    'LINE SEPARATOR'
    >>> unicodedata.name(u'\u202a')
    'LEFT-TO-RIGHT EMBEDDING'
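To see the difference between the two characters directly, here is a small sketch that prints their names and categories and runs splitlines() on each. It is written for Python 3 as an assumption (in Python 2.7 the same calls work on u'' literals), and the variable names are only illustrative.

```python
import unicodedata

# Illustrative names; these constants are not part of any library.
LINE_SEPARATOR = "\u2028"
LEFT_TO_RIGHT_EMBEDDING = "\u202a"

for ch in (LINE_SEPARATOR, LEFT_TO_RIGHT_EMBEDDING):
    # Print code point, Unicode name, and general category.
    print("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.category(ch))

# splitlines(True) keeps the line terminators in the result.
split_on_separator = ("some" + LINE_SEPARATOR + "text").splitlines(True)
split_on_embedding = ("some" + LEFT_TO_RIGHT_EMBEDDING + "text").splitlines(True)

print(split_on_separator)   # split into two pieces at U+2028
print(split_on_embedding)   # left as a single piece
```

Only the string containing U+2028 is split; the one containing U+202A comes back as a single element.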

The list of code points considered line breaks is easy to see (though not easy to find) in the Python source (Python 2.7, comments are mine):

    /* Returns 1 for Unicode characters having the line break
     * property 'BK', 'CR', 'LF' or 'NL' or having bidirectional
     * type 'B', 0 otherwise.
     */
    int _PyUnicode_IsLinebreak(register const Py_UNICODE ch)
    {
        switch (ch) {
        // Basic Latin
        case 0x000A:    // LINE FEED
        case 0x000B:    // VERTICAL TABULATION
        case 0x000C:    // FORM FEED
        case 0x000D:    // CARRIAGE RETURN
        case 0x001C:    // FILE SEPARATOR
        case 0x001D:    // GROUP SEPARATOR
        case 0x001E:    // RECORD SEPARATOR

        // Latin-1 Supplement
        case 0x0085:    // NEXT LINE

        // General Punctuation
        case 0x2028:    // LINE SEPARATOR
        case 0x2029:    // PARAGRAPH SEPARATOR
            return 1;
        }
        return 0;
    }
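The list in the C source can be cross-checked from Python, and if only \r, \n and \r\n should end a line, a small helper can stand in for splitlines(). The following is a minimal sketch written for Python 3 as an assumption; `splitlines_breaks` and `split_ascii_lines` are hypothetical names, not part of the standard library.

```python
import re

def splitlines_breaks(limit=0x3000):
    """Return the code points below `limit` that splitlines() splits on."""
    found = []
    for cp in range(limit):
        # A character is a line boundary if it separates "a" from "b".
        if len(("a" + chr(cp) + "b").splitlines()) == 2:
            found.append(cp)
    return found

# Only the classic ASCII line endings; \r\n is tried first so it counts once.
_ASCII_EOL = re.compile(r"\r\n|\r|\n")

def split_ascii_lines(text, keepends=False):
    """Split on \\r\\n, \\r and \\n only, mirroring splitlines() otherwise."""
    lines = []
    start = 0
    for match in _ASCII_EOL.finditer(text):
        end = match.end() if keepends else match.start()
        lines.append(text[start:end])
        start = match.end()
    if start < len(text):
        # Trailing text without a terminator becomes the last line.
        lines.append(text[start:])
    return lines

print([hex(cp) for cp in splitlines_breaks()])
print(split_ascii_lines("some\u2028text\nmore", keepends=True))
```

The helper leaves U+2028 and the other separators inside the line, which may or may not be what a given application wants; text that really uses LINE SEPARATOR as a line break would then stay unsplit.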
