Python unicode.splitlines() triggers at non-EOL character
I am trying this in Python 2.7:
>>> s = u"some\u2028text"
>>> s
u'some\u2028text'
>>> l = s.splitlines(True)
>>> l
[u'some\u2028', u'text']
\u2028 is the left-to-right embedding character, not \r or \n, so the line should not be split. Is this a bug or my misunderstanding?
\u2028 is LINE SEPARATOR; left-to-right embedding is \u202a:
>>> import unicodedata
>>> unicodedata.name(u'\u2028')
'LINE SEPARATOR'
>>> unicodedata.name(u'\u202a')
'LEFT-TO-RIGHT EMBEDDING'
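To make the difference between the two characters easier to see, here is a small sketch that also prints the general category and bidirectional class from the same module. The helper name `describe` is made up for this illustration; only the `unicodedata` calls are standard.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, bidirectional class) for one character."""
    return (unicodedata.name(ch),
            unicodedata.category(ch),
            unicodedata.bidirectional(ch))

# Compare the separators with the embedding control character.
for ch in (u"\u2028", u"\u2029", u"\u202a"):
    print("U+%04X %s" % (ord(ch), describe(ch)))
```

U+2028 is reported in category Zl (line separator), while U+202A is in Cf (format control), which is why only the former ends a line.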
The list of code points considered line breaks is easy to see (though not easy to find) in the Python source (Python 2.7, comments by me):
/* Returns 1 for Unicode characters having the line break
 * property 'BK', 'CR', 'LF' or 'NL' or having bidirectional
 * type 'B', 0 otherwise.
 */
int _PyUnicode_IsLinebreak(register const Py_UNICODE ch)
{
    switch (ch) {
    // Basic Latin
    case 0x000A: // LINE FEED
    case 0x000B: // VERTICAL TABULATION
    case 0x000C: // FORM FEED
    case 0x000D: // CARRIAGE RETURN
    case 0x001C: // FILE SEPARATOR
    case 0x001D: // GROUP SEPARATOR
    case 0x001E: // RECORD SEPARATOR
    // Latin-1 Supplement
    case 0x0085: // NEXT LINE
    // General Punctuation
    case 0x2028: // LINE SEPARATOR
    case 0x2029: // PARAGRAPH SEPARATOR
        return 1;
    }
    return 0;
}
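The list above can be checked from Python itself. The sketch below confirms that splitlines() breaks at each of those code points and not at \u202a. It then shows one possible way to split only on \r\n, \r and \n while keeping the line endings. Both `splits_at` and `split_ascii_lines` are hypothetical helper names introduced here, not part of Python or of the original answer.

```python
import re

try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3

# Code points copied from the switch statement above.
LINEBREAKS = [0x000A, 0x000B, 0x000C, 0x000D, 0x001C,
              0x001D, 0x001E, 0x0085, 0x2028, 0x2029]

def splits_at(codepoint):
    """Return True if splitlines() breaks a string at this code point."""
    return len((u"a" + unichr(codepoint) + u"b").splitlines()) == 2

# Matches a run of text ending in \r\n, \r or \n, or a final unterminated run.
_ASCII_LINE = re.compile(u"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+")

def split_ascii_lines(text):
    """Like splitlines(True), but only \\r\\n, \\r and \\n end a line."""
    return _ASCII_LINE.findall(text)

print([hex(cp) for cp in LINEBREAKS if splits_at(cp)])
print(splits_at(0x202A))
print(split_ascii_lines(u"some\u2028text\nmore"))
```

With this helper, the string from the question stays in one piece, because \u2028 is treated as ordinary text.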