Regular Expressions UTF-8 matchers: Letters, Marks, Punctuation etc.

From WikiOD

Matching letters in different alphabets

The examples below are given in Ruby, but the same matchers should be available in most modern languages.

Let’s say we have the string "AℵNaïve", produced by Messy Artificial Intelligence. It consists entirely of letters, but the generic \w matcher, which in Ruby covers only ASCII word characters, matches just the first one:

▶ "AℵNaïve"[/\w+/]
#⇒ "A"

The correct way to match a Unicode letter together with its combining marks is to use \X, which matches a grapheme cluster. There is a caveat for Ruby, though: Onigmo, the regex engine for Ruby, still uses the old definition of a grapheme cluster and has not yet been updated to the Extended Grapheme Cluster defined in Unicode Standard Annex #29.
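As a minimal sketch of what \X does in the simple case, the snippet below splits the sample string into grapheme clusters. It assumes the accent is stored as a separate combining mark (i followed by U+0308), and the constant names SAMPLE and CLUSTERS are made up for this illustration. For a plain base letter plus one combining mark, the old and the extended definitions should agree, so the caveat above is not expected to matter here.

```ruby
# Sample string with the accent written as a separate combining mark:
# "i" followed by U+0308 COMBINING DIAERESIS (assumed encoding).
SAMPLE = "A\u2135Nai\u0308ve"

# Each \X match is one grapheme cluster, i.e. one user-perceived character.
CLUSTERS = SAMPLE.scan(/\X/)

puts SAMPLE.length    # number of code points
puts CLUSTERS.length  # number of grapheme clusters
p CLUSTERS[4]         # the "i" together with its combining mark
```

The code point count is one higher than the cluster count, because the combining mark is folded into the cluster of the letter it modifies.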

So, for Ruby, we can use a workaround: \p{L} (any Unicode letter) almost does the job, except that it stops at the combining diacritical mark on the i:

▶ "AℵNaïve"[/\p{L}+/]
#⇒ "AℵNai"

By adding the Mark category (\p{M}, which covers combining marks) to the character class, we can finally match the whole string:

▶ "AℵNaïve"[/[\p{L}\p{M}]+/]
#⇒ "AℵNaïve"
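The results above depend on how the accented letter is encoded. The following sketch spells out both encodings with explicit escapes so the difference is visible. The helper name leading_word and the constants DECOMPOSED and PRECOMPOSED are hypothetical, introduced only for this illustration.

```ruby
# Hypothetical helper (not part of the original examples): returns the first
# run of letters and combining marks in +text+, or nil if there is none.
def leading_word(text)
  text[/[\p{L}\p{M}]+/]
end

# Decomposed: "i" followed by U+0308 COMBINING DIAERESIS (two code points).
DECOMPOSED = "A\u2135Nai\u0308ve"
# Precomposed: U+00EF LATIN SMALL LETTER I WITH DIAERESIS (one code point).
PRECOMPOSED = "A\u2135Na\u00EFve"

# \p{L} alone stops before the combining mark in the decomposed form...
p DECOMPOSED[/\p{L}+/] == "A\u2135Nai"
# ...but covers the whole precomposed string, since U+00EF is itself a letter.
p PRECOMPOSED[/\p{L}+/] == PRECOMPOSED
# Adding \p{M} makes the match independent of the encoding.
p leading_word(DECOMPOSED) == DECOMPOSED
p leading_word(PRECOMPOSED) == PRECOMPOSED
```

If the input might arrive in either form, including \p{M} in the character class is the safer choice, since it costs nothing when the text is already precomposed.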