Regular Expressions UTF-8 matchers: Letters, Marks, Punctuation etc.

From WikiOD
Revision as of 04:11, 14 June 2021 by Admin

Matching letters in different alphabets

The examples below are given in Ruby, but the same matchers should be available in most modern regex engines.

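Let's say we have the string "AℵNaïve", produced by Messy Artificial Intelligence, in which the ï is written as a plain i followed by a combining diaeresis (U+0308). The string consists entirely of letters, but the generic \w matcher, which is ASCII-only in Ruby, won't match much: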

▶ "AℵNaïve"[/\w+/]
#⇒ "A"

The correct way to match a Unicode letter together with its combining marks is to use \X, which matches a grapheme cluster. There is a caveat for Ruby, though: at the time of writing, Onigmo, Ruby's regex engine, still used the older definition of a grapheme cluster and had not yet been updated to the Extended Grapheme Cluster defined in Unicode Standard Annex #29.
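As a minimal sketch of what \X does, the snippet below splits the sample string into grapheme clusters. The variable names are illustrative only, and the string is written with explicit escapes (U+2135 for ℵ, U+0308 for the combining diaeresis) on the assumption that the ï is decomposed, as the rest of this article implies.

```ruby
# The sample string with explicit code points, so that the decomposed
# "i" + U+0308 form survives copy and paste (assumption: ï is decomposed).
text = "A\u2135Nai\u0308ve"

# \X consumes one grapheme cluster per match.
clusters = text.scan(/\X/)

puts text.each_char.count   # code points in the string
puts clusters.length        # grapheme clusters found by \X
p clusters[4]               # the "i" together with its combining mark
```

For this particular string the result should not depend on which grapheme cluster definition the engine uses, because a base letter followed by a combining mark forms a single cluster under both. Note that \X on its own does not restrict the match to letters.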

So, for Ruby, we can use a workaround: \p{L} works almost fine, except that it stops at the combining diacritical mark on the i:

▶ "AℵNaïve"[/\p{L}+/]
#⇒ "AℵNai"

By adding the Mark category (\p{M}, which covers combining marks) to the expression, we can finally match everything:

▶ "AℵNaïve"[/[\p{L}\p{M}]+/]
#⇒ "AℵNaïve"
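The three matchers can be compared side by side in a plain script. This is only a sketch: the variable names are made up for illustration, and the string uses explicit escapes on the assumption that the ï is stored as i followed by U+0308.

```ruby
# Sample string: A, ℵ (U+2135), N, a, i + combining diaeresis (U+0308), v, e.
text = "A\u2135Nai\u0308ve"

ascii_word   = text[/\w+/]             # \w is ASCII-only here, stops at ℵ
letters_only = text[/\p{L}+/]          # any letter, stops at the combining mark
with_marks   = text[/[\p{L}\p{M}]+/]   # letters plus combining marks

p ascii_word
p letters_only
p with_marks == text                   # true when the whole string matched
```

If the ï in your input is the precomposed character U+00EF instead, \p{L}+ alone already matches the whole string, since that single code point is itself a letter.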

Credit:Stack_Overflow_Documentation