Regular Expressions Regex Pitfalls


Why doesn't dot (.) match the newline character ("\n")?

.* means "match any character, zero or more times", so it is often read as "catch everything until the end of input".

So, for simple strings, like hello world, .* works perfectly. But if you have a string representing, for example, lines in a file, these lines would be separated by a line separator, such as \n (newline) on Unix-like systems and \r\n (carriage return and newline) on Windows.

By default in most regex engines, . doesn't match newline characters, so the match stops at the end of each logical line. If you want . to match everything, including newlines, you need to enable "dot-matches-all" mode in your regex engine of choice (for example, pass the re.DOTALL flag in Python, or use the /s modifier in PCRE).
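As a minimal sketch of the difference, using Python's standard re module (the sample text and variable names are made up for illustration):

```python
import re

# Hypothetical two-line input, separated by a Unix newline.
text = "first line\nsecond line"

# Default mode: '.' does not match '\n', so the match stops at the line end.
default_match = re.match(r".*", text).group()

# DOTALL mode: '.' also matches '\n', so the match covers the whole string.
dotall_match = re.match(r".*", text, re.DOTALL).group()

print(repr(default_match))  # 'first line'
print(repr(dotall_match))   # 'first line\nsecond line'
```

In Python the same mode can also be switched on inside the pattern itself with the inline flag (?s).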

Why does a regex skip some closing brackets/parentheses and match them afterwards?

Consider this example:

He went into the cafe "Dostoevski" and said: "Good evening."

Here we have two sets of quotes. Let's assume we want to match both, so that our regex matches at "Dostoevski" and "Good evening."

At first, you could be tempted to keep it simple:

".*"  # matches a quote, then any characters until the next quote

But it doesn't work: it produces a single match that runs from the opening quote of "Dostoevski" to the closing quote of "Good evening.", including the and said: part in between.

Why did it happen?

This happens because * is greedy. When the regex engine encounters .*, it first consumes as much input as it can, which here is everything up to the end of the line. It then still has to match the final ", so it backtracks: it gives back the consumed characters one at a time, starting from the end, until the next character is a ". The first " it finds this way is the last one in the input, at the end of the "Good evening." part, so that is where the match ends.
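The greedy behaviour can be reproduced with a short sketch in Python's re module (the variable names are illustrative only):

```python
import re

sentence = 'He went into the cafe "Dostoevski" and said: "Good evening."'

# Greedy '.*' runs to the end of the line, then backtracks to the LAST quote.
greedy_matches = re.findall(r'".*"', sentence)

print(greedy_matches)
# ['"Dostoevski" and said: "Good evening."']
```

Only one match is returned, and it spans both quoted phrases.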

How to prevent this and match exactly to the first quotes?

Use "[^"]*" instead. The negated character class [^"] matches any character except a quote, so [^"]* cannot run past the first closing ": it consumes input only up to that point, just as needed.
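A minimal sketch of the negated-class version in Python's re module, applied to the example sentence (variable names are illustrative only):

```python
import re

sentence = 'He went into the cafe "Dostoevski" and said: "Good evening."'

# [^"]* matches any run of non-quote characters, so each match
# ends at the first closing quote after its opening quote.
quoted_phrases = re.findall(r'"[^"]*"', sentence)

print(quoted_phrases)
# ['"Dostoevski"', '"Good evening."']
```

Each quoted phrase now comes back as its own match.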