Regular Expressions Regex Pitfalls
Why doesn't dot (.) match the newline character ("\n")?
.* in regex is commonly read as "match everything until the end of input".
So, for simple strings, like
.* works perfectly. But if you have a string representing, for example, lines in a file, these lines would be separated by a line separator, such as
\n (newline) on Unix-like systems and
\r\n (carriage return and newline) on Windows.
By default in most regex engines,
. doesn't match newline characters, so the matching stops at the end of each logical line. If you want
. to match really everything, including newlines, you need to enable "dot-matches-all" mode in your regex engine of choice (for example, add the
re.DOTALL flag in Python, or
/s in PCRE).
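As a minimal sketch of the difference in Python, the snippet below runs the same pattern with and without re.DOTALL. The sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical two-line input, separated by a Unix newline.
text = "first line\nsecond line"

# By default "." does not match "\n", so ".*" stops at the end of the first line.
default_match = re.match(r".*", text).group()

# With re.DOTALL, "." also matches "\n", so ".*" runs to the end of the input.
dotall_match = re.match(r".*", text, re.DOTALL).group()

print(repr(default_match))  # 'first line'
print(repr(dotall_match))   # 'first line\nsecond line'
```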
Why does a regex skip some closing brackets/parentheses and match them afterwards?
Consider this example:
He went into the cafe "Dostoevski" and said: "Good evening."
Here we have two sets of quotes. Let's assume we want to match both, so that our regex matches at
At first, you could be tempted to keep it simple:
".*" # matches a quote, then any characters until the next quote
But it doesn't work: it matches from the first quote in
"Dostoevski" and until the closing quote in
"Good evening.", including the
and said: part.
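The over-long match is easy to reproduce. This sketch uses Python's re module purely for illustration, and any engine with greedy quantifiers should behave the same way on this input:

```python
import re

sentence = 'He went into the cafe "Dostoevski" and said: "Good evening."'

# Greedy ".*" runs from the first quote all the way to the last quote.
greedy = re.search(r'".*"', sentence).group()

print(greedy)  # "Dostoevski" and said: "Good evening."
```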
Why did it happen?
This happens because the * quantifier is greedy: when the regex engine encounters
.*, it "eats up" all of the input to the very end. Then, it still needs to match the final
". So, it backtracks from the end of the input, giving back one character at a time until a
" is found - and that is, of course, the last
" in the input, at the end of the
"Good evening." part.
How to prevent this and match exactly to the first quotes?
Use [^"]* instead of .* between the quotes. The negated character class doesn't eat all the input - only up to the first
", just as needed.
This article is an extract of the original Stack Overflow Documentation created by contributors and released under CC BY-SA 3.0.