Exploiting Regular Expressions
--
A regular expression (or regex) is basically a search pattern. For example, the expression [cb]at
will match both cat and bat. This isn’t a regex tutorial so if you don’t know much about regex, go through this amazing cheat sheet before reading any further.
Let’s get started with some basics anyway :)
Repetition Operators
+
is a repetition operator and matches repetition of characters, patterns or groups.
ca+t will match caaaat
There’s another repetition operator, *
. The only difference between +
and *
is that +
matches one or more while *
matches zero or more. To be clear about this,
ca*t will match both caaaat and ct
ca+t will match caaaat but not ct
Greedy & Lazy Matching
If I want to match everything between x and y, I can simply do x.*y
where .
means anything. This expression will hence match x)dw2rfy
without any problem.
Repetition operators are greedy by default. They will try to match as much as possible.
Let’s consider the above example again,
Using the expression x.*y
against the string axaayaaya
will return xaayaay
. One may expect this search to return xaay
or xaayaay
because both of these fit our condition x<anything here>y
but this is where the concept of greedy and lazy matching comes into play. By default, the expression will return the longest possible result i.e. xaayaay
but we can make it lazy or return the shortest result by using the ?
operator e.g. x.*?y
.
Backtracking
Computers are so smart that they are dumb. When you tell it to run x*y
against a string xxxxxxxxxxxxxx
, we can tell that it’s not going to match because it doesn’t contain a y at the end. But your regex engine doesn’t know that! It will do the following
xxxxxxxxxxxxxx # Didn't match
xxxxxxxxxxxxxx # Let's backtrack and go back
xxxxxxxxxxxxx # Still didn't match
xxxxxxxxxxxxx # Let's backtrack one more time
xxxxxxxxxxxx # Fuck, no luck
xxxxxxxxxxxx # Let's backtrack in hopes of matching
xxxxxxxxxxx # Okay, this isn't funny anymore
xxxxxxxxxxx # Baccccccckttraaaack>> Let's skip a few stepsxx # No match? WTF?
x # Let's backtrack
x # Nothing to match
# Nothing to backtrack to