[tangentially related to CVE-2023-24329] urlparse does not correctly handle schemes that begin with ASCII digits, '+', '-', and '.' characters #99418

Closed
@kenballus

Background

RFC 3986 defines a scheme like this:

  • scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

RFC 2234 defines an ALPHA like this:

  • ALPHA = %x41-5A / %x61-7A
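The two grammar rules above can be sketched as a regular expression. This is only an illustration of the ABNF, not code from CPython; the names `_SCHEME_RE` and `is_valid_scheme` are hypothetical.

```python
import re

# RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# ALPHA is limited to the ASCII ranges A-Z and a-z, so the character
# classes are spelled out instead of using str.isalpha().
_SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_valid_scheme(candidate: str) -> bool:
    """Return True if the whole string matches the RFC 3986 scheme rule."""
    return _SCHEME_RE.fullmatch(candidate) is not None
```

Under this sketch, `is_valid_scheme("git+ssh")` is true, while `is_valid_scheme(".")` and `is_valid_scheme("1http")` are false because the first character is not an ASCII letter.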

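The WHATWG URL spec defines a scheme like this:

  • A URL-scheme string must be one ASCII alpha, followed by zero or more of ASCII alphanumeric, U+002B (+), U+002D (-), and U+002E (.).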

The bug

This is the scheme string parsing code from Lib/urllib/parse.py:462-468:

    i = url.find(':')
    if i > 0:
        for c in url[:i]:
            if c not in scheme_chars:
                break
        else:
            scheme, url = url[:i].lower(), url[i+1:]

This is the definition of scheme_chars from Lib/urllib/parse.py:77-80:

    scheme_chars = ('abcdefghijklmnopqrstuvwxyz'
                    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                    '0123456789'
                    '+-.')

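Because the loop checks every character, including the first, against the same scheme_chars set, it erroneously validates schemes that begin with any of ('.', '-', '+', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'). Both specifications require the first character to be an ASCII letter, so this behavior violates both of them.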
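To make the difference concrete without depending on a particular CPython version, here is a minimal standalone sketch. `split_scheme_as_quoted` copies the loop quoted above into a function, and `split_scheme_strict` adds a first-character check. Both function names and `SCHEME_CHARS` are hypothetical, and the strict variant is one possible fix, not necessarily the one CPython adopted.

```python
import string

# Same character set as scheme_chars in the quoted code.
SCHEME_CHARS = string.ascii_letters + string.digits + "+-."


def split_scheme_as_quoted(url):
    """Mirror the quoted loop: every character is checked the same way."""
    scheme = ""
    i = url.find(":")
    if i > 0:
        for c in url[:i]:
            if c not in SCHEME_CHARS:
                break
        else:
            scheme, url = url[:i].lower(), url[i + 1:]
    return scheme, url


def split_scheme_strict(url):
    """Hypothetical variant: the first character must be an ASCII letter."""
    scheme = ""
    i = url.find(":")
    if i > 0 and url[0] in string.ascii_letters:
        if all(c in SCHEME_CHARS for c in url[:i]):
            scheme, url = url[:i].lower(), url[i + 1:]
    return scheme, url
```

With these definitions, `split_scheme_as_quoted(".://")` returns `('.', '//')`, whereas `split_scheme_strict(".://")` returns `('', '.://')` and leaves the input unsplit.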

This bug is reproducible with the following snippet:

    >>> from urllib.parse import urlparse
    >>> urlparse(".://") # Should error, but doesn't
    ParseResult(scheme='.', netloc='', path='', params='', query='', fragment='')
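The same check can be extended to every leading character listed earlier. The helper below is a hypothetical sketch (`accepted_leading_chars` and `LEADING` are not part of `urllib`); it reports which of those characters the installed `urlparse` still accepts at the start of a scheme. The result depends on the Python version in use, so no expected output is shown.

```python
from urllib.parse import urlparse

# Characters that are allowed inside a scheme but not as its first character.
LEADING = ".-+0123456789"


def accepted_leading_chars(chars=LEADING):
    """Return the characters that urlparse accepts as the start of a scheme."""
    accepted = []
    for ch in chars:
        result = urlparse(f"{ch}x://example")
        # If parsing kept the test character in the scheme, it was accepted.
        if result.scheme.startswith(ch):
            accepted.append(ch)
    return accepted


if __name__ == "__main__":
    print(accepted_leading_chars())
```

An empty list would indicate that the interpreter rejects all of these as scheme-initial characters.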

My environment

  • CPython versions tested on:
  • Operating system and architecture:
    • Arch Linux x86_64

Labels

  • stdlib: Python modules in the Lib dir
  • type-bug: An unexpected behavior, bug, or error
