[tangentially related to CVE-2023-24329] urlparse does not correctly handle schemes that begin with ASCII digits, '+', '-', and '.' characters

# Background
RFC 3986 defines a scheme like this:
- `scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )`

RFC 2234 defines `ALPHA` like this:
- `ALPHA = %x41-5A / %x61-7A`

The WHATWG URL spec defines a scheme like this:
- "A URL-scheme string must be one [ASCII alpha](https://infra.spec.whatwg.org/#ascii-alpha), followed by zero or more of [ASCII alphanumeric](https://infra.spec.whatwg.org/#ascii-alphanumeric), U+002B (+), U+002D (-), and U+002E (.)."
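Both grammars boil down to the same rule: one ASCII letter, then any number of ASCII letters, digits, `+`, `-`, or `.`. As a minimal sketch, that rule can be written as a regular expression. The helper name `is_valid_scheme` is made up for illustration and is not part of `urllib`:

```python
import re

# One ASCII letter, then zero or more ASCII letters, digits, '+', '-', or '.'.
# The character classes are spelled out so that only ASCII ever matches.
_SCHEME_RE = re.compile(r'[A-Za-z][A-Za-z0-9+.\-]*')

def is_valid_scheme(scheme):
    """Hypothetical helper: True if `scheme` matches the RFC 3986 grammar."""
    return _SCHEME_RE.fullmatch(scheme) is not None
```

With this sketch, `is_valid_scheme("git+ssh")` is `True`, while `is_valid_scheme(".")` and `is_valid_scheme("1http")` are both `False`.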

# The bug
This is the scheme string parsing code from `Lib/urllib/parse.py:462-468`:
```python3
    i = url.find(':')
    if i > 0:
        for c in url[:i]:
            if c not in scheme_chars:
                break
        else:
            scheme, url = url[:i].lower(), url[i+1:]
```
This is the definition of `scheme_chars` from `Lib/urllib/parse.py:77-80`:
```python3
scheme_chars = ('abcdefghijklmnopqrstuvwxyz'
                'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                '0123456789'
                '+-.')
```
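Because the loop tests every character of `url[:i]`, including the first, against `scheme_chars`, it erroneously accepts schemes that begin with any of `('.', '-', '+', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9')`. Both specifications require the first character to be an ASCII letter, so this behavior violates both.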

This bug is reproducible with the following snippet:
```python
>>> from urllib.parse import urlparse
>>> urlparse(".://") # Should error, but doesn't
ParseResult(scheme='.', netloc='', path='', params='', query='', fragment='')
```
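One possible shape for a fix is to require an ASCII letter in the first position before running the existing per-character loop. The following is only a sketch under that assumption, not the change made in the linked PRs. `split_scheme` and `SCHEME_CHARS` are illustrative names, and returning an empty scheme rather than raising is a choice made for this example:

```python
# Same character set as scheme_chars in Lib/urllib/parse.py.
SCHEME_CHARS = ('abcdefghijklmnopqrstuvwxyz'
                'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                '0123456789'
                '+-.')

def split_scheme(url):
    """Hypothetical helper: return (scheme, rest), with scheme '' if invalid."""
    i = url.find(':')
    # Extra check: the first character must be an ASCII letter.
    # str.isalpha() alone would also accept non-ASCII letters.
    if i > 0 and url[0].isascii() and url[0].isalpha():
        for c in url[:i]:
            if c not in SCHEME_CHARS:
                break
        else:
            return url[:i].lower(), url[i+1:]
    # No valid scheme found: leave the input untouched.
    return '', url
```

Under these assumptions, `split_scheme(".://")` returns `('', '.://')`, while `split_scheme("HTTP://example.com")` still returns `('http', '//example.com')`.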

# My environment
- CPython versions tested on:
  - 3.12.0a1+ (fb844e1931bc1ad2f11565fbe25627a1a41b4203)
  - 3.10.8
- Operating system and architecture:
  - Arch Linux x86_64


* PR: gh-99421
* PR: gh-99446
