
Unable to open an S3 object with # in the URL #25945

Closed
@swt2c

Description

import pandas as pd

# Neither form of the URL can be opened:
df = pd.read_csv('s3://bucket/key#1.csv')    # literal '#' in the key
df = pd.read_csv('s3://bucket/key%231.csv')  # '#' percent-encoded as %23

Problem description

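Pandas can't open an S3 object whose URL contains a # sign, whether or not the URL path is percent-encoded. The cause is that urllib.parse.urlparse(), which io/s3.py uses to parse the URL, treats # as the start of the URL fragment, so in the non-percent-encoded case everything from the # onward is dropped from the path. In the percent-encoded case the path survives parsing, but the %23 does not appear to be decoded before the lookup, so the requested key contains the literal text %23 rather than #.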
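The parsing behavior described above can be reproduced with the standard library alone. This is a minimal sketch that only calls urllib.parse.urlparse() on the two example URLs; it does not reproduce the actual code in io/s3.py, and the variable names are illustrative.

```python
from urllib.parse import urlparse

# Literal '#': urlparse splits the URL at '#' and stores the rest as the fragment.
raw = urlparse("s3://bucket/key#1.csv")
print(raw.netloc, raw.path, raw.fragment)

# Percent-encoded '#': the path is kept intact, but %23 is not decoded.
encoded = urlparse("s3://bucket/key%231.csv")
print(encoded.netloc, encoded.path, encoded.fragment)
```

In the first case the path is only `/key` and `1.csv` ends up in the fragment; in the second case the path is `/key%231.csv` with an empty fragment.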

I see two possible solutions to the problem, but I'm not sure which one is best, since there does not seem to be a 'specification' for the S3 URL scheme (at least that I can find):

  1. Use allow_fragments=False when calling urllib.parse.urlparse(). This would make the non-percent-encoded case work, but seems slightly wrong.
  2. Call urllib.parse.unquote() on S3 paths before passing them to s3fs. s3fs seems to want just a bucket/key as input, so pandas would have to remove the percent encoding. This would make the percent-encoded case work. It seems a bit more correct, but it could change existing behavior for users who are loading URLs with literal % characters in them.
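The two options can be compared side by side. This is a rough sketch under the assumption that the parsed URL is turned into a bucket/key string by joining netloc and path; the helper names bucket_key_no_fragments and bucket_key_unquoted are hypothetical and are not part of pandas or s3fs.

```python
from urllib.parse import urlparse, unquote


def bucket_key_no_fragments(url):
    """Option 1 (hypothetical helper): do not treat '#' as a fragment marker."""
    parsed = urlparse(url, allow_fragments=False)
    return parsed.netloc + parsed.path


def bucket_key_unquoted(url):
    """Option 2 (hypothetical helper): decode percent escapes in the path."""
    parsed = urlparse(url)
    return parsed.netloc + unquote(parsed.path)


# Option 1 keeps the literal '#' in the key.
print(bucket_key_no_fragments("s3://bucket/key#1.csv"))

# Option 2 turns %23 back into '#'.
print(bucket_key_unquoted("s3://bucket/key%231.csv"))

# Side effect of option 2: any other percent escape in the key is decoded too,
# so a key that really contains the text '%25' would no longer be reachable.
print(bucket_key_unquoted("s3://bucket/100%25.csv"))
```

Both helpers yield `bucket/key#1.csv` for their respective inputs. The last call shows the compatibility concern raised for option 2: the key `100%25.csv` becomes `100%.csv` after unquoting.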


Labels: Bug, IO Data
