Skip to content

WIP/RFC: added parser for multipart form #264

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from
Closed

WIP/RFC: added parser for multipart form #264

wants to merge 8 commits into from

Conversation

piever
Copy link
Contributor

@piever piever commented Jul 1, 2018

This is an attempt to solve #263. I've tested with a simple form with a textbox and a file input (I updated a PNG) and it seems to work fine (still need to add tests though).

The "content" of each section add a newline and a couple of dashes at the end that I'm removing (\r\n--, though the type of newline may vary I guess), though maybe I'm not 100% correct in how I do so (on whether they should be removed at all): I'm taking out the match of r"[\r\n]*-*$"

@piever
Copy link
Contributor Author

piever commented Jul 1, 2018

I'm also not sure whether I can be certain that req.body::Array{UInt8} or if it could also happen to a be a String or something else (and also whether I should be working with a String or Vector{UInt8}: string seems to cause problems with "continuation bytes").

@piever
Copy link
Contributor Author

piever commented Jul 1, 2018

Actually, it seems that it's fine to work with strings, the continuation byte is only a problem in Julia 0.6

julia> v = [0x89, 0x00]
2-element Array{UInt8,1}:
 0x89
 0x00

julia> s = String(v)
2-byte String of invalid UTF-8 data:
 0x89
 0x00

julia> SubString(s, 1, 2)
ERROR: ArgumentError: invalid SubString index

whereas in Julia 0.7 this works fine,

Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks like the right direction. Thanks for working on this! I left a few comments along the way.

body = String(body)
dict = Dict(
"name" => String[],
"filename" => Union{String, Void}[],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Nothing instead of Void

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so I should be coding with just Julia 0.7 in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely. HTTP.jl has stayed master-compatible since 0.6 branched.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that at this point, with 0.7 beta out, feature addition PRs for HTTP.jl could(should?) be 0.7-only. i.e. we should put julia 0.7 in REQUIRE soon anyway.

end

function parse_multipart_form(req::Request)
m = match(r"multipart/form-data; boundary=(.*)$", req["Content-Type"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will throw a KeyError if the Request doesn't have a Content-Type header; it seems like we'd rather return nothing instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that req[str] returns "" if the key is missing, so this should work fine.

end

parse_multipart_body(body::AbstractArray{UInt8}, boundary) =
parse_multipart_body(String(body), boundary)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that in 0.7 this is destructive; i.e. w/ body as a Vector{UInt8}, it will be turned into a Vector{UInt8}(0) after the String is created. Not saying that's necessarily bad, but if someone tried to access body again after calling this, it would not be what they expected. We should at least have docs that clearly spelled this behavior out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, at least the name should have a !, will try to run this code in 0.7 (I still haven't transitioned)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the alternative? String(copy(body)) ?


function parse_multipart_body(body, boundary)
body = String(body)
dict = Dict(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like my personal preference would be to return a Vector{Multipart} where Multipart was a simple struct like:

struct Multipart
    name::String
    filename::Union{Nothing, String}
    contenttype::Union{Nothing, String}
    content::Vector{UInt8}
end

It just seems like it would be a bit simpler on the server side if I could just call this function and iterate over the multparts, dealing w/ each as needed, instead of needing to do something like:

for i in 1:length(multiparts)
    nm = multiparts["name"][i]
    filename = multiparts["filename"][i]
    # etc....
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There already is a Multipart struct which I'd like to reuse, except it stores the content as IO object, whereas for me a Array (actually SubArray) of UInt8 would be better. Should I simply write my SubArray to a IOBuffer and put that instead?

@@ -0,0 +1,52 @@
function parse_multipart_chunk!(chunk, d)
re = r"(\n){2,}|(\r){2,}|[\r\n]{4,}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be a little clearer to put all the regexes as top-level const BOUNDARY_REGEX = r"(\n){2,}..., so it's clear they're const and they have more descriptive names.

push!(d["contenttype"], v[3])
else
v = match(r"Content-Disposition: form-data; name=\"(.*)\"", description)
v == nothing && return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This v == nothing && return makes me uncomfortable because we've basically gotten into a bad state and we're just returning. I think we probably need to read up on the spec to see exactly what is and isn't required in terms of the Content-Disposition: form-data; tags and make sure we're following that. Is name _always required? What if just name and Content-Type are present? Seems like we want to make sure we're accounting for what the spec allows here.

str = Parsers.nextbytes(re, body)
while Parsers.exec(re, str)
chunk = str[1:re.ovec[1]]
isempty(chunk) || parse_multipart_chunk!(chunk, dict)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the scenario where isempty(chunk) is true? It seems like if Parsers.exec(re, str) returns true, then it indeed matched a boundary, and yet this is saying that we didn't find the boundary? Maybe the regex needs a tweak here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If re matches the beginning of the string, then I guess we'd have str[1:re.ovec[1]] == "" ? But I agree that this should actually not happen in practice and parse_multipart_chunk! should probably still work when given an empty string.

"content" => Any[]
)
re = Regex(boundary)
Parsers.exec(re, body) || return nothing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If theParsers module is only being used for exec, nextbytes etc then this file should just do include("parseutils.jl") instead of saying Parsers.exec(....

@samoconnor
Copy link
Contributor

A few things it would be good to have:

  • Description (in docstrings) of what this PR implements in terms of the RFC(s).
  • Description of what parts of the RFC(s) that the implementation does not cover (if any).
  • Tests written to cover whatever behaviour is specified in the RFC(s).
  • User API docstrings and examples.
  • Quick comparison with how other HTTP libraries do this API.

@piever
Copy link
Contributor Author

piever commented Jul 3, 2018

Yes, I still need to add tests and documentation.

I'm definitely not an expert on internet standards, is there some good source where I could see what I should expect to be present or not in a multipart form body?

@samoconnor
Copy link
Contributor

@piever
Copy link
Contributor Author

piever commented Jul 6, 2018

It's probably better to develop this only for 0.7 also because string processing seems to have change in ways I'm not sure I understand. When using:

HTTP.Parsers.exec(re, body)

I keep getting a ERROR: LoadError: PCRE.exec error: UTF-8 error: isolated byte with 0x80 bit set which I'm not sure how to get rid of. If needed I can upload a file with a saved body of an example request that generates this error. Is exec(re, body) the preferred way to separate the body in chunks according to the boundary or am I missing some better alternative?

EDIT: this is only a concern in the regexes to identify the "file content", which is the part where irregular characters may occur, so it would not be too bad to do that part by hand (it's just splitting by boundary first and then by two returns).

@codecov-io
Copy link

codecov-io commented Jul 8, 2018

Codecov Report

Merging #264 into master will decrease coverage by 3.35%.
The diff coverage is 4.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #264      +/-   ##
==========================================
- Coverage   71.84%   68.49%   -3.36%     
==========================================
  Files          35       36       +1     
  Lines        1964     2028      +64     
==========================================
- Hits         1411     1389      -22     
- Misses        553      639      +86
Impacted Files Coverage Δ
src/HTTP.jl 77.27% <ø> (ø) ⬆️
src/parsemultipart.jl 0% <0%> (ø)
src/Parsers.jl 96.72% <100%> (ø) ⬆️
src/multipart.jl 65.57% <66.66%> (-1.1%) ⬇️
src/debug.jl 14.28% <0%> (-7.15%) ⬇️
src/compat.jl 73.68% <0%> (-5.27%) ⬇️
src/Servers.jl 69.23% <0%> (-4.95%) ⬇️
src/Handlers.jl 91.11% <0%> (-4.45%) ⬇️
src/ConnectionPool.jl 74.22% <0%> (-3.56%) ⬇️
src/Streams.jl 83.19% <0%> (-0.85%) ⬇️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5630bf4...a833cef. Read the comment docs.

@piever
Copy link
Contributor Author

piever commented Jul 8, 2018

Still needs tests and docs but should have addressed most things. Now I no longer convert to strings before parsing but work directly with vectors of bytes, as otherwise it screws up if the UInt8 encoding of the file has invalid characters.

In terms of header handling, I though the following seemed best according to the RFC linked above (especially https://tools.ietf.org/html/rfc7578):

  • The header has to say Content-Disposition: form-data
  • The name field is necessary
  • Content-Type is not necessary: omitting it defaults to text/plain
  • filename is not necessary

The main difference with the Multipart type as implemented originally in this package are:

  • Multipart has no name field: I've added it and I think it should be considered in writemultipartheader function
  • Multipart assumes that data is of type IO: so far I've simply put my SubArray{UInt8} in a IOBuffer
  • Multipart assumes that filename is necessary whereas I don't think it should be, especially because some fields of the form may not be files
  • there is some inconsistency about how to handle missing fields: I've been using nothing and HTTP uses "", maybe it's simpler to always use "" esp. for backward compatibility
  • unlike Multipart struct I don't really have the field Content-Transfer-Encoding: and looking at https://tools.ietf.org/html/rfc7578 it seems like it's deprecated, so I'm not sure what to do about that

I'm curious to know what's the best way to proceed on the above points.

@samoconnor samoconnor mentioned this pull request Jul 10, 2018
@acdupont
Copy link
Contributor

Is this PR abandoned?

@quinnj
Copy link
Member

quinnj commented Dec 6, 2019

Superseded by #427

@quinnj quinnj closed this Dec 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants