-
Notifications
You must be signed in to change notification settings - Fork 181
WIP/RFC: added parser for multipart form #264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'm also not sure whether I can be certain that |
Actually, it seems that it's fine to work with strings, the continuation byte is only a problem in Julia 0.6 julia> v = [0x89, 0x00]
2-element Array{UInt8,1}:
0x89
0x00
julia> s = String(v)
2-byte String of invalid UTF-8 data:
0x89
0x00
julia> SubString(s, 1, 2)
ERROR: ArgumentError: invalid SubString index whereas in Julia 0.7 this works fine, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks like the right direction. Thanks for working on this! I left a few comments along the way.
src/parsemultipart.jl
Outdated
body = String(body) | ||
dict = Dict( | ||
"name" => String[], | ||
"filename" => Union{String, Void}[], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use Nothing
instead of Void
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so I should be coding with just Julia 0.7 in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely. HTTP.jl has stayed master-compatible since 0.6 branched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say that at this point, with 0.7 beta out, feature addition PRs for HTTP.jl could(should?) be 0.7-only. i.e. we should put julia 0.7
in REQUIRE
soon anyway.
src/parsemultipart.jl
Outdated
end | ||
|
||
function parse_multipart_form(req::Request) | ||
m = match(r"multipart/form-data; boundary=(.*)$", req["Content-Type"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will throw a KeyError if the Request
doesn't have a Content-Type
header; it seems like we'd rather return nothing
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that req[str]
returns ""
if the key is missing, so this should work fine.
src/parsemultipart.jl
Outdated
end | ||
|
||
parse_multipart_body(body::AbstractArray{UInt8}, boundary) = | ||
parse_multipart_body(String(body), boundary) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that in 0.7 this is destructive; i.e. w/ body
as a Vector{UInt8}
, it will be turned into a Vector{UInt8}(0)
after the String is created. Not saying that's necessarily bad, but if someone tried to access body
again after calling this, it would not be what they expected. We should at least have docs that clearly spelled this behavior out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, at least the name should have a !
, will try to run this code in 0.7 (I still haven't transitioned)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the alternative? String(copy(body))
?
src/parsemultipart.jl
Outdated
|
||
function parse_multipart_body(body, boundary) | ||
body = String(body) | ||
dict = Dict( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like my personal preference would be to return a Vector{Multipart}
where Multipart
was a simple struct like:
struct Multipart
name::String
filename::Union{Nothing, String}
contenttype::Union{Nothing, String}
content::Vector{UInt8}
end
It just seems like it would be a bit simpler on the server side if I could just call this function and iterate over the multparts, dealing w/ each as needed, instead of needing to do something like:
for i in 1:length(multiparts)
nm = multiparts["name"][i]
filename = multiparts["filename"][i]
# etc....
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There already is a Multipart
struct which I'd like to reuse, except it stores the content as IO
object, whereas for me a Array
(actually SubArray
) of UInt8
would be better. Should I simply write my SubArray
to a IOBuffer
and put that instead?
src/parsemultipart.jl
Outdated
@@ -0,0 +1,52 @@ | |||
function parse_multipart_chunk!(chunk, d) | |||
re = r"(\n){2,}|(\r){2,}|[\r\n]{4,}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be a little clearer to put all the regexes as top-level const BOUNDARY_REGEX = r"(\n){2,}...
, so it's clear they're const and they have more descriptive names.
src/parsemultipart.jl
Outdated
push!(d["contenttype"], v[3]) | ||
else | ||
v = match(r"Content-Disposition: form-data; name=\"(.*)\"", description) | ||
v == nothing && return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This v == nothing && return
makes me uncomfortable because we've basically gotten into a bad state and we're just returning. I think we probably need to read up on the spec to see exactly what is and isn't required in terms of the Content-Disposition: form-data;
tags and make sure we're following that. Is name
_always required? What if just name
and Content-Type
are present? Seems like we want to make sure we're accounting for what the spec allows here.
src/parsemultipart.jl
Outdated
str = Parsers.nextbytes(re, body) | ||
while Parsers.exec(re, str) | ||
chunk = str[1:re.ovec[1]] | ||
isempty(chunk) || parse_multipart_chunk!(chunk, dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the scenario where isempty(chunk)
is true? It seems like if Parsers.exec(re, str)
returns true, then it indeed matched a boundary, and yet this is saying that we didn't find the boundary? Maybe the regex needs a tweak here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If re
matches the beginning of the string, then I guess we'd have str[1:re.ovec[1]] == ""
? But I agree that this should actually not happen in practice and parse_multipart_chunk!
should probably still work when given an empty string.
src/parsemultipart.jl
Outdated
"content" => Any[] | ||
) | ||
re = Regex(boundary) | ||
Parsers.exec(re, body) || return nothing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If theParsers
module is only being used for exec
, nextbytes
etc then this file should just do include("parseutils.jl")
instead of saying Parsers.exec(...
.
A few things it would be good to have:
|
Yes, I still need to add tests and documentation. I'm definitely not an expert on internet standards, is there some good source where I could see what I should expect to be present or not in a multipart form body? |
It's probably better to develop this only for 0.7 also because string processing seems to have change in ways I'm not sure I understand. When using:
I keep getting a EDIT: this is only a concern in the regexes to identify the "file content", which is the part where irregular characters may occur, so it would not be too bad to do that part by hand (it's just splitting by boundary first and then by two returns). |
Codecov Report
@@ Coverage Diff @@
## master #264 +/- ##
==========================================
- Coverage 71.84% 68.49% -3.36%
==========================================
Files 35 36 +1
Lines 1964 2028 +64
==========================================
- Hits 1411 1389 -22
- Misses 553 639 +86
Continue to review full report at Codecov.
|
Still needs tests and docs but should have addressed most things. Now I no longer convert to strings before parsing but work directly with vectors of bytes, as otherwise it screws up if the In terms of header handling, I though the following seemed best according to the RFC linked above (especially https://tools.ietf.org/html/rfc7578):
The main difference with the
I'm curious to know what's the best way to proceed on the above points. |
Is this PR abandoned? |
Superseded by #427 |
This is an attempt to solve #263. I've tested with a simple form with a textbox and a file input (I updated a PNG) and it seems to work fine (still need to add tests though).
The "content" of each section add a newline and a couple of dashes at the end that I'm removing (
\r\n--
, though the type of newline may vary I guess), though maybe I'm not 100% correct in how I do so (on whether they should be removed at all): I'm taking out the match ofr"[\r\n]*-*$"