-
Notifications
You must be signed in to change notification settings - Fork 181
WIP/RFC: added parser for multipart form #427
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #427 +/- ##
==========================================
+ Coverage 73.59% 74.07% +0.47%
==========================================
Files 34 35 +1
Lines 1909 1967 +58
==========================================
+ Hits 1405 1457 +52
- Misses 504 510 +6
Continue to review full report at Codecov.
|
Thanks for picking this up @acdupont! Just let me know when you think this is ready to review/merge and I can take a look. |
Hi, any progress in merging of this feature? |
We are using the feature in our own code base, but I haven't had the time to write tests using RFCs as a reference. I am still planning on doing this. |
…last 4 bytes of chunk
dae340a
to
5a55f81
Compare
src/parsemultipart.jl
Outdated
m = match(r"^(-*)(.*)$", boundary) | ||
m === nothing && return nothing | ||
d, str = m[1], m[2] | ||
find_boundaries(bytes, unsafe_wrap(Array{UInt8, 1}, String(str)), length(d); start = start) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This unsafe_wrap
combined with the String
construction looks a bit odd. Is it necessary? (I wondered if it required a GC.@preserve
but it seems the resulting array knows about the string as the data owner so you should be safe there.)
Also, can you replace the regex match with a simple search findfirst(c->c != '-', boundary)
? It looks match(r"^(-*)(.*)$", boundary)
will always succeed except when there's a newline in boundary
. Is that the intended behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added my first test this evening, I'll take a look at these two items next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a refactor that addressed the use of the regular expression and unsafe_wrap, the code is completely different. I added comments referencing RFC 2046 and RFC 7578. I am not finished refactoring though:
- I am looking at the code that parses the "chunk" now.
- I'd like to change the find_boundary naive string parsing to something like Knuth–Morris–Pratt algorithm (or is there a regex that can search byte arrays in Julia?)
Also, I changed the code that writes to an IO stored in the Multipart to creating an IOBuffer around a view of the body sent from the Request.
d7d54b2#diff-1be8563c7ec166272d8a3e9201b93db6R105
I point this out because I don't know if that will cause issues in real code once the Request goes away, although my test ran fine. Does that look okay?
make content-* regexes case insensitive make content-type regex to parse with and without semi-colon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will certainly be useful for us, thanks!
Overall I think it could be worth moving some of the header parsing into Parsers.jl
and carefully examining the relevant RFCs for the allowed forms of the header key value pairs like name="blah"
and boundary=----whatever--
(seem to be called "parameters" and "attribute value pairs" in the RFCs though I am no expert). Then write+test a generic parser for this part of the header which you can then use in your multipart parser.
|
||
function Base.show(io::IO, m::Multipart{T}) where {T} | ||
items = ["data=::$T", "contenttype=\"$(m.contenttype)\"", "contenttransferencoding=\"$(m.contenttransferencoding)\")"] | ||
isnothing(m.filename) || insert!(items, 1, "filename=\"$(m.filename)\"") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isnothing(m.filename) || insert!(items, 1, "filename=\"$(m.filename)\"") | |
isnothing(m.filename) || pushfirst!(items, "filename=\"$(m.filename)\"") |
|
||
function writemultipartheader(io::IOBuffer, i::Multipart) | ||
write(io, "; filename=\"$(i.filename)\"\r\n") | ||
isnothing(i.filename) || write(io, "; filename=\"$(i.filename)\"\r\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens to the new i.name
here? Is it ignored or handled some other way?
const FORMDATA_REGEX = r"(?i)Content-Disposition: form-data" | ||
const NAME_REGEX = r" name=\"(.*?)\"" | ||
const FILENAME_REGEX = r" filename=\"(.*?)\"" | ||
const CONTENTTYPE_REGEX = r"(?i)Content-Type: (\S*[^;\s])" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was possible to inline these at the use site without a performance penalty (the r_str
string macro compiles the regex at macro expansion time, not at runtime).
Having said that, the code in Parsers.jl does otherwise but I'm not sure why.
error("no delimiter found separating header from multipart body") | ||
end | ||
|
||
function chunk2Multipart(chunk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is named a little unusually. Maybe parse_multipart_chunk
?
headers = String(view(chunk, startIndex:endIndex)) | ||
content = view(chunk, endIndex+1:lastindex(chunk)) | ||
|
||
occursin(FORMDATA_REGEX, headers) || return # Specifying content disposition is mandatory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what to do here for error handling. Dropping the input completely is probably the safest from a security/robustness point of view but doesn't give anything to go on when debugging.
|
||
const FORMDATA_REGEX = r"(?i)Content-Disposition: form-data" | ||
const NAME_REGEX = r" name=\"(.*?)\"" | ||
const FILENAME_REGEX = r" filename=\"(.*?)\"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this doesn't handle name="escaped\"quote"
and name=unquoted
which appear to be allowed by RFC 7231. See https://tools.ietf.org/html/rfc7231#appendix-D :
parameter = token "=" ( token / quoted-string )
But I'm confused as to whether this applies directly to parsing multipart header parameters and a quick trip down the RFC rabbit hole to https://tools.ietf.org/html/rfc7578 left me a bit confused.
Is there a definitive definition of how the parameters should be encoded somewhere in an RFC?
catch exception | ||
@error "" typeof(exception) exception | ||
@test false | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need this try-catch (@test
does this for you), and you should test the actual output from show
(with a regex if necessary. Eg
@test sprint(show, HTTP.Multipart(nothing, IOBuffer("some data"), "plain/text", "", "testname")) == "some-string"
also would be nice to cover a couple of other code paths in show
.
|
||
@testset "constructor" begin | ||
@testset "don't allow String for data" begin | ||
@test_throws MethodError HTTP.Multipart(nothing, "some data", "plain/text", "", "testname") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably don't need the nested testsets here, (unless you're going to add a bunch more tests?)
|
||
|
||
function generateTestBody() | ||
IOBuffer("----------------------------918073721150061572809433\r\nContent-Disposition: form-data; name=\"namevalue\"; filename=\"multipart.txt\"\r\nContent-Type: text/plain\r\n\r\nnot much to say\n\r\n----------------------------918073721150061572809433\r\nContent-Disposition: form-data; name=\"key1\"\r\n\r\n1\r\n----------------------------918073721150061572809433\r\nContent-Disposition: form-data; name=\"key2\"\r\n\r\nkey the second\r\n----------------------------918073721150061572809433\r\nContent-Disposition: form-data; name=\"namevalue2\"; filename=\"multipart-leading-newline.txt\"\r\nContent-Type: text/plain\r\n\r\n\nfile with leading newline\n\r\n----------------------------918073721150061572809433--\r\n").data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Function name should be generate_test_body
? Ditto for generateTestRequest
, etc.
Also, can/should you use the @b_str
macro here rather than the IOBuffer()
trick?
@amellnik I haven't had time to get back to this, it would be awesome if you picked it up. |
That would be fantastic @amellnik. I don't have time to work on this myself (other than providing some review), but we're using a hacked-up version of this in our own application-specific repo and it has proved useful so far. |
I'm very interested in this feature, how can anyone help? Is the next step now to go trough the review? |
The questions on August 16th from @c42f are the next iteration that need to be addressed. |
I'm looking at implementing the changes mentioned by @c42f and have some questions. First about the regex being defined at the top of the file. Should I leave that? My own tests say that if you do an r"" at the call the regex compile is only called once, but I'm not an expert. Second my read of https://tools.ietf.org/html/rfc2183 suggests that he was correct we should handle the non quoted file name. Which brings in a whole need to actually parse the line. Is there a minimum version of Julia HTTP should work with? I ask because there is a findnext() method that takes a character but that was only introduced in 1.3. Lastly I'm not exactly sure how to submit my changes to acdupont's pull request....I could issue pull request to his repository or create a new pull request to this repository from my repository which was created of his repository...or is there something else? I'm not real knowledgeable about github or git for that matter....I know enough to be dangerous but that's all I claim to know. |
Oh one more thing, there was a comment that dropping the multipart while correct may make it difficult to debug. Should I add a @warn message? I see some use of @warn but they don't seem to be around malformed HTTP requests. Is there another way I should indicate that the request does not appear valid? |
I would change it; if the regexes are only used once it should be clearer and just as efficient to move them to the use site.
You can find a hint in Project.toml — HTTP still claims compatibility with 0.7 so I think it's safe to say you need to be conservative for now. I think you may use the predicate form
I suggest it's simplest to create a new PR with your own repository and branch as the source by pulling this branch from @acdupont's repository and building on top of it in your own repo. Just put a link to this discussion on the new PR with a quick mention of the history of the code. |
This was finished/merged in #514 |
I'll work on any issues to get this PR accepted