Skip to content

WIP/RFC: added parser for multipart form #427

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 26 commits into from

Conversation

acdupont
Copy link
Contributor

I'll work on any issues to get this PR accepted

@codecov-io
Copy link

codecov-io commented Jun 17, 2019

Codecov Report

Merging #427 into master will increase coverage by 0.47%.
The diff coverage is 94.2%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #427      +/-   ##
==========================================
+ Coverage   73.59%   74.07%   +0.47%     
==========================================
  Files          34       35       +1     
  Lines        1909     1967      +58     
==========================================
+ Hits         1405     1457      +52     
- Misses        504      510       +6
Impacted Files Coverage Δ
src/HTTP.jl 55.17% <ø> (ø) ⬆️
src/multipart.jl 8.21% <75%> (+8.21%) ⬆️
src/parsemultipart.jl 96.72% <96.72%> (ø)
src/parseutils.jl 66.66% <0%> (-25.65%) ⬇️
src/StreamRequest.jl 88.46% <0%> (-3.85%) ⬇️
src/IOExtras.jl 57.14% <0%> (-3.58%) ⬇️
src/Servers.jl 56.25% <0%> (-2.68%) ⬇️
src/cookies.jl 91.19% <0%> (-1.26%) ⬇️
src/sniff.jl 92.72% <0%> (-0.97%) ⬇️
src/AWS4AuthRequest.jl 81.13% <0%> (-0.35%) ⬇️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c27eb0c...461a51f. Read the comment docs.

@quinnj
Copy link
Member

quinnj commented Jun 17, 2019

Thanks for picking this up @acdupont! Just let me know when you think this is ready to review/merge and I can take a look.

@rssdev10
Copy link

Hi, any progress in merging of this feature?

@acdupont
Copy link
Contributor Author

We are using the feature in our own code base, but I haven't had the time to write tests using RFCs as a reference. I am still planning on doing this.

@acdupont acdupont force-pushed the parseform branch 2 times, most recently from dae340a to 5a55f81 Compare July 23, 2019 18:06
m = match(r"^(-*)(.*)$", boundary)
m === nothing && return nothing
d, str = m[1], m[2]
find_boundaries(bytes, unsafe_wrap(Array{UInt8, 1}, String(str)), length(d); start = start)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unsafe_wrap combined with the String construction looks a bit odd. Is it necessary? (I wondered if it required a GC.@preserve but it seems the resulting array knows about the string as the data owner so you should be safe there.)

Also, can you replace the regex match with a simple search findfirst(c->c != '-', boundary)? It looks match(r"^(-*)(.*)$", boundary) will always succeed except when there's a newline in boundary. Is that the intended behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added my first test this evening, I'll take a look at these two items next.

Copy link
Contributor Author

@acdupont acdupont Aug 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a refactor that addressed the use of the regular expression and unsafe_wrap, the code is completely different. I added comments referencing RFC 2046 and RFC 7578. I am not finished refactoring though:

  • I am looking at the code that parses the "chunk" now.
  • I'd like to change the find_boundary naive string parsing to something like Knuth–Morris–Pratt algorithm (or is there a regex that can search byte arrays in Julia?)

Also, I changed the code that writes to an IO stored in the Multipart to creating an IOBuffer around a view of the body sent from the Request.
d7d54b2#diff-1be8563c7ec166272d8a3e9201b93db6R105

I point this out because I don't know if that will cause issues in real code once the Request goes away, although my test ran fine. Does that look okay?

@acdupont
Copy link
Contributor Author

@quinnj and @c42f, this PR is at a state where it can be reviewed.

Copy link
Contributor

@c42f c42f left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will certainly be useful for us, thanks!

Overall I think it could be worth moving some of the header parsing into Parsers.jl and carefully examining the relevant RFCs for the allowed forms of the header key value pairs like name="blah" and boundary=----whatever-- (seem to be called "parameters" and "attribute value pairs" in the RFCs though I am no expert). Then write+test a generic parser for this part of the header which you can then use in your multipart parser.


function Base.show(io::IO, m::Multipart{T}) where {T}
items = ["data=::$T", "contenttype=\"$(m.contenttype)\"", "contenttransferencoding=\"$(m.contenttransferencoding)\")"]
isnothing(m.filename) || insert!(items, 1, "filename=\"$(m.filename)\"")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
isnothing(m.filename) || insert!(items, 1, "filename=\"$(m.filename)\"")
isnothing(m.filename) || pushfirst!(items, "filename=\"$(m.filename)\"")


function writemultipartheader(io::IOBuffer, i::Multipart)
write(io, "; filename=\"$(i.filename)\"\r\n")
isnothing(i.filename) || write(io, "; filename=\"$(i.filename)\"\r\n")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens to the new i.name here? Is it ignored or handled some other way?

const FORMDATA_REGEX = r"(?i)Content-Disposition: form-data"
const NAME_REGEX = r" name=\"(.*?)\""
const FILENAME_REGEX = r" filename=\"(.*?)\""
const CONTENTTYPE_REGEX = r"(?i)Content-Type: (\S*[^;\s])"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was possible to inline these at the use site without a performance penalty (the r_str string macro compiles the regex at macro expansion time, not at runtime).

Having said that, the code in Parsers.jl does otherwise but I'm not sure why.

error("no delimiter found separating header from multipart body")
end

function chunk2Multipart(chunk)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is named a little unusually. Maybe parse_multipart_chunk?

headers = String(view(chunk, startIndex:endIndex))
content = view(chunk, endIndex+1:lastindex(chunk))

occursin(FORMDATA_REGEX, headers) || return # Specifying content disposition is mandatory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what to do here for error handling. Dropping the input completely is probably the safest from a security/robustness point of view but doesn't give anything to go on when debugging.


const FORMDATA_REGEX = r"(?i)Content-Disposition: form-data"
const NAME_REGEX = r" name=\"(.*?)\""
const FILENAME_REGEX = r" filename=\"(.*?)\""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this doesn't handle name="escaped\"quote" and name=unquoted which appear to be allowed by RFC 7231. See https://tools.ietf.org/html/rfc7231#appendix-D :

parameter = token "=" ( token / quoted-string )

But I'm confused as to whether this applies directly to parsing multipart header parameters and a quick trip down the RFC rabbit hole to https://tools.ietf.org/html/rfc7578 left me a bit confused.

Is there a definitive definition of how the parameters should be encoded somewhere in an RFC?

catch exception
@error "" typeof(exception) exception
@test false
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need this try-catch (@test does this for you), and you should test the actual output from show (with a regex if necessary. Eg

@test sprint(show, HTTP.Multipart(nothing, IOBuffer("some data"), "plain/text", "", "testname")) == "some-string"

also would be nice to cover a couple of other code paths in show.


@testset "constructor" begin
@testset "don't allow String for data" begin
@test_throws MethodError HTTP.Multipart(nothing, "some data", "plain/text", "", "testname")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably don't need the nested testsets here, (unless you're going to add a bunch more tests?)



function generateTestBody()
IOBuffer("----------------------------918073721150061572809433\r\nContent-Disposition: form-data; name=\"namevalue\"; filename=\"multipart.txt\"\r\nContent-Type: text/plain\r\n\r\nnot much to say\n\r\n----------------------------918073721150061572809433\r\nContent-Disposition: form-data; name=\"key1\"\r\n\r\n1\r\n----------------------------918073721150061572809433\r\nContent-Disposition: form-data; name=\"key2\"\r\n\r\nkey the second\r\n----------------------------918073721150061572809433\r\nContent-Disposition: form-data; name=\"namevalue2\"; filename=\"multipart-leading-newline.txt\"\r\nContent-Type: text/plain\r\n\r\n\nfile with leading newline\n\r\n----------------------------918073721150061572809433--\r\n").data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function name should be generate_test_body? Ditto for generateTestRequest, etc.

Also, can/should you use the @b_str macro here rather than the IOBuffer() trick?

@amellnik
Copy link

@acdupont Are you still working on this? If you would like, I can take a pass at the issues that @c42f mentioned.

@acdupont
Copy link
Contributor Author

@amellnik I haven't had time to get back to this, it would be awesome if you picked it up.

@c42f
Copy link
Contributor

c42f commented Oct 22, 2019

That would be fantastic @amellnik. I don't have time to work on this myself (other than providing some review), but we're using a hacked-up version of this in our own application-specific repo and it has proved useful so far.

@nlw0
Copy link

nlw0 commented Jan 3, 2020

I'm very interested in this feature, how can anyone help? Is the next step now to go trough the review?

@acdupont
Copy link
Contributor Author

acdupont commented Jan 3, 2020

The questions on August 16th from @c42f are the next iteration that need to be addressed.

@pixel27
Copy link
Contributor

pixel27 commented Jan 4, 2020

I'm looking at implementing the changes mentioned by @c42f and have some questions.

First about the regex being defined at the top of the file. Should I leave that? My own tests say that if you do an r"" at the call the regex compile is only called once, but I'm not an expert.

Second my read of https://tools.ietf.org/html/rfc2183 suggests that he was correct we should handle the non quoted file name. Which brings in a whole need to actually parse the line. Is there a minimum version of Julia HTTP should work with? I ask because there is a findnext() method that takes a character but that was only introduced in 1.3.

Lastly I'm not exactly sure how to submit my changes to acdupont's pull request....I could issue pull request to his repository or create a new pull request to this repository from my repository which was created of his repository...or is there something else? I'm not real knowledgeable about github or git for that matter....I know enough to be dangerous but that's all I claim to know.

@pixel27
Copy link
Contributor

pixel27 commented Jan 4, 2020

Oh one more thing, there was a comment that dropping the multipart while correct may make it difficult to debug. Should I add a @warn message? I see some use of @warn but they don't seem to be around malformed HTTP requests. Is there another way I should indicate that the request does not appear valid?

@c42f
Copy link
Contributor

c42f commented Jan 7, 2020

First about the regex being defined at the top of the file. Should I leave that

I would change it; if the regexes are only used once it should be clearer and just as efficient to move them to the use site.

Is there a minimum version of Julia HTTP should work with

You can find a hint in Project.toml — HTTP still claims compatibility with 0.7 so I think it's safe to say you need to be conservative for now. I think you may use the predicate form findnext(==(c), str, i) form for c::Char.

Lastly I'm not exactly sure how to submit my changes to acdupont's pull request....I could issue pull request to his repository or create a new pull request to this repository from my repository which was created of his repository...or is there something else?

I suggest it's simplest to create a new PR with your own repository and branch as the source by pulling this branch from @acdupont's repository and building on top of it in your own repo. Just put a link to this discussion on the new PR with a quick mention of the history of the code.

@quinnj
Copy link
Member

quinnj commented Feb 26, 2020

This was finished/merged in #514

@quinnj quinnj closed this Feb 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants