WIP/RFC: added parser for multipart form #427

acdupont · 2019-06-17T14:41:58Z

Rebase @piever's parseform branch from PR WIP/RFC: added parser for multipart form #264 onto HTTP.jl master.
Add a few tweaks.

I'll work on any issues to get this PR accepted.

codecov-io · 2019-06-17T15:05:38Z

Codecov Report

Merging #427 into master will increase coverage by 0.47%.
The diff coverage is 94.2%.

@@            Coverage Diff             @@
##           master     #427      +/-   ##
==========================================
+ Coverage   73.59%   74.07%   +0.47%     
==========================================
  Files          34       35       +1     
  Lines        1909     1967      +58     
==========================================
+ Hits         1405     1457      +52     
- Misses        504      510       +6

Impacted Files	Coverage Δ
src/HTTP.jl	`55.17% <ø> (ø)`	⬆️
src/multipart.jl	`8.21% <75%> (+8.21%)`	⬆️
src/parsemultipart.jl	`96.72% <96.72%> (ø)`
src/parseutils.jl	`66.66% <0%> (-25.65%)`	⬇️
src/StreamRequest.jl	`88.46% <0%> (-3.85%)`	⬇️
src/IOExtras.jl	`57.14% <0%> (-3.58%)`	⬇️
src/Servers.jl	`56.25% <0%> (-2.68%)`	⬇️
src/cookies.jl	`91.19% <0%> (-1.26%)`	⬇️
src/sniff.jl	`92.72% <0%> (-0.97%)`	⬇️
src/AWS4AuthRequest.jl	`81.13% <0%> (-0.35%)`	⬇️
... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c27eb0c...461a51f. Read the comment docs.

quinnj · 2019-06-17T20:21:33Z

Thanks for picking this up @acdupont! Just let me know when you think this is ready to review/merge and I can take a look.

rssdev10 · 2019-07-18T06:45:07Z

Hi, any progress in merging of this feature?

acdupont · 2019-07-18T18:23:52Z

We are using the feature in our own code base, but I haven't had the time to write tests using RFCs as a reference. I am still planning on doing this.

c42f · 2019-08-06T04:44:29Z

src/parsemultipart.jl

+    m =  match(r"^(-*)(.*)$", boundary)
+    m === nothing && return nothing
+    d, str = m[1], m[2]
+    find_boundaries(bytes, unsafe_wrap(Array{UInt8, 1}, String(str)), length(d); start = start)


This unsafe_wrap combined with the String construction looks a bit odd. Is it necessary? (I wondered if it required a GC.@preserve but it seems the resulting array knows about the string as the data owner so you should be safe there.)

Also, can you replace the regex match with a simple search findfirst(c->c != '-', boundary)? It looks like match(r"^(-*)(.*)$", boundary) will always succeed except when there's a newline in boundary. Is that the intended behavior?
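A minimal sketch of the findfirst-based alternative suggested above, assuming the goal is to separate the leading dashes from the rest of the boundary and to obtain the remainder as bytes without unsafe_wrap. The name split_boundary is hypothetical and not part of HTTP.jl.

```julia
# Hypothetical helper (not HTTP.jl API): split a boundary string into the
# number of leading dashes and the remaining bytes.
function split_boundary(boundary::AbstractString)
    i = findfirst(c -> c != '-', boundary)
    # All dashes (or empty input): nothing remains after the dashes.
    i === nothing && return (length(boundary), UInt8[])
    # '-' is a single byte, so exactly i - 1 dashes precede index i.
    # collect(codeunits(...)) copies the bytes into a plain Vector{UInt8}.
    return (i - 1, collect(codeunits(boundary[i:end])))
end

split_boundary("----abc")   # (4, UInt8[0x61, 0x62, 0x63])

# The regex from the diff fails when the string contains an interior newline:
match(r"^(-*)(.*)$", "ab\ncd") === nothing   # true
```

Unlike the regex, the search cannot fail to match, so the caller only has to decide what an all-dash or empty boundary should mean.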

I added my first test this evening, I'll take a look at these two items next.

I pushed a refactor that addresses the use of the regular expression and unsafe_wrap; the code is now completely different. I added comments referencing RFC 2046 and RFC 7578. I am not finished refactoring, though:

I am looking at the code that parses the "chunk" now.

I'd like to change the naive string search in find_boundary to something like the Knuth–Morris–Pratt algorithm (or is there a regex that can search byte arrays in Julia?).
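For reference, a Knuth–Morris–Pratt search over byte vectors could look roughly like the sketch below. The function names kmp_failure_table and kmp_findall are hypothetical, and both inputs are assumed to use 1-based indexing; this is an illustration of the algorithm, not the code in this PR.

```julia
# fail[i] is the length of the longest proper prefix of pattern[1:i]
# that is also a suffix of pattern[1:i].
function kmp_failure_table(pattern::AbstractVector{UInt8})
    fail = zeros(Int, length(pattern))
    k = 0
    for i in 2:length(pattern)
        while k > 0 && pattern[k + 1] != pattern[i]
            k = fail[k]
        end
        if pattern[k + 1] == pattern[i]
            k += 1
        end
        fail[i] = k
    end
    return fail
end

# Return the start index of every (possibly overlapping) occurrence.
function kmp_findall(pattern::AbstractVector{UInt8}, bytes::AbstractVector{UInt8})
    hits = Int[]
    isempty(pattern) && return hits
    fail = kmp_failure_table(pattern)
    k = 0                       # number of pattern bytes matched so far
    for (i, b) in enumerate(bytes)
        while k > 0 && pattern[k + 1] != b
            k = fail[k]         # fall back without re-reading `bytes`
        end
        if pattern[k + 1] == b
            k += 1
        end
        if k == length(pattern)
            push!(hits, i - k + 1)
            k = fail[k]
        end
    end
    return hits
end

body = Vector{UInt8}("--X\r\na\r\n--X\r\nb\r\n--X--\r\n")
kmp_findall(Vector{UInt8}("--X"), body)   # [1, 9, 17]
```

Each input byte is examined once, so the search is linear in the body length regardless of how often the boundary prefix repeats.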

Also, instead of writing to an IO stored in the Multipart, the code now creates an IOBuffer around a view of the body sent from the Request.
d7d54b2#diff-1be8563c7ec166272d8a3e9201b93db6R105

I point this out because I don't know if that will cause issues in real code once the Request goes away, although my test ran fine. Does that look okay?
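A small sketch of the pattern in question, using a made-up body rather than a real Request. It illustrates two properties of Julia views that bear on the lifetime question: the view keeps a reference to its parent array, and the memory is shared rather than copied.

```julia
body = Vector{UInt8}("header\r\n\r\npayload")

# A view does not copy; it holds a reference to its parent array, so the
# parent stays alive for as long as the view (or the IOBuffer) is reachable.
content = view(body, 11:lastindex(body))
io = IOBuffer(content)

parent(content) === body    # true
String(read(io))            # "payload"

# Because the memory is shared, later writes to `body` are visible via `io`.
body[11] = UInt8('P')
seekstart(io)
String(read(io))            # "Payload"
```

So the buffer should remain valid after the Request itself is dropped, but any code that later mutates or reuses the body array would also change what the Multipart reads.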

acdupont · 2019-08-15T18:37:12Z

@quinnj and @c42f, this PR is at a state where it can be reviewed.

c42f

This will certainly be useful for us, thanks!

Overall I think it could be worth moving some of the header parsing into Parsers.jl and carefully examining the relevant RFCs for the allowed forms of the header key value pairs like name="blah" and boundary=----whatever-- (seem to be called "parameters" and "attribute value pairs" in the RFCs though I am no expert). Then write+test a generic parser for this part of the header which you can then use in your multipart parser.

c42f · 2019-08-16T02:05:59Z

src/multipart.jl

+
+function Base.show(io::IO, m::Multipart{T}) where {T}
+    items = ["data=::$T", "contenttype=\"$(m.contenttype)\"", "contenttransferencoding=\"$(m.contenttransferencoding)\")"]
+    isnothing(m.filename) || insert!(items, 1, "filename=\"$(m.filename)\"")


Suggested change

-    isnothing(m.filename) || insert!(items, 1, "filename=\"$(m.filename)\"")
+    isnothing(m.filename) || pushfirst!(items, "filename=\"$(m.filename)\"")

c42f · 2019-08-16T02:12:14Z

src/multipart.jl


 function writemultipartheader(io::IOBuffer, i::Multipart)
-    write(io, "; filename=\"$(i.filename)\"\r\n")
+    isnothing(i.filename) || write(io, "; filename=\"$(i.filename)\"\r\n")


What happens to the new i.name here? Is it ignored or handled some other way?

c42f · 2019-08-16T02:15:00Z

src/parsemultipart.jl

+const FORMDATA_REGEX = r"(?i)Content-Disposition: form-data"
+const NAME_REGEX = r" name=\"(.*?)\""
+const FILENAME_REGEX = r" filename=\"(.*?)\""
+const CONTENTTYPE_REGEX = r"(?i)Content-Type: (\S*[^;\s])"


I thought it was possible to inline these at the use site without a performance penalty (the r_str string macro compiles the regex at macro expansion time, not at runtime).

Having said that, the code in Parsers.jl does otherwise but I'm not sure why.
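One way to check the claim about regex literals is to compare object identity, as in the sketch below. The function names are hypothetical; the pattern is the CONTENTTYPE_REGEX from the diff.

```julia
# The regex literal is built once, when the macro is expanded, and the same
# Regex object is embedded in the function body.
content_type_regex() = r"(?i)Content-Type: (\S*[^;\s])"

content_type_regex() === content_type_regex()   # true: one shared object

# Calling the constructor at runtime creates a new object on every call.
runtime_regex() = Regex("(?i)Content-Type: (\\S*[^;\\s])")

runtime_regex() === runtime_regex()             # false

m = match(content_type_regex(), "content-type: text/plain; charset=utf-8")
m.captures[1]                                   # "text/plain"
```

This only shows that an inlined literal is not rebuilt per call; it does not explain why other code prefers top-level constants.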

c42f · 2019-08-16T02:17:13Z

src/parsemultipart.jl

+    error("no delimiter found separating header from multipart body")
+end
+
+function chunk2Multipart(chunk)


This function is named a little unusually. Maybe parse_multipart_chunk?

c42f · 2019-08-16T03:45:20Z

src/parsemultipart.jl

+    headers = String(view(chunk, startIndex:endIndex))
+    content = view(chunk, endIndex+1:lastindex(chunk))
+
+    occursin(FORMDATA_REGEX, headers) || return # Specifying content disposition is mandatory


I wonder what to do here for error handling. Dropping the input completely is probably the safest from a security/robustness point of view but doesn't give anything to go on when debugging.

c42f · 2019-08-16T03:46:21Z

src/parsemultipart.jl

+
+const FORMDATA_REGEX = r"(?i)Content-Disposition: form-data"
+const NAME_REGEX = r" name=\"(.*?)\""
+const FILENAME_REGEX = r" filename=\"(.*?)\""


It looks like this doesn't handle name="escaped\"quote" and name=unquoted, which appear to be allowed by RFC 7231. See https://tools.ietf.org/html/rfc7231#appendix-D:

parameter = token "=" ( token / quoted-string )

But I'm not sure whether this applies directly to parsing multipart header parameters, and a quick trip down the RFC rabbit hole to https://tools.ietf.org/html/rfc7578 left me a bit confused.

Is there a definitive definition of how the parameters should be encoded somewhere in an RFC?
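To make the gap concrete, here is a rough sketch of a parameter parser that follows the grammar quoted above, accepting both bare tokens and quoted strings with backslash escapes. It assumes that grammar does apply to multipart headers, which is exactly the open question; PARAM_REGEX and parse_parameters are hypothetical names, not part of HTTP.jl.

```julia
# parameter = token "=" ( token / quoted-string )
# Capture 1: name, capture 2: quoted value (escapes intact), capture 3: bare token.
const PARAM_REGEX = r";\s*([!#$%&'*+\-.^_`|~0-9A-Za-z]+)=(?:\"((?:[^\"\\]|\\.)*)\"|([!#$%&'*+\-.^_`|~0-9A-Za-z]+))"

function parse_parameters(header::AbstractString)
    params = Dict{String,String}()
    for m in eachmatch(PARAM_REGEX, header)
        key = lowercase(m.captures[1])      # parameter names are case-insensitive
        quoted, bare = m.captures[2], m.captures[3]
        # Undo backslash escapes inside a quoted-string; bare tokens are used as-is.
        value = quoted === nothing ? bare : replace(quoted, r"\\(.)" => s"\1")
        params[key] = value
    end
    return params
end

p = parse_parameters("form-data; name=\"escaped\\\"quote\"; filename=unquoted.txt")
p["name"]       # "escaped\"quote"
p["filename"]   # "unquoted.txt"
```

The same function also handles an unquoted boundary such as `multipart/form-data; boundary=----abc`, since `-` is a valid token character.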

c42f · 2019-08-16T03:53:04Z

test/multipart.jl

+		catch exception
+			@error "" typeof(exception) exception
+			@test false
+		end


You don't need this try-catch (@test does this for you), and you should test the actual output from show (with a regex if necessary). E.g.:

@test sprint(show, HTTP.Multipart(nothing, IOBuffer("some data"), "plain/text", "", "testname")) == "some-string"

It would also be nice to cover a couple of other code paths in show.

c42f · 2019-08-16T03:56:47Z

test/multipart.jl

+
+	@testset "constructor" begin
+		@testset "don't allow String for data" begin
+			@test_throws MethodError HTTP.Multipart(nothing, "some data", "plain/text", "", "testname")


You probably don't need the nested testsets here (unless you're going to add a bunch more tests?).

c42f · 2019-08-16T04:02:23Z

test/parsemultipart.jl

+
+
+function generateTestBody()
+    IOBuffer("----------------------------918073721150061572809433\r\nContent-Disposition: form-data; name=\"namevalue\"; filename=\"multipart.txt\"\r\nContent-Type: text/plain\r\n\r\nnot much to say\n\r\n----------------------------918073721150061572809433\r\nContent-Disposition: form-data; name=\"key1\"\r\n\r\n1\r\n----------------------------918073721150061572809433\r\nContent-Disposition: form-data; name=\"key2\"\r\n\r\nkey the second\r\n----------------------------918073721150061572809433\r\nContent-Disposition: form-data; name=\"namevalue2\"; filename=\"multipart-leading-newline.txt\"\r\nContent-Type: text/plain\r\n\r\n\nfile with leading newline\n\r\n----------------------------918073721150061572809433--\r\n").data


Function name should be generate_test_body? Ditto for generateTestRequest, etc.

Also, can/should you use the @b_str macro here rather than the IOBuffer() trick?
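For illustration, the byte-string literal form might look like the sketch below. The body is a shortened, made-up fixture with a simplified boundary, not the one from the test file.

```julia
# b"..." processes escapes such as \r\n and yields the bytes directly.
body = b"--XYZ\r\nContent-Disposition: form-data; name=\"key1\"\r\n\r\n1\r\n--XYZ--\r\n"

body isa AbstractVector{UInt8}   # true

# The literal is a read-only wrapper around the string data; copy it when a
# mutable Vector{UInt8} is needed.
bytes = Vector{UInt8}(body)
String(bytes[1:5])               # "--XYZ"
```

This avoids reaching into the internal data field of an IOBuffer.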

amellnik · 2019-10-17T16:13:08Z

@acdupont Are you still working on this? If you would like, I can take a pass at the issues that @c42f mentioned.

acdupont · 2019-10-17T16:26:00Z

@amellnik I haven't had time to get back to this, it would be awesome if you picked it up.

c42f · 2019-10-22T01:20:51Z

That would be fantastic @amellnik. I don't have time to work on this myself (other than providing some review), but we're using a hacked-up version of this in our own application-specific repo and it has proved useful so far.

nlw0 · 2020-01-03T08:04:15Z

I'm very interested in this feature. How can anyone help? Is the next step now to go through the review?

acdupont · 2020-01-03T15:37:59Z

The review questions from @c42f on August 16th are the next items that need to be addressed.

pixel27 · 2020-01-04T03:52:25Z

I'm looking at implementing the changes mentioned by @c42f and have some questions.

First, about the regexes being defined at the top of the file: should I leave that? My own tests suggest that if you write r"" at the call site, the regex compile is only called once, but I'm not an expert.

Second, my read of https://tools.ietf.org/html/rfc2183 suggests that he was correct: we should handle the unquoted file name. That brings in a need to actually parse the line. Is there a minimum version of Julia that HTTP should work with? I ask because there is a findnext() method that takes a character, but that was only introduced in 1.3.

Lastly, I'm not exactly sure how to submit my changes to acdupont's pull request. I could issue a pull request to his repository, or create a new pull request to this repository from my repository, which was created from his repository. Or is there something else? I'm not very knowledgeable about GitHub, or git for that matter. I know enough to be dangerous, but that's all I claim to know.

pixel27 · 2020-01-04T03:54:58Z

Oh, one more thing: there was a comment that dropping the multipart, while correct, may make it difficult to debug. Should I add a @warn message? I see some use of @warn, but not around malformed HTTP requests. Is there another way I should indicate that the request does not appear valid?

c42f · 2020-01-07T08:15:50Z

> First about the regex being defined at the top of the file. Should I leave that?

I would change it; if the regexes are only used once, it should be clearer and just as efficient to move them to the use site.

> Is there a minimum version of Julia HTTP should work with?

You can find a hint in Project.toml: HTTP still claims compatibility with 0.7, so I think it's safe to say you need to be conservative for now. I think you may use the predicate form findnext(==(c), str, i) for c::Char.

> Lastly I'm not exactly sure how to submit my changes to acdupont's pull request....I could issue pull request to his repository or create a new pull request to this repository from my repository which was created of his repository...or is there something else?

I suggest it's simplest to create a new PR with your own repository and branch as the source, by pulling this branch from @acdupont's repository and building on top of it in your own repo. Just put a link to this discussion on the new PR with a quick mention of the history of the code.
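As a small illustration of the predicate form of findnext mentioned above, the sketch below walks a made-up Content-Disposition value from one `;` separator to the next. The header string is an assumption chosen for the example.

```julia
header = "form-data; name=\"key1\"; filename=plain.txt"

# ==(c) builds a predicate, so no Char-specific findnext method is needed.
first_semi = findnext(==(';'), header, 1)                # 10
next_semi  = findnext(==(';'), header, first_semi + 1)   # 23
findnext(==('#'), header, 1) === nothing                 # true

# Text between the two separators (nextind/prevind step by whole characters).
segment = header[nextind(header, first_semi):prevind(header, next_semi)]
strip(segment)   # "name=\"key1\""
```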

quinnj · 2020-02-26T06:08:51Z

This was finished/merged in #514

acdupont force-pushed the parseform branch from 2b9b9c1 to 416c7fd Compare July 4, 2019 16:27

acdupont force-pushed the parseform branch from 416c7fd to 9555b5d Compare July 23, 2019 17:49

Pietro Vertechi and others added 14 commits July 23, 2019 13:49

added parser for multipart form

2d58b37

switch to strings

eb686a1

check for emptiness

670bc11

added manual ways of checking for boundary and content

293e777

rename

d785425

reuse multipart type

ab1969a

name regexes

58c27bb

make sure to distinguish name and filename

ad8d453

seek to beginning of stream after writing data

6331293

remove remove_trailing function, remove CRLF-- delimiter by removing last 4 bytes of chunk

4455369

update regex to parse name correctly when a file is uploaded

54e6a37

add Base.seekstart(Multipart) function

a314040

fix indentation on writemultipartheader

90219f1

reformat runtests.jl so tests are easy to comment out

b220b8a

acdupont force-pushed the parseform branch 2 times, most recently from dae340a to 5a55f81 Compare July 23, 2019 18:06

acdupont added 2 commits August 3, 2019 10:38

add checks for filename isnothing

be8f52a

fix outer Multipart constructor

f9b7707

acdupont force-pushed the parseform branch from 736dbb9 to f9b7707 Compare August 3, 2019 14:50

c42f reviewed Aug 6, 2019


acdupont added 3 commits August 6, 2019 00:58

add test for parse_multipart_form, add RFC comments

a0c599c

remove unsafe_wrap

1ffaa2e

use isnothing

356cbcd

acdupont force-pushed the parseform branch from fcbc9ba to b020869 Compare August 15, 2019 04:01

refactor boundary parsing, add references to RFC 2046 and RFC 7578

3be6c66

acdupont force-pushed the parseform branch from b020869 to 3be6c66 Compare August 15, 2019 04:04

acdupont added 6 commits August 15, 2019 00:11

remove call to write io

d7d54b2

add test with text in file that starts with a newline

f34839a

change find_boundaries to find_multipart_boundaries

0e1cd91

fix parsing header from multipart

8585976

use @info printout in runtests

597ff4a

update multipart regexes

461a51f

make content-* regexes case insensitive; make content-type regex parse with and without semicolon

c42f reviewed Aug 16, 2019


quinnj mentioned this pull request Dec 6, 2019

WIP/RFC: added parser for multipart form #264

Closed

This was referenced Jan 10, 2020

WIP/RFC: added parser for multipart form with updates #495

Closed

WIP/RFC: added parser for multipart form with updates #496

Closed

quinnj closed this Feb 26, 2020
