WIP/RFC: added parser for multipart form #264

piever · 2018-07-01T19:35:17Z

This is an attempt to solve #263. I've tested with a simple form with a textbox and a file input (I updated a PNG) and it seems to work fine (still need to add tests though).

The "content" of each section add a newline and a couple of dashes at the end that I'm removing (\r\n--, though the type of newline may vary I guess), though maybe I'm not 100% correct in how I do so (on whether they should be removed at all): I'm taking out the match of r"[\r\n]*-*$"

piever · 2018-07-01T19:38:11Z

I'm also not sure whether I can be certain that req.body::Array{UInt8} or if it could also happen to a be a String or something else (and also whether I should be working with a String or Vector{UInt8}: string seems to cause problems with "continuation bytes").

piever · 2018-07-01T22:59:06Z

Actually, it seems that it's fine to work with strings, the continuation byte is only a problem in Julia 0.6

julia> v = [0x89, 0x00]
2-element Array{UInt8,1}:
 0x89
 0x00

julia> s = String(v)
2-byte String of invalid UTF-8 data:
 0x89
 0x00

julia> SubString(s, 1, 2)
ERROR: ArgumentError: invalid SubString index

whereas in Julia 0.7 this works fine,

quinnj

Overall looks like the right direction. Thanks for working on this! I left a few comments along the way.

quinnj · 2018-07-02T11:40:55Z

src/parsemultipart.jl

+    body = String(body)
+    dict = Dict(
+        "name" => String[],
+        "filename" => Union{String, Void}[],


Use Nothing instead of Void

I see, so I should be coding with just Julia 0.7 in mind?

Definitely. HTTP.jl has stayed master-compatible since 0.6 branched.

I would say that at this point, with 0.7 beta out, feature addition PRs for HTTP.jl could(should?) be 0.7-only. i.e. we should put julia 0.7 in REQUIRE soon anyway.

quinnj · 2018-07-02T11:42:43Z

src/parsemultipart.jl

+end
+
+function parse_multipart_form(req::Request)
+    m = match(r"multipart/form-data; boundary=(.*)$", req["Content-Type"])


This will throw a KeyError if the Request doesn't have a Content-Type header; it seems like we'd rather return nothing instead?

It seems that req[str] returns "" if the key is missing, so this should work fine.

quinnj · 2018-07-02T11:44:54Z

src/parsemultipart.jl

+end
+
+parse_multipart_body(body::AbstractArray{UInt8}, boundary) =
+    parse_multipart_body(String(body), boundary)


Note that in 0.7 this is destructive; i.e. w/ body as a Vector{UInt8}, it will be turned into a Vector{UInt8}(0) after the String is created. Not saying that's necessarily bad, but if someone tried to access body again after calling this, it would not be what they expected. We should at least have docs that clearly spelled this behavior out.

Agreed, at least the name should have a !, will try to run this code in 0.7 (I still haven't transitioned)

What's the alternative? String(copy(body)) ?

quinnj · 2018-07-02T11:48:57Z

src/parsemultipart.jl

+
+function parse_multipart_body(body, boundary)
+    body = String(body)
+    dict = Dict(


I feel like my personal preference would be to return a Vector{Multipart} where Multipart was a simple struct like:

struct Multipart name::String filename::Union{Nothing, String} contenttype::Union{Nothing, String} content::Vector{UInt8} end

It just seems like it would be a bit simpler on the server side if I could just call this function and iterate over the multparts, dealing w/ each as needed, instead of needing to do something like:

for i in 1:length(multiparts) nm = multiparts["name"][i] filename = multiparts["filename"][i] # etc.... end

There already is a Multipart struct which I'd like to reuse, except it stores the content as IO object, whereas for me a Array (actually SubArray) of UInt8 would be better. Should I simply write my SubArray to a IOBuffer and put that instead?

quinnj · 2018-07-02T11:51:06Z

src/parsemultipart.jl

@@ -0,0 +1,52 @@
+function parse_multipart_chunk!(chunk, d)
+    re = r"(\n){2,}|(\r){2,}|[\r\n]{4,}"


It might be a little clearer to put all the regexes as top-level const BOUNDARY_REGEX = r"(\n){2,}..., so it's clear they're const and they have more descriptive names.

quinnj · 2018-07-02T11:55:15Z

src/parsemultipart.jl

+        push!(d["contenttype"], v[3])
+    else
+        v = match(r"Content-Disposition: form-data; name=\"(.*)\"", description)
+        v == nothing && return


This v == nothing && return makes me uncomfortable because we've basically gotten into a bad state and we're just returning. I think we probably need to read up on the spec to see exactly what is and isn't required in terms of the Content-Disposition: form-data; tags and make sure we're following that. Is name _always required? What if just name and Content-Type are present? Seems like we want to make sure we're accounting for what the spec allows here.

quinnj · 2018-07-02T11:57:41Z

src/parsemultipart.jl

+    str = Parsers.nextbytes(re, body)
+    while Parsers.exec(re, str)
+        chunk = str[1:re.ovec[1]]
+        isempty(chunk) || parse_multipart_chunk!(chunk, dict)


What's the scenario where isempty(chunk) is true? It seems like if Parsers.exec(re, str) returns true, then it indeed matched a boundary, and yet this is saying that we didn't find the boundary? Maybe the regex needs a tweak here?

If re matches the beginning of the string, then I guess we'd have str[1:re.ovec[1]] == "" ? But I agree that this should actually not happen in practice and parse_multipart_chunk! should probably still work when given an empty string.

samoconnor · 2018-07-02T23:59:22Z

src/parsemultipart.jl

+        "content" => Any[]
+    )
+    re = Regex(boundary)
+    Parsers.exec(re, body) || return nothing


If theParsers module is only being used for exec, nextbytes etc then this file should just do include("parseutils.jl") instead of saying Parsers.exec(....

samoconnor · 2018-07-03T00:09:08Z

A few things it would be good to have:

Description (in docstrings) of what this PR implements in terms of the RFC(s).
Description of what parts of the RFC(s) that the implementation does not cover (if any).
Tests written to cover whatever behaviour is specified in the RFC(s).
User API docstrings and examples.
Quick comparison with how other HTTP libraries do this API.

piever · 2018-07-03T09:47:39Z

Yes, I still need to add tests and documentation.

I'm definitely not an expert on internet standards, is there some good source where I could see what I should expect to be present or not in a multipart form body?

samoconnor · 2018-07-03T10:55:48Z

Some starting points for specifications:
https://tools.ietf.org/html/rfc7231#section-3.1.1.4
https://tools.ietf.org/html/rfc2046#section-5.1
https://tools.ietf.org/html/rfc7578

piever · 2018-07-06T14:34:06Z

It's probably better to develop this only for 0.7 also because string processing seems to have change in ways I'm not sure I understand. When using:

HTTP.Parsers.exec(re, body)

I keep getting a ERROR: LoadError: PCRE.exec error: UTF-8 error: isolated byte with 0x80 bit set which I'm not sure how to get rid of. If needed I can upload a file with a saved body of an example request that generates this error. Is exec(re, body) the preferred way to separate the body in chunks according to the boundary or am I missing some better alternative?

EDIT: this is only a concern in the regexes to identify the "file content", which is the part where irregular characters may occur, so it would not be too bad to do that part by hand (it's just splitting by boundary first and then by two returns).

codecov-io · 2018-07-08T13:54:11Z

Codecov Report

Merging #264 into master will decrease coverage by 3.35%.
The diff coverage is 4.47%.

@@            Coverage Diff             @@
##           master     #264      +/-   ##
==========================================
- Coverage   71.84%   68.49%   -3.36%     
==========================================
  Files          35       36       +1     
  Lines        1964     2028      +64     
==========================================
- Hits         1411     1389      -22     
- Misses        553      639      +86

Impacted Files	Coverage Δ
src/HTTP.jl	`77.27% <ø> (ø)`	⬆️
src/parsemultipart.jl	`0% <0%> (ø)`
src/Parsers.jl	`96.72% <100%> (ø)`	⬆️
src/multipart.jl	`65.57% <66.66%> (-1.1%)`	⬇️
src/debug.jl	`14.28% <0%> (-7.15%)`	⬇️
src/compat.jl	`73.68% <0%> (-5.27%)`	⬇️
src/Servers.jl	`69.23% <0%> (-4.95%)`	⬇️
src/Handlers.jl	`91.11% <0%> (-4.45%)`	⬇️
src/ConnectionPool.jl	`74.22% <0%> (-3.56%)`	⬇️
src/Streams.jl	`83.19% <0%> (-0.85%)`	⬇️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5630bf4...a833cef. Read the comment docs.

piever · 2018-07-08T14:15:09Z

Still needs tests and docs but should have addressed most things. Now I no longer convert to strings before parsing but work directly with vectors of bytes, as otherwise it screws up if the UInt8 encoding of the file has invalid characters.

In terms of header handling, I though the following seemed best according to the RFC linked above (especially https://tools.ietf.org/html/rfc7578):

The header has to say Content-Disposition: form-data
The name field is necessary
Content-Type is not necessary: omitting it defaults to text/plain
filename is not necessary

The main difference with the Multipart type as implemented originally in this package are:

Multipart has no name field: I've added it and I think it should be considered in writemultipartheader function
Multipart assumes that data is of type IO: so far I've simply put my SubArray{UInt8} in a IOBuffer
Multipart assumes that filename is necessary whereas I don't think it should be, especially because some fields of the form may not be files
there is some inconsistency about how to handle missing fields: I've been using nothing and HTTP uses "", maybe it's simpler to always use "" esp. for backward compatibility
unlike Multipart struct I don't really have the field Content-Transfer-Encoding: and looking at https://tools.ietf.org/html/rfc7578 it seems like it's deprecated, so I'm not sure what to do about that

I'm curious to know what's the best way to proceed on the above points.

acdupont · 2019-06-11T20:38:06Z

Is this PR abandoned?

quinnj · 2019-12-06T17:54:48Z

Superseded by #427

added parser for multipart form

149ed89

switch to strings

c9db261

check for emptiness

17b2559

quinnj requested changes Jul 2, 2018

View reviewed changes

samoconnor reviewed Jul 2, 2018

View reviewed changes

Pietro Vertechi added 4 commits July 6, 2018 16:25

added manual ways of checking for boundary and content

a908fb5

rename

71f121b

reuse multipart type

cefd3b7

name regexes

483cd16

make sure to distinguish name and filename

a833cef

samoconnor mentioned this pull request Jul 10, 2018

Replace HTTP.download #273

Merged

acdupont mentioned this pull request Jun 17, 2019

WIP/RFC: added parser for multipart form #427

Closed

quinnj closed this Dec 6, 2019

		@@ -0,0 +1,52 @@
		function parse_multipart_chunk!(chunk, d)
		re = r"(\n){2,}\|(\r){2,}\|[\r\n]{4,}"

WIP/RFC: added parser for multipart form #264

WIP/RFC: added parser for multipart form #264

Uh oh!

Conversation

piever commented Jul 1, 2018

Uh oh!

piever commented Jul 1, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

piever commented Jul 1, 2018

Uh oh!

quinnj left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

samoconnor commented Jul 3, 2018

Uh oh!

piever commented Jul 3, 2018

Uh oh!

samoconnor commented Jul 3, 2018

Uh oh!

piever commented Jul 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-io commented Jul 8, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

piever commented Jul 8, 2018

Uh oh!

acdupont commented Jun 11, 2019

Uh oh!

quinnj commented Dec 6, 2019

Uh oh!

Uh oh!

piever commented Jul 1, 2018 •

edited

Loading

piever commented Jul 6, 2018 •

edited

Loading

codecov-io commented Jul 8, 2018 •

edited

Loading