Thinking about a new design for DataDeps (the old one wouldn't go away but the new one would be more flexible).
The notion would be to be based around paths (as in FilePaths.jl)
A DataDep would be a kind of Path. Completely resolved lazily
The DataDep would define the root of the path.
When asked for a file (or directory) within that DataDep root, it would look for a satisfier (matching selecting the first that matches), then execute it.
register(
DataDependency(
"Root", # i.e. name in current DataDeps.jl
"Message", # The message is only displayed, if the root local directory is empty,
)
)
DataDepPath()
Thinking about the next advancement of DataDeps.jl. @evizero are you around to brainstorm?
sure. I am just blankly staring at unicodeplot-tests (that i should have finished long ago) anyway
Some bits and half written notes are at https://github.com/oxinabox/DataDepsPaths.jl/
maybe to bring me up to speed. what are the current limitations you want to fix? or is it "just" a design cleanup
Key limitation is around the thing you brought up, when you want a datadep, but you only want some of the files. So like subdatadeps. There is another problem we ran into recently where we have a remote folder which contains records for every day, so like many GB of data, and we only want a few days (edited)
right
So the core change is that it would be path/(file/folder/subfolder) orientated, rather than folder orientated.
I'll take a look at MLDatasets real quick to jog my memory
This issue really oxinabox/DataDeps.jl#35
right i was thinking about SVHN2 right now as well
I’ve been looking at FilePaths.jl lately and I really like it.
So the notion I have is that a there would be a DataDepPath
say DataDepPath("Name/Foo/bar.csv")
And there would be a DataDepedency
with the name "Name"
and it would have as part of it a number of what I am calling satisfiers
(kinda like BinDeps “providers”).
And a satisfier would have a way to say what paths it should satisfy, and how to resolve it. (Where resolving it normally does something like the current fetch
,checksum
,postfetch
) (edited)
so what triggers what. if I say datadep"Foo/bar.csv"
it only requests the download of that file and returns that single path, while datadep"Foo"
downloads and returns all files. right?
is a satisfier linked to each individual file?
So I think satisfiers need to be linked to more than file. Say the file structure is
- AyeDir
- ayefile.csv
- beefile.csv
- BeeDir
- cee.txt
So datadep"Foo"
should result in the download of everything.
datadep"Foo/AyeDir"
should result in the download/creation of AyeDir/ayefile.csv
and AyeDir/beefile.csv
,
and this might be down by downloading a zip file for AyeDir
and extracting it.
right so datadep"Foo/AyeDir/ayefile.csv"
might do the same as its not really an individual file to download
Exactly.
datadep"Foo/AyeDir/ayefile.csv
should result in the creation of AyeDir/ayefile.csv
.
And in the previous senario that would be be done by the download of the AyeDir
zip, and its extraction, and so it would also creation of AyeDir/beefile.csv
.
right makes perfect sense, yes
So a satisfier needs to define a set of files it would satify.
I was thinking for full generality this might want to be a function, though I am not sure.
It being a list would be too annoying for things with lots of files.
Maybe Glob.jl or something glob like.
So you can say
Satisfier(glob"AyeDir/**", download_and_extract("http://example.com/Aye.zip")
I think I understand. I'll think on this during my lunch now
I’ll just get out the rest of the thought (feel free not to reply if you are eating)
let me complicated the previous structure.
- Foo
- AyeDir
- ayefile.csv
- beefile.csv
- BeeDir
- x.txt
- y.txt
- z.txt
- cee.txt
Let’s say individual satisfiers were defined
Satisfier("BeeDir/x.txt", download("http://example.com/x.txt")),
Satisfier("BeeDir/y.txt", download("http://example.com/y.txt")),
Satisfier("BeeDir/z.txt", download("http://example.com/z.txt")),
And you asked for datadep"Foo/BeeDir"
.
Now it needs to go an resolve all three satisfiers.
So the tricky part i think really is how to determine the match part of the Satisfier
,
to cover both the cases where one Satisfier
, solves multiple DataDepPath
s,
and the case where one DataDepPath
is requires resolving multiple Satisfier
s (edited)
Satisfier("...", () -> download("..."))
you mean, right?
I think that makes sense. do you have a full mock definition of a dataset somewhere?
no, not yet. Assembling a few of those would help
I do like it conceptually. the question is more what the boilerplate is
I’m not sure on how to be specifying matching for Satisfiers. (I’ll be back in 15)
but i think that matching logic should at least be doable. With that in mind it might be more informative to find a good syntax first (edited) Which may be done already, but I am having a hard time visualizing it
Hmm, I guess things would be simplified a lot it hierarchical structure was required upon Satisfiers and the paths they fill.
So that you couldn’t have for example a Satisfier
that provides just ["AyeDir/ayefile.csv", "BeeDir/x.txt]
.
and maybe you couldn’t have one that provides ["BeeDir/x.txt", "BeeDir/y.txt"]
without also providing "BeeDir/z.txt"
(edited)
My main point is that if you try to translate existing real world datadep definitions to the new form, you might encounter issues/questions you wouldn't otherwise think about and it seems like you are at that point in your design phase where the big picture makes sense
1 example in, MNIST>
idk where exactly checksums fit into this now.
I guess as part of a Resolver
.
https://github.com/oxinabox/DataDepsPaths.jl/blob/master/test/examples.jl
I think writing Satisfiers
as Pairs
is nice.
Oh for the use case I mentioned before, with the 1 file per day.
The remote URL actually needs to be generated from the local path…
Maybe that could be written as
local_basename -> Resolver(joinpath(rurl, local_basename)
Which would have nice symmetry for that dynamic case to the static case being written as
"Foo" => Resolver(joinpath(rurl, "Foo"))
is this an either/or? in the sense that either all are provided as "foo" => ...
or as a single function
I think static named folder paths can have dynamic (or static) named subfiles/subfolders but I am not sure if dynamic named subfolders could have subitems at all, I guess they could.
https://github.com/oxinabox/DataDepsPaths.jl/blob/master/test/examples.jl#L56
how does BeeDir know to unpack?
should this be a nested Resolver
?
oh wait. I think I am confused
Because BeeDir
is a Vector
on the RHS
right its just a "virtual" folder more or less not a zip file
It would be a real folder on disk is what I am thinking.
makes sense
But it doesn’t correspond to a Zip file, but to a a list of other ( I guess you could call them) sub-satifiers (edited)
I still think (like outlined in the issue you linked to that i opened) it would be nice if they unpack stage was a bit more priviledged
in the sense that imagine I have a dataset that when downloaded is a bunch of mat files. that is the native format. I would like to say what the local dowloaded mat file would be called so that i can check if someone put them there manually. if they aren't there I would like to download them. then I would like to process them into a different format with a different local file name This almost works right now except there is no way to manually download the raw files and then have DataDeps just do the post_fetch
I think it would almost be possible to implement this if only the local file name way known. like
"AyeDir" => Resolver("http://example.com/Aye.zip", unpack, "Aye.zip")
i guess in most cases one could deduce the file name anyway
from the url
What if code take makes partial products, was also handled as a Satisfier
? (maybe we can get rid of the seperation of fetch
from post_fetch
)
E.g
"Foo.mat" => Resolvers("http://eg.com/Foo.mat"),
"Foo.jld" => Resolver(() -> convert_mat2jld(DataDepPath("Name/Foo.mat"))
mhm intruiging
maybe
"Foo.jld" => Resolver(DataDepPath("Name/Foo.mat"), convert_mat2jld)
replied to a thread:
Right, I see now.
I think I like the idea of using Satisifiers
for it better. Feels less adhoc.
So one would have a Satisfier for AyeDir
that unpacks Aye.zip
,
and a Satisfier for Aye.zip
that downloads it.
Interestly, it is safe to delete Aye.zip
as part of satisfying AyeDir
, if one wants. If you expect that noone will want to Aye.zip
once they have AyeDir
. (edited) View newer replies
One could maybe be using a dynamic satisfier to do something like
"Foo.mat" => Resolvers("http://eg.com/Foo.mat"),
"Bar.mat" => Resolvers("http://eg.com/Bar.mat"),
jldfile -> Resolver(DataDepPath("Name/" * replace(jldfile, ".jld"=>".mat"Foo.mat")), convert_mat2jld)
(edited)
so the function is basically the default if nothing else matches?
Yeah that was what I was thinking. One problem with dynamic satisfiers, is that it is impossible to resolve them as part of resolving their parent folder. So if we had
"Foo" => filename -> Resolver(joinpath(rurl,"Foo", filename)),
then doing download(DataDepPath("Foo"))
,
would not actually be able to download anything.
right well to play devils advocate, using it doesn't have to make sense at every possible place. the utility at those that do is pretty good though
Which 2 features? (This has been a big discussion. So it would be good to me to know what you see as the two big take aways.)
the other two aspects feel like great features though From a thread in #dataToday at 2:13 PMView reply
- to support predownloaded raw files that then only cause the post_fetch_method to be executed
- to support downloading only required files if requested, or the whole dataset if requested, and still have it all in one place
- to support downloading files with remote name dependent on (runtime) local name (dynamic satisfiers) Ok, I am going to dump this discussion into a markdown file on github. Thanks a bunch
One can also basically treat multistep experiments as a datadeppath.
That can be resumed at any stage.
DataDepPathRegistration(
"sci/raw"=>()-download("http://raw.example.com")
"sci/model" => ()->bson("model.bson", train(model, (datadep"sci/raw/train.csv"))
"sci/results" => ()->save("res.csv", evaluate(BSON.load(datadep("sci/model/model.bson", datadep"sci/raw/test.csv")
"sci/analysis" => ()->generate_plots(".", datadep"sci/results/res.csv")
)
If you change one part you can just delete all later parts and then call
display("text/svg", datadep"sci/analysis/confusionmatrix.svg")
and it would all regenerate.