Skip to content

Custom parser for dates. #18000

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Custom parser for dates. #18000

wants to merge 1 commit into from

Conversation

simonbyrne
Copy link
Member

This replaces the default date parser with a custom-coded version. It is approximately 50x faster on the following snippet:

datestr = map(string,range(DateTime("2016-02-19T12:34:56"),Dates.Millisecond(123),100_000))
testDateTime(v) = map(DateTime,v)

@time testDateTime(datestr)

@simonbyrne
Copy link
Member Author

simonbyrne commented Aug 12, 2016

cc: @ViralBShah @quinnj @JeffBezanson

@kshyatt kshyatt added parser Language parsing and surface syntax dates Dates, times, and the Dates stdlib module labels Aug 12, 2016
@iamed2
Copy link
Contributor

iamed2 commented Aug 13, 2016

cc: @omus

@omus
Copy link
Member

omus commented Aug 16, 2016

Something I would like to see regarding date parsing/formatting would be to parameterize the DateFormat type with the format being used. Something similar toDateFormat{Symbol("yyyy-mm-ddTHH:MM:SS")}. The advantage of this would be the ability to have specialized parsers for commonly used formats.

@omus
Copy link
Member

omus commented Aug 16, 2016

Since this change is about performance there should probably be some performance tests added to https://github.com/JuliaCI/BaseBenchmarks.jl regarding parsing dates.

@JeffBezanson
Copy link
Member

Yes, it would be nice to have a way to hack in parsers for various common formats as they come up. Either dispatch or just if/else statements that return DateFormats that know how to call the right function.

@simonbyrne
Copy link
Member Author

I should have pointed to the discussion on the open issue #15888.

There is Dates.DateFormat for custom formats, but that's slow. Really the date parsing stuff probably needs quite a large overhaul. This was more of a stop-gap (and potentially back-portable) way to handle the default case.

@ViralBShah
Copy link
Member

Isn't the parser tag for julia parser stuff?

@JeffBezanson
Copy link
Member

There is Dates.DateFormat for custom formats, but that's slow.

Which is why I want, eventually, the ability to manually write parsers for specific formats. I'm thinking of something like:

function DateFormat(str)
    if str == "yyyy/mm/dd"
        return special_fast_parsing_function
    else if str == ...
        ...
    else
        return slow_fallback
    end
end

Isn't the parser tag for julia parser stuff?

Yes.

@JeffBezanson JeffBezanson removed the parser Language parsing and surface syntax label Aug 17, 2016
@omus
Copy link
Member

omus commented Aug 17, 2016

Which is why I want, eventually, the ability to manually write parsers for specific formats

I also want this. My only issue with the if/else approach is that it won't be extensible by other packages. I'm specifically thinking about TimeZones.jl and the z/Z format.

@tkelman tkelman added the potential benchmark Could make a good benchmark in BaseBenchmarks label Aug 17, 2016
@simonbyrne
Copy link
Member Author

I've added an AbstractDateFormat and a FastDateFormat (suggestions for better names welcome), which should make this more easily extensible.

@@ -258,14 +250,15 @@ backslash. The date "1995y01m" would have the format "y\\ym\\m".
DateTime(dt::AbstractString,format::AbstractString;locale::AbstractString="english") = DateTime(dt,DateFormat(format,locale))

"""
DateTime(dt::AbstractString, df::DateFormat) -> DateTime
DateTime(dt::AbstractString, df::AbstractDateFormat) -> DateTime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change needs to be made in the rst docs too

@simonbyrne
Copy link
Member Author

Updated the docs, any other comments?


Construct a `DateTime` by parsing the `dt` date string following the pattern given in
the [`DateFormat`](:func:`Dates.DateFormat`) object. Similar to
`DateTime(::AbstractString, ::AbstractString)` but more efficient when repeatedly parsing
similarly formatted date strings with a pre-created `DateFormat` object.
"""
DateTime(dt::AbstractString,df::DateFormat=ISODateTimeFormat) = DateTime(parse(dt,df)...)
DateTime(dt::AbstractString,df::DateFormat) = DateTime(parse(dt,df)...)
DateTime(str::AbstractString) = DateTime(str,ISODateTimeFormat)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be dt::AbstractString to match the above signature?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems reasonable, updated.

@simonbyrne simonbyrne force-pushed the sb/fastdates branch 2 times, most recently from ede295b to 00a120c Compare September 7, 2016 22:29
@ViralBShah
Copy link
Member

Should we merge this now, as we figure out improvements going forward?

@tkelman
Copy link
Contributor

tkelman commented Oct 18, 2016

Let's make sure CI still passes, there have been a month of changes on master since it last ran here.

edit: I think the osx failure dlopen(libopenblas.dylib, 1): image not found is due to the branch's .travis.yml being a little out of date, and not a real issue

edit2: win64 appveyor failure is just #16555, also nothing new

@stevengj
Copy link
Member

Bump?

@stevengj
Copy link
Member

On an unrelated note, I just noticed that we are missing parse(DateTime, str) = DateTime(str)

@JeffBezanson
Copy link
Member

We should put in an explicit check for better-supported format strings like "yyyy-mm-dd", so you get good performance if you make the "mistake" of asking for the format by string. Avoiding the performance landmine is worth some ugliness.

@stevengj
Copy link
Member

(A benchmark of this was added to BaseBenchmarks.)

@stevengj stevengj removed the potential benchmark Could make a good benchmark in BaseBenchmarks label Oct 19, 2016
@simonbyrne
Copy link
Member Author

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here. cc @jrevels

@iamed2
Copy link
Contributor

iamed2 commented Oct 19, 2016

Looks like no date parse benchmark was run?

@tkelman
Copy link
Contributor

tkelman commented Oct 19, 2016

I think new benchmarks need to be manually deployed, it's not automatic from merging PR's

@jrevels
Copy link
Member

jrevels commented Oct 19, 2016

The tuning process is running now, I'll retrigger benchmarks here once it finishes.

EDIT: Tuning process failed, so it'll be a little while longer. There was a bug in a recently added benchmark, I fixed it in JuliaCI/BaseBenchmarks.jl/pull/35. Once tests pass and I merge the fix, I'll restart the tuning process...

@jrevels
Copy link
Member

jrevels commented Oct 19, 2016

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@omus
Copy link
Member

omus commented Nov 14, 2016

I've done some iterating on this proposal and have some code which doesn't need @chk1 and improves performance further. I'll try to turn it into a PR soon.

@JeffBezanson
Copy link
Member

@omus Any update, or should we just merge this?

@omus
Copy link
Member

omus commented Dec 7, 2016

@JeffBezanson thanks for the bump. I'll try to get out my proposal tonight. If I can't we can merge this first.

@tkelman
Copy link
Contributor

tkelman commented Dec 29, 2016

this is superseded by / contained in #19545, right?

@omus
Copy link
Member

omus commented Dec 29, 2016

@tkelman correct

@tkelman tkelman closed this Dec 29, 2016
@tkelman tkelman deleted the sb/fastdates branch December 29, 2016 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dates Dates, times, and the Dates stdlib module performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.