-
Notifications
You must be signed in to change notification settings - Fork 210
Fix "I'm feeling lucky" search to include all crates #1251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
-- this query too often. | ||
-- it depends on the percentage of crates with > 100 stars | ||
SELECT DISTINCT 1 + trunc(random() * params.max_id)::INTEGER AS id | ||
FROM params, generate_series(1, 500) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard-coding 500 makes me a little nervous. Could we generate this live based on production info? Or does that slow down the query too much?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generating that value live slows down the query too much, I valued speed higher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the number depends on the ratio of crates >100 stars
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while thinking about this, I have another idea (but not likely it's faster)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if it works out :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didn't help 😢 was too slow :)
@jyn514 thanks for the review! I'll check the comments in detail in the evening today |
The metrics were mostly because I was curious, I wanted to put a counter on the I'm feeling lucky button that reported how many times it was actually used |
src/web/releases.rs
Outdated
// since there is a chance of this query returning an empty result, | ||
// we retry a certain amount of times, and log a warning. | ||
for _ in 0..20 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems strange to use a fallible query, is there not a way to make it always work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kixiron I tried (I think) 5-10 different approaches, the ones that always work are either too slow or involve things like a recursive CTE (which is in in the end similar to retrying here, while this is more readable IMHO).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just ran some numbers: current query runs between 2 and 8 ms (~)
and delivers a random crate directly in most cases (50 misses for 10k queries),
the repetitions are more or less a safeguard.
Also I could not test on a production dataset so I don't know the amount of gaps in the id-span on the crate-table (and the production github stars without backfilling everything locally)
@Kixiron wouldn't we get that from the access logs too? or don't we have these? I'll check other examples and see if it's easy to add |
@Kixiron @jyn514 I (hopefully) answered your questions and added two commits with the prosed log-statement changes and metrics. When wanted I could likely produce something better readable, but slower ( ~100 or even 200 ms ) , or we could also just add the metrics to see if the feature is actually used or could be removed :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't we get that from the access logs too? or don't we have these?
The access logs are not integrated with the grafana logging, and also don't work when running locally. Plus I prefer to have as little depending on nginx as possible because it's not tested at all and doesn't go through review.
or we could also just add the metrics to see if the feature is actually used or could be removed :)
I would be interested in that.
When wanted I could likely produce something better readable, but slower ( ~100 or even 200 ms ) ,
I would prefer to have the faster query I think, the searches take long enough already. Thanks for offering though :)
Also I could not test on a production dataset so I don't know the amount of gaps in the id-span on the crate-table (and the production github stars without backfilling everything locally)
Let me know how to test this, happy to runs some queries. I did run the 'redirect_to_random_query' once with EXPLAIN ANALYZE
and it took less than a millisecond, great job :)
-- this query too often. | ||
-- it depends on the percentage of crates with > 100 stars | ||
SELECT DISTINCT 1 + trunc(random() * params.max_id)::INTEGER AS id | ||
FROM params, generate_series(1, 500) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if it works out :)
To validate what I understand: I would create a separate PR only adding the metrics. Then we watch how often this is actually used and we decide to drop or fix? @jyn514
I just wrote a small (ugly) tool: https://gist.github.com/syphar/949cdb2ee297ca5695bad179cfd99a6e With that you should be able to play around with the numbers :) |
I think this is a good change as-is, I would be happy merging now and we can add the metrics PR later.
Thanks! I tested this in prod and it never came up missing:
So I don't think we have to worry. Is this ready to merge? It looks good from my end :) |
After the comment by @Kixiron I already added metrics, as I understood how it would have to be done (see e9bfc47 ) .
From my side It would be ready, except we want more tests, different metrics, or any other change |
@jyn514 just discovered by accident that my testing-gist used an even smaller number or random IDs (100) and still always delivered results. But 500 is fast enough, so I guess we're fine with the current number for some time. |
This fixes the I'm feeling lucky search ( #789 ) to do an efficient search over all creates, not only the first 280 ones.
I was searching and testing through many different ways to get a random record, and found this to be the best for our dataset. (see also this stackoverflow answer for some overview).
I tested this locally with a dummy-dataset with 600 crates >100 stars, and out of 10k of these queries there were only <50 which needed a second try. In the statement execution time was <10ms all the time.
Other approaches felt too complex in code, were too slow, or were too slow randomly.
I'm happy to implement any needed changes.
In #789 @Kixiron was mentioning metrics to be added, while I'm not sure which metrics are needed.