Performance - Cache expensive url operations #203
Conversation
…g by keeping fragments separated from URL (and avoid redundant frag/defrag). Conflicts: jsonschema/tests/test_benchmarks.py issue python-jsonschema#158: Use try-finally to ensure resolver scopes_stack empty when iteration breaks (no detectable performance penalty). * Replace non-python-2.6 DefragResult with named-tuple. * Add test-case checking scopes_stack empty. Conflicts: jsonschema/tests/test_validators.py jsonschema/validators.py
resolver = RefResolver.from_schema(schema)
with resolver.resolving("#/a") as resolved:
    self.assertEqual(resolved, schema["a"])
-with resolver.resolving("foo://bar/schema#/a") as resolved:
+with resolver.resolving("http://bar/schema#/a") as resolved:
I had to change these to http because (at least on py3) the behaviour of urljoin changes based on the scheme. With an unknown scheme it was not replacing the fragment at all.
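The scheme-dependent behaviour is easy to see directly. A quick illustration (on CPython 3; the set of schemes that get relative resolution lives in the module-level `urllib.parse.uses_relative` list):

```python
from urllib.parse import urljoin

# urllib.parse only performs relative resolution for schemes it knows
# about, so an unknown scheme like foo:// causes the base URL to be
# ignored entirely and the fragment-only ref comes back unchanged.
print(urljoin("http://bar/schema", "#/a"))  # http://bar/schema#/a
print(urljoin("foo://bar/schema", "#/a"))   # #/a  (base is ignored)
```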
Yeah there's some global that defines weirdly which schemes it's fine with. I don't recall why this worked before then?
I believe it worked before because it wasn't joining the ref to the rest of the url. The url was being fetched first, then the ref was being used to locate the section in the document. Now the full url is being stored as the scope instead.
Ah, OK, makes sense.
ae2f281 to 613cf3e
@@ -38,6 +38,22 @@ def __repr__(self):
        return repr(self.store)


class Cache(object):
I'd prefer to use something pre-baked for caching. functools.lru_cache with the backport module, say?
That should work. This doesn't really need the "lru" portion, but with a sufficiently large value for maxsize, that should be fine.
Apparently both functools32 and repoze.lru have this. I'll look into which ones work on py2.6.
On a related note, do you think I should add a new kwarg to RefResolver to disable caching entirely (since there is already an option to disable caching of remote refs)?
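Since ref resolution tends to touch a bounded set of URLs, a generous maxsize makes the LRU eviction a non-issue in practice. A minimal sketch of what the pre-baked decorator could look like here (illustrative names, not the actual patch; `functools32` provides the same `lru_cache` API on py2):

```python
from functools import lru_cache
from urllib.parse import urljoin

# Wrap the expensive pure function in an LRU cache; with a generous
# maxsize the "lru" eviction rarely kicks in for typical schemas.
cached_urljoin = lru_cache(maxsize=1024)(urljoin)

print(cached_urljoin("http://example.com/schema", "#/definitions/a"))
print(cached_urljoin.cache_info())  # a second identical call would be a hit
```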
Maybe a similar solution to what we do for the remote cache would work, with the cache just being an optional argument which constructs an appropriately configured lru_cache, but disabling caching would be providing a functools.lru_cache with maxsize=0 or whatever.
But yeah, sounds like a decent idea. Nice catch.
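A sketch of that suggestion (the class and attribute names here are hypothetical, not jsonschema's actual RefResolver API): the cache is any decorator with `lru_cache`'s interface, and passing one built with `maxsize=0` disables caching without a dedicated kwarg.

```python
from functools import lru_cache

class Resolver(object):
    """Illustrative only -- not jsonschema's actual RefResolver signature."""

    def __init__(self, cache=lru_cache(maxsize=1024)):
        # "cache" is any decorator with lru_cache's interface; passing
        # lru_cache(maxsize=0) disables caching entirely.
        self._urljoin = cache(self._urljoin_uncached)

    @staticmethod
    def _urljoin_uncached(base, ref):
        return base + ref  # stand-in for the real urljoin work

resolver = Resolver(cache=lru_cache(maxsize=0))
print(resolver._urljoin("http://x/schema", "#/a"))  # http://x/schema#/a
print(resolver._urljoin.cache_info().currsize)      # 0: nothing is cached
```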
So it looks like functools32 is py2.7-only (I actually need 2.6 still). repoze.lru works in 2.6, but it's not small (https://github.com/repoze/repoze.lru/blob/master/repoze/lru/__init__.py, ~373 lines). I notice that currently there are no install_requires dependencies.
I could add an install_requires for version < 3.2, or I could vendor it in jsonschema/vendor/repoze/lru.py. I usually prefer to add the dependency instead of vendoring it. Do you have a preference here?
Ech, annoying. OK, yeah, adding the dependency is fine with me.
It'd be nice to only do so where needed if there's a close enough interface between that and functools.lru_cache.
Cool, the interface is identical, I'll only install repoze.lru for python2.
Adding the dependency required a small change to where __version__ is defined. When trying to run setup, setup.py imports __init__.py, which imports validators.py, which imports compat.py, which tries to import repoze.lru, which will fail because the virtualenv doesn't exist yet. So I created a version.py, moved __version__ there, imported it from both setup.py and __init__.py, and used exec() to read it without importing it.
I believe this is a pretty common way of handling this problem.
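The pattern described above, roughly (the file path and version string here are stand-ins): setup.py reads the version file with exec() instead of importing the package, so import-time dependencies like repoze.lru are never triggered.

```python
import os
import tempfile

# Stand-in for jsonschema/version.py, which contains only the version line.
with tempfile.TemporaryDirectory() as tmp:
    version_path = os.path.join(tmp, "version.py")
    with open(version_path, "w") as f:
        f.write('__version__ = "2.4.0"\n')

    # setup.py side: exec() the file's contents rather than importing the
    # package, so no third-party imports run at setup time.
    namespace = {}
    with open(version_path) as f:
        exec(f.read(), namespace)

    version = namespace["__version__"]

print(version)  # 2.4.0
```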
Sadly, with lru_cache the benchmark is back to ~21ms (it's ~19.5ms with my custom Cache object). I guess that is good enough though. I understand wanting to use something that already exists.
I've been meaning to switch to http://vcversioner.readthedocs.org/en/latest/ which handles that process for us and isn't overcomplex like some other tools that do the same sort of thing.
Good to hear that we can swap those.
We're making the cache a parameter though right? So you should be able to drop in something faster if the 2ms are relevant for your app?
(Obviously I understand wanting stuff to be fast out of the box too :P)
Ah yes, good point. I was considering just making the maxsize a parameter, but making the decorator function itself a parameter makes sense.
Edit: Oh, it might have even just been that the value I was using for maxsize was too low for my testcase. Increasing it seems to restore the performance.
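The maxsize effect is easy to see with lru_cache's cache_info() behaviour: when the working set of refs is larger than maxsize, every lookup evicts something and the cache never hits. A toy demonstration (not the jsonschema benchmark):

```python
from functools import lru_cache

def count_calls(maxsize, refs, rounds=10):
    calls = [0]

    @lru_cache(maxsize=maxsize)
    def resolve(ref):
        calls[0] += 1
        return ref.upper()  # stand-in for the expensive resolution work

    for _ in range(rounds):
        for ref in refs:
            resolve(ref)
    return calls[0]

refs = ("#/a", "#/b", "#/c")
print(count_calls(maxsize=2, refs=refs))  # 30: every lookup misses and evicts
print(count_calls(maxsize=8, refs=refs))  # 3: one miss per ref, then all hits
```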
Left some more comments inline as I'm trying to review this in the quick couple of minutes I can grab :D Appreciated again, thanks a lot for pushing this forward.
Thanks for the review and feedback! I've got a couple open questions (how to include the lru_cache dep, and how composition would work), then I'll make the changes.
Ok, I think I have all the changes in.
Awesome. I'm out on business, but I'll have another look and try to merge on Sunday, thanks again!
OK! Think this roughly lgtm -- some of the assertions that were removed I'd like to add back since they test backwards compatible things we're forced to leave, but I can add them back in after merging, which now I hopefully will have time for tomorrow. Thanks again!
Awesome, looking forward to it. Thank you!
Started merging (in https://github.com/Julian/jsonschema/tree/perf_cache_resolving) but adding back in the assertions I was referring to actually fails, so there's some more backwards incompat. I'll try to give it a bit more time early this week, unless you see immediately what the correct way to add
Ah, I see what you were saying. My latest commit 7241db0 fixes it.
OK managed to find slightly more time, and made a few small changes -- needed to add back backwards compatibility for RefResolvers without the new methods, and I changed the parameters to take caches, rather than parameters they can use to make caches. I think I'd like to move So we're close :), sorry it's taking so long.
What would call
The same place (no not added to validator) -- I'm imagining just putting the caches on Validator (in
OK never mind, I tried it and like this way better :) Merged. Sorry again that this took so long, much appreciated!
Thanks for all the work getting this merged! Since there is a bug fix included in this change (for python3), would it be possible to do a bug fix release (v2.4.1)?
I'd like to cut 2.5, but Travis is using a build of ubuntu that's failing the test suite on PyPy, which I have to figure out. I can certainly try to cut a bugfix release, but which bug fix is in here again?
I guess I never opened a ticket for it. The issue is described in #201 (which I discarded because this fixed it in a better way).
Ah the test flakiness... OK lemme see if I can make a branch with just the fix for that on it, hopefully tomorrow, if not sooner if I can just get 2.5 out as well.
@Julian Reckon you'll have a chance to cut the bug fix release? (Or is 2.5 close)?
I hope so, sorry :( I'm back from my trip, so hoping to have some spare
@Julian wanted to follow up if the patch could go in any immediate releases?
OK, release is out. Sorry for the delay here all. Feel free to follow up with any issues.
Based on the work in #182 (I cherry-picked a few commits) for issue #158
Replaces #202
Also fixes the problem described in #201
I rebased over many of my changes in #202. Many of them were unnecessary after adding caching.
On cpython 2.7 I'm seeing the benchmark runtime go from ~56ms to ~22ms. I don't expect any significant improvement in pypy, since it's already going to have this cached.
The primary change here is to add two caches to RefResolver, one for urljoin, and another for resolving a reference. Removing the context manager from ref() validation (the latest commit) gets this down to ~19ms.
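A condensed sketch of the shape of the change (illustrative names throughout; the real RefResolver also does scope management and remote fetching): two lru_cache-backed helpers, one for URL joining and one for resolving a fragment pointer within a document.

```python
from functools import lru_cache
from urllib.parse import urldefrag, urljoin

class CachingResolver(object):
    """Toy resolver with the two caches described above (not jsonschema's API)."""

    def __init__(self, base_uri, document, cache=lru_cache(maxsize=1024)):
        self.base_uri = base_uri
        self.document = document
        # Cache 1: joining a scope with a ref is pure string work.
        self.urljoin_cached = cache(urljoin)
        # Cache 2: walking the document for a fragment pointer.
        self.resolve_fragment_cached = cache(self._resolve_fragment)

    def _resolve_fragment(self, fragment):
        part = self.document
        for key in fragment.lstrip("/").split("/"):
            if key:
                part = part[key]
        return part

    def resolve(self, ref):
        # "url" would drive remote fetching in the real resolver.
        url, fragment = urldefrag(self.urljoin_cached(self.base_uri, ref))
        return self.resolve_fragment_cached(fragment)

resolver = CachingResolver(
    "http://example.com/schema", {"a": {"type": "integer"}},
)
print(resolver.resolve("#/a"))  # {'type': 'integer'}
```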