Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cythoned UniqueRepresentation #14054

Closed
simon-king-jena opened this issue Feb 3, 2013 · 107 comments
Closed

Cythoned UniqueRepresentation #14054

simon-king-jena opened this issue Feb 3, 2013 · 107 comments

Comments

@simon-king-jena
Copy link
Member

UniqueRepresentation provides a comfortable way to create unique parent structures, and automatically provides a hash and certain comparison methods.

Problems:

  • It relies on a metaclass, namely ClasscallMetaclass and thus has to be a Python class. That's bad for speed.
  • It combines two features, namely a cache and unique instance behaviour. But in many some cases we want a cache so that two distinct instances can still be equal.

Here, I suggest to

  1. separate the two features, by creating a new class sage.unique_representation.CachedRepresentation as one base of UniqueRepresentation.
  2. create a new cdef class sage.misc.fast_methods.WithRichCmpById, that provides hash and rich comparison, as expected for unique instance behaviour.

Apply

Depends on #14017
Depends on #6495
Depends on #14182
Depends on #14040
Depends on #14011

CC: @nthiery

Component: performance

Keywords: cython UniqueRepresentation

Author: Simon King

Reviewer: Travis Scrimshaw

Merged: sage-5.8.beta4

Issue created by migration from https://trac.sagemath.org/ticket/14054

@simon-king-jena
Copy link
Member Author

@simon-king-jena
Copy link
Member Author

comment:1

Note that the patch also needed to change some auxiliary class CartanType_simple_finite, which is used to unpickle some old data. It used to inherit from object, but for an incompatibility of Cython types it has to inherit from UniqueRepresentation instead.

For the timings, I use MatrixSpace, which inherits from UniqueRepresentation:

sage: isinstance(MatrixSpace(GF(3),2,3), UniqueRepresentation)
True

With sage-5.6.rc0 plus #14017:

sage: %time L = [MatrixSpace(GF(3),n) for n in range(10000)]
CPU times: user 1.94 s, sys: 0.05 s, total: 2.00 s
Wall time: 2.00 s
sage: %time D = dict(zip(L,range(len(L))))
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.01 s
sage: MS = MatrixSpace(GF(3),10000)
sage: MS in D
False
sage: timeit("MS in D", number = 10^6)
1000000 loops, best of 3: 552 ns per loop
sage: MS = L[5000]
sage: MS in D
True
sage: timeit("MS in D", number = 10^6)
1000000 loops, best of 3: 540 ns per loop

Adding the patch from here:

sage: %time L = [MatrixSpace(GF(3),n) for n in range(10000)]
CPU times: user 1.96 s, sys: 0.04 s, total: 2.00 s
Wall time: 2.00 s
sage: %time D = dict(zip(L,range(len(L))))
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.01 s
sage: MS = MatrixSpace(GF(3),10000)
sage: MS in D
False
sage: timeit("MS in D", number = 10^6)
1000000 loops, best of 3: 187 ns per loop
sage: MS = L[5000]
sage: MS in D
True
sage: timeit("MS in D", number = 10^6)
1000000 loops, best of 3: 176 ns per loop

Hence, the time drops by 2/3. I did run make ptest successfully. Needs review!

@simon-king-jena
Copy link
Member Author

comment:2

Cc to Nicolas as the original author of UniqueRepresentation.

@simon-king-jena
Copy link
Member Author

comment:3

I just did some experiments with the tp_hash function from Python's C-API: It is 1/3 faster than a cythoned hash. I'll try to do similar things with the comparison methods.

@simon-king-jena
Copy link
Member Author

comment:4

While we are at speeding up UniqueRepresentation, I think we should actually refactor it. Namely, UniqueRepresentation serves two purposes, namely (1) caching, and (2) comparison and hash by identity.

Some classes misuse UniqueRepresentation by only using feature (1), overriding comparison in a way that violates the unique representation condition. See my monologue at sage-devel.

I suggest to split the two features.

@simon-king-jena
Copy link
Member Author

comment:5

FWIW, I finished a more experimental and rather intrusive version of UniqueRepresentation.

Idea:

  • Create a cdef function, that results in faster C code than what Cython makes of

    def __hash__(self):
        return id(self)
    
  • Override tp_hash with this function, for every instance of UniqueRepresentation. Likewise for tp_richcompare.

It remains possible to override those parts of comparison that can't be decided by looking at identity (such as "a<b" if "a is not b").

It would be great to just define the fast hash and comparison for UniqueRepresentation itself, but alas it seems that subclasses forget these settings, whether they override __hash__ or not. See the comments on sage-devel.

I still think it is a good idea to separate UniqueRepresentation from CachedRepresentation, but I am not so sure about enforcing the uniqueness behaviour, without the possibility to override it---this wouldn't be pythonic...

Let the patchbot do some work:

Apply trac14054_fast_methods.patch

@simon-king-jena
Copy link
Member Author

comment:6

From sage-devel, I understand that people think that separating the cache feature from the uniqueness feature of UniqueRepresentation is a good idea.

However, in my old patch, I was enforcing uniqueness behaviour for instances of UniqueRepresentation. This isn't pythonic. Hence, I do differently in the new patch version.

Question

In the current implementation, inheritance from UniqueRepresentation will overload rich comparison (==, >=, !=, etc.) inherited from a base class, but it will not overload comparison (cmp). Do you think that both should be overloaded?

Patchbot reported two failures with the previous patch version. I guess that's because of an additional dependency. So, as soon as I have a decent internet connection, I'll download the latest beta, and rebase on top of it.

Apply trac14054_fast_methods.patch

@simon-king-jena
Copy link
Member Author

comment:7

Other question: Does provide_hash_by_id really make sense to have?

@simon-king-jena
Copy link
Member Author

comment:8

The updated patch should make the coverage script happy, but three tests with 5.7.rc0 currently fail. I am downloading 5.7.rc0 now.

Apply trac14054_fast_methods.patch

@sagetrac-jlopez
Copy link
Mannequin

sagetrac-jlopez mannequin commented Feb 19, 2013

comment:9

I am not qualified to look over this ticket, but glancing at it I spotted this

308	        complete = complete = self.complete() 

which looks like a typo.

@simon-king-jena
Copy link
Member Author

comment:10

Replying to @sagetrac-jlopez:

I am not qualified to look over this ticket, but glancing at it I spotted this

308	        complete = complete = self.complete() 

which looks like a typo.

Yes, thank you for spotting it! Will remove it when I rebase the patch against 5.7.rc0.

@tscrim
Copy link
Collaborator

tscrim commented Feb 19, 2013

comment:11

Hey Simon,

Let me know when this is ready for review again.

Best,

Travis

@tscrim
Copy link
Collaborator

tscrim commented Feb 19, 2013

Reviewer: Travis Scrimshaw

@simon-king-jena
Copy link
Member Author

comment:12

The patch should now be ready for review.

Two questions that I'd like to have addressed:

  1. I introduce provide_hash_by_id, but I don't use it. Shall it be deleted?
  2. Is it enough to override the rich comparison methods? Currently, one can have hash(a)!=hash(b) but cmp(a,b)==0. Does this violate the contract of hash functions? Or is it enough that hash(a)!=hash(b) implies a!=b?

Pathbot:

Apply trac14054_fast_methods.patch

@simon-king-jena

This comment has been minimized.

@simon-king-jena
Copy link
Member Author

comment:13

To answer the question about the hash contract:

Apparently dictionaries use the rich comparison methods and not cmp:

sage: class Bla(object):
....:     def __hash__(self):
....:         return 2
....:     def __cmp__(self, other):
....:         return 0
....:     def __eq__(self, other):
....:         return self is other
....:     def __ne__(self, other):
....:         return self is not other
....:     
sage: a = Bla()
sage: b = Bla()
sage: a==b
False
sage: hash(a)==hash(b)
True
sage: cmp(a,b)
0
sage: D = {a:1}

Since hash(a)==hash(b), a and b belong to the same hash bucket of the dictionary. And comparison by cmp tells that they are equal. But we still have:

sage: D[b]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-8-6cf1ee6b63d5> in <module>()
----> 1 D[b]

KeyError: <__main__.Bla object at 0x5064310>

Hence, dictionaries use comparison by ==, and thus my patch does the right thing, IMHO.

@nbruin
Copy link
Contributor

nbruin commented Feb 19, 2013

comment:14

Replying to @simon-king-jena:

  1. I introduce provide_hash_by_id, but I don't use it. Shall it be deleted?

Yes, you should. I'm pretty sure changing type->tp_hash after calling PyType_Ready() is not supported by the Python C-API (see the inheritance problems we saw).

  1. Is it enough to override the rich comparison methods? Currently, one can have hash(a)!=hash(b) but cmp(a,b)==0.

Does this violate the contract of hash functions? Or is it enough that hash(a)!=hash(b) implies a!=b?

I think Python requires hash(a)!=hash(b) implies not(a==b), which Python does not enforce to be the same thing. In any case, I'm pretty sure that "rich comparison" is fully exhausted before trying to use __cmp__, which is only there because of backward compatibility.

There's another peculiarity for membership and lookup in python:

sage: class neq(object):
....:     def __eq__(self,other):
....:         return False
....:     def __ne__(self,other):
....:         return True
....:     
sage: a=neq()
sage: V={a}
sage: a in V
True
sage: sage: [a == v for v in V]
[False]

They explicitly mention this in the documentation: because hash collisions are rare, they first test "is" for dict lookup before trying "==" (it's cheaper).

@simon-king-jena
Copy link
Member Author

comment:15

Replying to @nbruin:

Replying to @simon-king-jena:

  1. I introduce provide_hash_by_id, but I don't use it. Shall it be deleted?

Yes, you should.

OK.

Does this violate the contract of hash functions? Or is it enough that hash(a)!=hash(b) implies a!=b?

I think Python requires hash(a)!=hash(b) implies not(a==b),

Right, that's a difference.

In any case, I'm pretty sure that "rich comparison" is fully exhausted before trying to use __cmp__, which is only there because of backward compatibility.

Nope! See my post on sage-devel.

If you have a non-cdef class, then it seems __richcmp__ is simply ignored (but methods like __eq__ have precedence over __cmp__). In a cdef class, __richcmp__ has priority over __cmp__ when deciding binary relations such as ==, <, <=, etc. But for cmp(,), __cmp__ will be used, even if there is __richcmp__!

So, the decisive question is: Do dictionaries compare stuff by cmp(a,b)==0 or by a==b?

It is the latter, and thus making __richcmp__ compatible with __hash__ was the right thing to do.

For now, it needs work, because I will delete the C API hack, and apparently some script of the patchbot has a complaint...

@simon-king-jena
Copy link
Member Author

Attachment: trac14054_fast_methods.patch.gz

Separate cache and uniqueness.

@simon-king-jena
Copy link
Member Author

comment:16

Done!

It seems that the patchbot complains about an increased startup time. How did that happen? Is it because of a slightly longer mro for UniqueRepresentation?

Patchbot:

Apply trac14054_fast_methods.patch

@nbruin
Copy link
Contributor

nbruin commented Feb 20, 2013

comment:17

What's your rationale to define 'a < b' etc. when a is b? I agree that it's likely that your answers are what any ordering override will want, but Python does not mandate any particular properties of semantics of these operators. So I think the proper thing is to just not do anything there. It's going to be a rare occasion anyway, so a little speed gain won't be very noticeable anyway.

The behaviour you are implementing is much easier to explain if you don't override those cases: Then you can just say that WithRichCmpById provides __hash__, __eq__ and __ne__, period.

I think WithEqualityById would make a better conceptual name that is less dependent on the implementation, by the way. Your motivation here seems to be as much to provide conceptual units in preference of just technical implementation tools as to provide more efficient implementations. Someone writing a Python class may not know what "richcomp" is, and doesn't need to.

@tscrim
Copy link
Collaborator

tscrim commented Feb 20, 2013

comment:18

Hey,

Replying to @nbruin:

What's your rationale to define 'a < b' etc. when a is b? I agree that it's likely that your answers are what any ordering override will want, but Python does not mandate any particular properties of semantics of these operators. So I think the proper thing is to just not do anything there. It's going to be a rare occasion anyway, so a little speed gain won't be very noticeable anyway.

So how would you want comparisons in the complex numbers to behave? In python, they return an error:

sage: complex(2+2*i) < complex(2-2*i)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-ce2e269601f6> in <module>()
----> 1 complex(Integer(2)+Integer(2)*i) < complex(Integer(2)-Integer(2)*i)

TypeError: no ordering relation is defined for complex numbers

However in sage they are not so well-behaved (see #14088):

sage: CC(1+2*i) < CC(2-2*i)
True

so there's no prescribed sage way.

As far as the actual comparison code, I think it would be better to do something like

if m == 2:
    return True
elif m == 3:
    return False
else:
    return m == 1 or m == 5

but this is a micro-optimization, so feel free to ignore this.

The behaviour you are implementing is much easier to explain if you don't override those cases: Then you can just say that WithRichCmpById provides __hash__, __eq__ and __ne__, period.

I think WithEqualityById would make a better conceptual name that is less dependent on the implementation, by the way. Your motivation here seems to be as much to provide conceptual units in preference of just technical implementation tools as to provide more efficient implementations. Someone writing a Python class may not know what "richcomp" is, and doesn't need to.

I believe the response to both of these points depends on how we want to do comparison between objects (when there is not a [natural] ordering).

Last thing for now, shouldn't we issue a deprecation warning for FastHashable_class since it has changed locations?

Thanks,

Travis

@nbruin
Copy link
Contributor

nbruin commented Feb 20, 2013

comment:19

The issue is that if someone wants to define

class C(WithEqualityById):
    def __cmp__(self,other):
        return -1

they will find that

sage: a=C()
sage: a < a
False

which might surprise them, because they thought that WithEqualityById only affected (in)equality testing and let ordering comparisons fall through to __cmp__ if implemented. With the current code, this is not the case if the two arguments are identical.

There have been extensive discussions about what inequality testing SHOULD be in sage and for the most part the implementation here is staying clear of the topic (which I think is a good thing). There's just this one small optimization that's probably usually OK, but if left out makes for much more predictable behaviour.

@simon-king-jena
Copy link
Member Author

comment:20

Replying to @nbruin:

What's your rationale to define 'a < b' etc. when a is b? I agree that it's likely that your answers are what any ordering override will want, but Python does not mandate any particular properties of semantics of these operators.

Python does not. But I think "comparison by identity" implies a semantics. And that is: a < a is False, a <= a is True.

The behaviour you are implementing is much easier to explain if you don't override those cases: Then you can just say that WithRichCmpById provides __hash__, __eq__ and __ne__, period.

OK.

I think WithEqualityById would make a better conceptual name that is less dependent on the implementation, by the way.

Good idea.

Replying to @tscrim:

Last thing for now, shouldn't we issue a deprecation warning for FastHashable_class since it has changed locations?

I think I introduced it. But when? Or what ticket? I don't think anyone used it.

@tscrim
Copy link
Collaborator

tscrim commented Mar 5, 2013

comment:70

FWIW I also get the same error as the patchbot.

@simon-king-jena

This comment has been minimized.

@simon-king-jena
Copy link
Member Author

comment:71

Now it should work(T).

Apply trac14054_fast_methods-5.8.patch trac_14054-fix-rigged-list.2.patch

@tscrim
Copy link
Collaborator

tscrim commented Mar 5, 2013

comment:72

This fails to apply for me:

patching file sage/combinat/rigged_configurations/rigged_configurations.py
Hunk #3 FAILED at 930
1 out of 3 hunks FAILED -- saving rejects to file sage/combinat/rigged_configurations/rigged_configurations.py.rej
patch failed, unable to continue (try -v)
patch failed, rejects left in working dir
errors during apply, please fix and refresh trac_14054-fix-rigged-list.patch

@simon-king-jena
Copy link
Member Author

comment:73

Argh. Jeroen forgot to update the list of dependencies, it seems.

@simon-king-jena
Copy link
Member Author

Changed dependencies from #14017, #6495, #14182, #14040 to #14017, #6495, #14182, #14040, #14011

@simon-king-jena
Copy link
Member Author

comment:74

Rebasing really has been trivial. The error was about a missing newline at the end of the file...

Apply trac14054_fast_methods-5.8.patch trac_14054-fix-rigged-list.2.patch

@simon-king-jena
Copy link
Member Author

comment:75

Note that the tests in rigged_configurations do pass for me, with

trac11490-coercion_tutorial.patch
14182_whitespace.patch
trac_13618-rings_doc_real-ts.patch
trac_13618-rings_doc_complex-ts.patch
trac_13618-rings_doc_others-ts.patch
trac_14040_housekeeping.patch
trac_14040_repr_option.patch
trac_14040_doctest.patch
trac_14040_review.patch
trac14054_fast_methods-5.8.patch
trac_12313-mono_dict-combined-random-sk.patch
trac_13387-combined.patch
trac_13387-guard_gc.patch
trac_14159_weak_value_triple_dict.patch
trac_14159_use_cdef_get.patch
trac12951_cached_extension.patch
trac_14214-cython_homset.patch
trac_14214-backup_cache.patch
trac_14063-remove_cc_compositions-ts.patch
trac_13688-finite_sets_cardinality_override-ts.patch
trac_14138.patch
trac_14138-doctests.patch
trac_13605-partition_options-ts.patch
trac_14011-Sphinx_roles-fh.patch
trac_14054-fix-rigged-list.2.patch

applied on top of sage-5.8.beta0. But they fail if I have some more patches applied.

In other words, I get the impression that sorting of the output of RiggedConfigurations.list() is broken.

@simon-king-jena
Copy link
Member Author

comment:76

Arrgh. Can it be that sage.combinat.rigged_configurations.rigged_configuration_element.RiggedConfigurationElement has no sorting implemented? Travis, you are the author. How can one implement comparison of these things?

Jeroen, what can we do, since apparently sorting the output is not available here? I would say that it is not the fault (for a reasonable definition of "fault") of my patch that the test in rigged_configurations is unstable.

@tscrim
Copy link
Collaborator

tscrim commented Mar 5, 2013

comment:77

Would it be better to give them an (somewhat unnatural) ordering or to change the doctest?

Actually...why not just make it #random with a note saying the output order can be random since there is no ordering?

@jdemeyer
Copy link
Contributor

jdemeyer commented Mar 5, 2013

comment:78

Replying to @simon-king-jena:

I would say that it is not the fault (for a reasonable definition of "fault") of my patch that the test in rigged_configurations is unstable.

True, but with doctests there is no alternative but to shoot the messenger (i.e. the person adding the patch causing the doctest failure always gets the blame)

@simon-king-jena
Copy link
Member Author

comment:79

Replying to @tscrim:

Would it be better to give them an (somewhat unnatural) ordering or to change the doctest?

Actually...why not just make it #random with a note saying the output order can be random since there is no ordering?

I would be fine with making it random, and test that the length of the output (or another easy invariant) is correct.

Is there a mathematically meaningful ordering? From the code, I guess that rigged configuration elements rely on list of tableaus, and if the tableaus can be ordered then one might have some kind of lexicographical ordering, perhaps. But this should be on a different ticket (and perhaps would not make sense at all), so that I would prefer to change the test.

Jeroen, what is your opinion on that matter?

@tscrim
Copy link
Collaborator

tscrim commented Mar 5, 2013

comment:80

Thankfully I'm here to take the bullet. :P

I don't know of any mathematical meaning of sorting rigged configurations. I don't think there is one...

The reason why there's different (unsorted) outputs is because at the backend of the iterator in TransitiveIdeal, there is a (python) set working. If this is changed to a list, there should be no real loss of speed, but the iteration should be completely consistent across all machines (and for anything else which uses TransitiveIdeal for iteration and does not have a valid sorting).

@simon-king-jena
Copy link
Member Author

Mark output of RiggedConfigurations.list() random and test against cardinality, to make it reproducible on different machines.

@simon-king-jena
Copy link
Member Author

comment:81

Attachment: trac_14054-fix-rigged-list.2.patch.gz

I have updated the patch, marking the test random and testing that the length of the list coincides with the cardinality.

Apply trac14054_fast_methods-5.8.patch trac_14054-fix-rigged-list.2.patch

@simon-king-jena
Copy link
Member Author

comment:82

PS: With the new patch, the tests will also pass (for me, at least) when I apply the remaining patches in my queue...

@tscrim
Copy link
Collaborator

tscrim commented Mar 5, 2013

comment:83

Everything looks good to me as well. I don't know why the patchbot just timed-out so I gave it a kick.

For patchbot:

Apply: trac14054_fast_methods-5.8.patch trac_14054-fix-rigged-list.2.patch

@simon-king-jena
Copy link
Member Author

comment:84

Hooray, and the patchbot doesn't complain either, except for the fact that a new module is loaded during start-up. But this can hardly be avoided in this case, and the patchbot does not observe the slightest increase in startup time :-)

@nthiery
Copy link
Contributor

nthiery commented Mar 6, 2013

comment:85

Replying to @simon-king-jena:

I have updated the patch, marking the test random and testing that the length of the list coincides with the cardinality.

Apply trac14054_fast_methods-5.8.patch trac_14054-fix-rigged-list.2.patch

Just for the record: in similar situations, I occasionally used the following idiom:

sage: sorted(..., key=str)
...

Which makes the output deterministic (well, assumming repr is ...) even if there is no natural total ordering on the objects to be sorted.

@nthiery
Copy link
Contributor

nthiery commented Mar 6, 2013

comment:86

I forgot to say: great job getting this done! Thanks!

@jdemeyer
Copy link
Contributor

jdemeyer commented Mar 6, 2013

comment:87

Replying to @simon-king-jena:

Can this really not go into 5.8, if the tests pass with the second patch?

Well, it didn't pass tests :-)

But seriously: is there any particular reason this patch should be in sage-5.8? I'd rather not take the risk and postpone it for sage-5.9.beta0.

@simon-king-jena
Copy link
Member Author

comment:88

Replying to @jdemeyer:

Replying to @simon-king-jena:

Can this really not go into 5.8, if the tests pass with the second patch?

Well, it didn't pass tests :-)

Well, it does, with the current version of the second patch.

@jdemeyer
Copy link
Contributor

jdemeyer commented Mar 7, 2013

Merged: sage-5.8.beta4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants