Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exists() method for Checkpoints class #142

Closed
KrishnaPG opened this issue Jun 10, 2015 · 15 comments
Closed

exists() method for Checkpoints class #142

KrishnaPG opened this issue Jun 10, 2015 · 15 comments

Comments

@KrishnaPG
Copy link

Presently the Checkpoint class has list_checkpoints() method that is being used to verify if a check point exist for a given file or not, such as here:

                # One checkpoint should always exist for notebooks.
                if not self.checkpoints.list_checkpoints(path):
                    self.create_checkpoint(path)

In some scenarios listing all checkpoints is too costly affair when we are just interested in only checking the existence of atleast one.

Perhaps the Checkpoint class should have checkpoint_exists_for(path) kind of mechanism.

(PS: Originally posted here)

@ssanderson
Copy link
Contributor

@KrishnaPG I'm curious under what circumstances this is too expensive? In the default Jupyter ContentsManager, list_checkpoints is basically just os.listdir. Do you have an alternate Checkpoints implementation in mind/in progress?

@KrishnaPG
Copy link
Author

Thank you @ssanderson

  • In many cases, checking for existence is O(1) while listing all is O(n).
  • Besides, existence checking results could be cached (quite longer) than content-listing results (which gets invalidated faster). In some case, we have seen more 'checkpoints' than 'autosaves' for notebooks for few users.
  • Also, usually list_checkpoints() implementation will have internal sorting included in the computation to return results temporally ordered, which is O(n logn). (The sorting might happen at app-level or db-level).

In scenarios where the older checkpoints get moved to secondary (slower) storage, there is no way to tell the list_checkpoints method that we are not really interested in all the results (including the secondary storage ones), but just one.

We do not see it as impossible for a user to create 25+ checkpoints throughout the life of a notebook, where as his active "working set" would be only last 5 or 10 checkpoints. A typical Redis implementation would hold these 'working set' in memory, while pushing all the older ones to secondary storage.

Ofcourse, One can certainly circumvent those problems by always trying to have all the checkpoints always available in the main memory ready to be listed etc., but then it looks like having an "exists()" method is more simpler and semantically valid solution.

@KrishnaPG
Copy link
Author

Another point worth noting: the exists method need not be a "must to implement" method.

It could be just an ordinary method with default implementation that falls back to list_checkpoints such as:

class Checkpoint :

    def exists_checkpoints_for(path):
        return not not list_checkpoints(path)

    def list_checkpoints(path):
        raise exception "must be implemented"

This way, it is backward compatible and also does not put pressure on implementor who do not want to take advantage of that exists method.

@takluyver
Copy link
Member

OK, I think this is a reasonable addition. @KrishnaPG , do you want to make a pull request adding it. There might be some bikeshedding over naming.

bool() is a neater way of doing not not.

@ssanderson
Copy link
Contributor

@takluyver vote on naming would be def exists(path):.

👍 for bool over not not.

@takluyver
Copy link
Member

I think it might be better to be a bit more specific, because there are some different possible 'exists' questions you might ask of the checkpoints implementation:

  • Does any checkpoint exist for this file? (what we're talking about here)
  • Does this specific checkpoint (id) exist?
  • Does a checkpoint exist meeting some set of criteria, e.g. created since ?

But it's not a strong preference, and I don't currently have a name I like better for it.

@KrishnaPG
Copy link
Author

Thanks @takluyver and @ssanderson .

Forgive me for the not not.

It was not my intention to promote it - its just that not being familiar with python or its type-conversion system, I used the not not merely to demonstrate the concept. It was just one of the old techniques from the days of x86 assembly / C to force-cast the objects to bools.

But I see from your messages that python has this bool() to take care of the conversion - in which case it is better to stick with bool().

As for the naming, I have no preference either. Whichever works for you, is fine with me. But as @takluyver pointed, the semantics certainly need to be clear for the user. For example, something on the lines of has_checkpoints() is also not far from the concept.

BTW, does python support polymorphic functions (where multiple methods can have same function name with different parameter types). In that case multiple methods with same name such as below may satisfy @takluyver 's criteria:

has_checkpoints(for_this_path) :
     #some code

has_checkpoints(for_this_id):
    # some code

has_checkpoints(for_this_condition):
    # some code

If not, what is the typical way Python achieves the above kind of mechanism?

One simple way would be - if the path, id etc. were objects themselves on which we can apply `has_checkpoints'. For example,

    def has_checkpoints(pathString):
          return PATH_Class(pathString).has_checkpoints();

where PATH_Class is a light-weight wrapper that creates a path object that has specialized has_checkpoints() defined on it from the user supplied pathString.

The PATH_Class then be defined as a Class Trait ( mixin in pyton?) for the checkpoint class.

@ssanderson
Copy link
Contributor

BTW, does python support polymorphic functions (where multiple methods can have same function name with different parameter types). In that case multiple methods with same name such as below may satisfy

This isn't possible in python without adding some extra machinery (see below for an example). If you define two functions with the same name in a given namespace, the second just overwrites the first.

If not, what is the typical way Python achieves the above kind of mechanism?

You can write something like:

def some_function(obj):
    if isinstance(obj, str):
        # do str things
    elif isinstance(obj, int):
        # do int things
    elif isinstance(obj, float):
        # do float_things
    ...

I'd say this is generally considered a bit of an anti-pattern though unless you've got a good reason for wanting to accept multiple input types. For cases where it makes sense (e.g. when you're implementing an addition function on various number-like types), there are libraries like @mrocklin's multipledispatch that let you specify type signatures as decorators. functools in recent versions of Python 3 stdlib also supplies a singledispatch decorator that implements this behavior for functions that accept a single argument.

For this particular case, dispatching on type for an exists function wouldn't be useful anyway, because the possible inputs would either by checkpoint_id or path, both of which are strings. If for some reason it became important to want to check existence for a path and for a string, I'd say the right thing to do would be to just create separate methods, say checkpoint_exists_with_path(path) and checkpoint_exists_with_id(id). A bit more verbose, but immediately clear to everyone what's happening.

@KrishnaPG
Copy link
Author

Sorry for the late reply.

So, what is the action plan? Shall we go ahead with the name def exists(path) ? Or some alternative (say, has_checkpoints or has_history etc..)?

Also, I would like bring the issue of adding 'descriptions' metadata to the files and checkpoints. I know the issue has been put in halt at #130 due to performance reasons.

However, for alternate content manager classes (such as mongodb / postgres based managers) being discussed here, which are not file-based, reading the metadata info (such as the descriptions) happens anyway as part of reading the filename and related info. So no additional overhead.

Primarily my concern is about 'checkpoints'. Listing 10 different checkpoints with different timestamps in the menu is not so effective as listing checkpoints with their 'commit messages'.

So, my question is: assuming that performance overhead is not a problem (let the content /checkpoint manger worry about it), what is the change required to add these descriptions for files and checkpoints (architecture wise)?

For poor-performers like the file-based content manager, they can always skip reading the description and return "" (empty string) for the description, if they want.

What are your views?

@Carreau
Copy link
Member

Carreau commented Jul 1, 2015

  1. I would write an API that work (when possible) on my objects. like def exists([path]) or def exists([checkpoints_ids]) If we expose for 1 people will want to play with many,

  2. Beyond the programatic API, we should think at the REST api too.

  3. would list_checkpoint taking a upper and lower limit works ? so that you can if needed "Page" the requests.

It kinds of depends on what people want to do.

With extensions, I suppose it is still possible to write that externally, as another API endpoint for the time beeing, and iterate on it. We are about to release 4.0, so we will not change API right now.

@KrishnaPG
Copy link
Author

Not a problem @Carreau You guys know how to do it right and when to do it. So, will leave it in your good hands and hope to see it implemented sometime soon.

Meanwhile I will try to keep adding my views/comments/suggestions on other list of features /wishlist from architecture stand point (which is what I do better), as we keep encountering them in our own use of this product.

I thank you and all your other project members for taking time to review issues such as this and others, and also for providing this good opensource work/project for community. It is helpful.

@Carreau
Copy link
Member

Carreau commented Jul 1, 2015

You guys know how to do it right and w

Not always, and our bandwidth is pretty limitted. It would be (relatively) easy to get more experience than one of us on a specific part of the project. I would expect @ssanderson to be the more experienced on the content manager side.

Draft implementation are alway welcome, they at least often show issue in implementation. We almost never get things right the first time.

I thank you and all your other project members for taking time to review issues such as this and others, and also for providing this good opensource work/project for community. It is helpful.

No problem, I know we are not always super responsive and we can seem reluctant sometime, maybe out of extra caution.
If you want to try to make that as an extension to play with it, we can give you pointers.

@minrk minrk added this to the 5.0 milestone Sep 11, 2015
@minrk minrk modified the milestone: 5.0 Jan 13, 2017
@JamiesHQ
Copy link

@gnestor : was this implemented in 5.0? What are the next steps? Thanks!

@gnestor
Copy link
Contributor

gnestor commented Apr 25, 2017

@KrishnaPG Are you still interested in contributing this feature? Marking as backlog for now...

@gnestor gnestor added this to the Backlog milestone Apr 25, 2017
@gnestor
Copy link
Contributor

gnestor commented Dec 2, 2017

Closing due to inactivity...

@gnestor gnestor closed this as completed Dec 2, 2017
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants