BACKEND: Blank nodes should not match existing data #124
I just asked on the RDF/JS gitter room about this. I don't remember reading anything about blank node collisions in the low-level spec but, ideally, this is something that all implementations should address in a uniform way.

As for the rest of the quadstore API, what would be the scope of our "collision avoidance" strategy? Collision avoidance in a single …?

I suspect that this would be a good use-case for something like what Node.js has done for http agents and TCP connection pooling (https://nodejs.org/dist/latest-v14.x/docs/api/http.html#http_new_agent_options and https://nodejs.org/dist/latest-v14.x/docs/api/http.html#http_http_request_url_options_callback). We could modify all write methods to receive an optional, agent-like object. In your example, omitting this object or passing two different instances would lead the store to process those labels into non-colliding ones. However, passing the same instance to both calls would result in the same labels resolving to the same blank nodes.
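For reference, the http.Agent pattern referred to above looks roughly like this; it is shown purely as an analogy for the shape of the proposed optional object, not as anything quadstore-specific:

```js
const http = require('http');

// A shared, optional agent instance: requests that pass the same instance
// share its socket pool; requests that omit it fall back to the default.
const agent = new http.Agent({ keepAlive: true, maxSockets: 2 });

http.get('http://example.org/a', { agent }, (res) => res.resume());
http.get('http://example.org/b', { agent }, (res) => res.resume());
http.get('http://example.org/c', (res) => res.resume()); // no agent: global default
```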
Can't you have internal IDs for blank nodes like Jena does?
@namedgraph thank you for pitching in! Yes, I think we'll end up storing both some sort of internal id plus the original label, preserving the latter while avoiding collisions through the former. However, I'd like to provide a mechanism allowing for re-utilization of the same internal id across different writes when needed. For example, it is likely that importing from a stream will require blank nodes with the same label to end up having the same internal id, even though (in our case) importing from a stream happens through separate writes of one or multiple quads. In this case, we would need to find a way for quadstore to remember both the label and internal id of previously written blank nodes, so that encountering the same label in a different write would lead to the same internal id being used.
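A minimal, self-contained sketch of the kind of bookkeeping being described here; every name in it is illustrative rather than an actual quadstore internal:

```js
const { DataFactory } = require('n3'); // any RDF/JS data factory would do here

// Illustrative only: a scope remembers which internal label each original
// label was mapped to, so repeats within the scope stay consistent while
// different scopes can never collide.
class Scope {
  constructor() {
    this.prefix = `s${Math.random().toString(36).slice(2, 10)}`;
    this.mapping = new Map(); // original label -> internal label
  }
  resolve(originalLabel) {
    if (!this.mapping.has(originalLabel)) {
      this.mapping.set(originalLabel, `${this.prefix}-${this.mapping.size}`);
    }
    return DataFactory.blankNode(this.mapping.get(originalLabel));
  }
}

const scopeA = new Scope();
const scopeB = new Scope();
console.log(scopeA.resolve('b0').equals(scopeA.resolve('b0'))); // true: same scope, same internal id
console.log(scopeA.resolve('b0').equals(scopeB.resolve('b0'))); // false: different scopes never collide
```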
Is it necessary to remember the original label? I don't think Jena does that. Nor any other store that I can remember really.
True, we only need to remember original labels insofar as we're looking for them while performing further write operations. We don't need to return them, as doing so could lead to the very collisions we're trying to avoid.
Personally I think the first incremental step is for scope to exactly track API calls, with no changes. So:
IMHO it's dangerous to separate atomicity from scope – you could end up in a big pickle with errors and crashes.
This would break the fairly common use-case of streaming quads from a file into a store. I do share your concern WRT separating atomicity from scope but I think that
@gsvarovsky In your example, you're actually creating named nodes instead of blank nodes; those terms should instead become actual blank node terms.
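The original snippets are not reproduced here, but the gist of the correction, with illustrative terms, is the difference between an IRI that merely looks like a blank node label and an actual BlankNode term:

```js
const { DataFactory } = require('n3');
const { namedNode, blankNode } = DataFactory;

// An IRI that happens to start with "_:" is still a NamedNode term...
const looksBlank = namedNode('_:b1');
console.log(looksBlank.termType); // 'NamedNode'

// ...whereas an actual blank node term has to be created with blankNode().
const isBlank = blankNode('b1');
console.log(isBlank.termType); // 'BlankNode'
```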
But even then you'll probably get just a single quad as the test result. This does seem like an expected outcome to me, though. You could easily fix your case by calling the following to create blank nodes with unique labels:
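The exact snippet isn't shown here; with an RDF/JS data factory the idea is simply to call blankNode() without a label, which mints a unique one on every call (illustrated below with the N3.js factory):

```js
const { DataFactory } = require('n3');

// Calling blankNode() with no argument makes the factory generate a fresh,
// unique label each time, so separate writes can no longer collide on a label.
const b1 = DataFactory.blankNode();
const b2 = DataFactory.blankNode();
console.log(b1.value === b2.value); // false: each call gets its own label
```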
Alternatively, a higher-level insertion mechanism like SPARQL update could be used, which takes care of bnode scoping.
Whoops! Corrected by edit. Yes, the outcome is the same. If every call to … This would require a new API feature, like an inverse of @jacoscaz's …

Perhaps another approach would be to offer an explicit transaction API like Jena's. A transaction is both atomic and defines a document scope for blank nodes. Internally this would use a sustained Leveldown …

However, this still doesn't fully solve the file streaming case, if it's a big file that doesn't fit in memory and so must be processed in multiple transactions. For this case I think skolemization is the best approach: the file reader replaces each blank node with a skolem IRI.
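A sketch of that skolemization step (the base IRI and helper names are made up; RDF 1.1 suggests the /.well-known/genid/ scheme for skolem IRIs):

```js
const { DataFactory } = require('n3');
const { namedNode, quad } = DataFactory;

// Illustrative skolemization helper: blank nodes are replaced by IRIs under
// the ".well-known/genid/" scheme. A per-document component in the base
// keeps blank nodes from different documents apart.
const skolemize = (term, base) =>
  term.termType === 'BlankNode' ? namedNode(base + term.value) : term;

const skolemizeQuad = (q, base = 'http://example.org/.well-known/genid/doc1/') =>
  quad(skolemize(q.subject, base), q.predicate, skolemize(q.object, base), q.graph);
```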
Thank you @gsvarovsky, @namedgraph and @rubensworks for pitching in! Based on your arguments, I would be inclined to do the following:
The final result would be something like the following:

```js
const scope = scopingLibrary.createScope();
const scopedQuads = scope.process([ /* RDF/JS quads */ ]);
store.multiPut(scopedQuads);
```

```js
const scope = scopingLibrary.createScope();
store.putStream(scope.createProcessingStream(rdfjsQuadStream));
```

What do you think?
The scoping library could even be designed in such a way as to be able to bootstrap scopes from a store and serialize scopes to RDF/JS quads to be persisted to the store atomically with the quads being processed.
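One way to picture that, with a made-up vocabulary and nothing that is actually part of quadstore: each (original label, internal id) pair in a scope's mapping becomes a quad in a dedicated graph, so the mapping can be written in the same batch as the data and read back later.

```js
const { DataFactory } = require('n3');
const { namedNode, literal, quad } = DataFactory;

// Hypothetical vocabulary for persisting a scope's mapping as quads.
const SCOPE_GRAPH = namedNode('http://example.org/scopes');
const MAPS_TO = namedNode('http://example.org/vocab#mapsTo');

// mapping is a Map of original label -> internal id.
const scopeToQuads = (scopeId, mapping) =>
  [...mapping].map(([originalLabel, internalId]) =>
    quad(
      namedNode(`http://example.org/scopes/${scopeId}#${encodeURIComponent(originalLabel)}`),
      MAPS_TO,
      literal(internalId),
      SCOPE_GRAPH,
    ));
```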
Interesting. Perhaps go even further: should scope be a first-class citizen in rdfjs? In my recent travels I have been frustrated by this concept not being well defined (of course, I could just have missed some important reference). Is it worth raising this with the wider community? https://www.w3.org/2011/rdf-wg/wiki/User:Azimmerm/Blank-node-scope
As a concrete use-case, for consideration: as an application, I generate a JSON-LD document containing a sub-structure defined as a @list. I process the document using a JSON-LD processor, which generates an RDF list containing blank nodes. I use quadstore to, erm, store the quads. Then I restart and do the same operations with a new JSON-LD document in a new session. The same blank node labels are generated, and the list data from the first document is corrupted. (I am just starting to work on list support in m-ld, and I may force skolemization, so I may not need any special support from quadstore. I will keep you updated if any definite requirements arise. Thanks, as always, for your collaboration!)
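For concreteness, a sketch of that scenario with jsonld.js; the document, context and terms are invented for illustration:

```js
const jsonld = require('jsonld');

const doc = {
  '@context': { items: { '@id': 'http://example.org/items', '@container': '@list' } },
  '@id': 'http://example.org/doc1',
  items: ['a', 'b'],
};

// The @list expands into an rdf:first / rdf:rest chain whose nodes are blank
// nodes. Run again in a fresh session, the processor can emit the same
// labels, which is where the list corruption comes from.
jsonld.toRDF(doc, { format: 'application/n-quads' }).then(console.log);
```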
@gsvarovsky I think that your JSON-LD example is a perfect representation of how the RDF ecosystem can often feel counter-intuitive for those who come to it from a non-academic background (like myself). Reading those proposals makes me more convinced about my own proposed solution as I think the best way to counter the lack of a clear definition of blank node scoping is forcing developers to explicitly define their own scopes whenever needed. Making scoping as explicit as possible would lower the cognitive barrier to entry IMHO. EDIT: I hadn't realized that the expression "utterly bananas" could be interpreted to have racist undertones - oops!
@gsvarovsky when you have a moment, could you please have a look at the
Looks elegant, @jacoscaz. Some thoughts:

1. Index size: Each blank node now gets a generated label, which is longer than most hand-written labels, so every index entry grows a little.

2. Restart: If I'm in the middle of creating structure using blank nodes with a scope, and the process crashes, I am in a big pickle. Even if I have tracked my position in the data upload, I don't know what blank node identifiers were used.

3. Export: Internal blank node identifiers are exposed when reading from the quadstore. This makes them effectively skolemised, because they can be used in new data to link to existing data. However, if you use a scope when inserting the new data, they will lose their identity again. At this point, intuition has taken many steps in its long walk on a short pier.

4. Default Scope: The regular write methods don't use a scope, so the blank nodes go in verbatim as before. This means that if there is any chance of blank nodes in your dataset, you have to be very careful to read the scope documentation. In other words, the default behaviour is still incorrect IMO. On the other hand, since blank nodes are so ugly already, maybe this is fine.

Ideas
@afs might have some insight here. TL;DR: if you don't like how bnodes work, don't use bnodes :)
Hi all!
A nanoid-labeled blank node is still significantly smaller than the average named node and seems to be comparable to shortened named nodes when using prefixes. I don't think slightly longer blank nodes are likely to become an issue on their own unless as a part of a bigger issue related to the comparatively low
I do agree that the default behavior is not correct, but it's also simple to maintain, easily understood and easily extendable. Furthermore, I suspect that it matches expectations of how a low-level RDF/JS library should work, as per @rubensworks' comment. I think that forcing a scope when none is provided would break a lot of assumptions, both spoken and unspoken.
I agree in principle but I can't come up with a sane way to do this without adding unreasonable amounts of complexity.
At what point should a scope be persisted? For example, imagine we're … In any case:

```js
const scopeId = await store.createScope();            // inits a new scope
// const scopeId = await store.loadScope('some-id');  // or: re-hydrates a previously-created scope
await store.putStream(stream, { scope: scopeId });    // updates the scope with each new blank node
await store.multiPut(quads, { scope: scopeId });      // updates the scope with each new blank node
await store.deleteScope(scopeId);                     // drops the scope
```

Does it even make sense to provide scoping support without persist-able scopes?
I think this is a valuable suggestion, @namedgraph. It could be that scoping is simply too dependent on each specific use-case to be easily implemented in a low-level library such as quadstore.
WRT a more integrated API, my example works better with explicit scope objects:

```js
const scope = await store.createScope();           // inits a new scope
// const scope = await store.loadScope('some-id'); // or: re-hydrates a previously-created scope
scope.id;                                  // can be used as a reference to re-hydrate the scope through store.loadScope()
await store.putStream(stream, { scope }); // updates the scope with each new blank node
await store.multiPut(quads, { scope });   // updates the scope with each new blank node
await store.deleteScope(scope);           // drops the scope, can also accept a scope id
```
Had a bit of time today so I decided to have a go at the API from my previous comment, addressing what I think is the most critical point:

I ended up using something very similar to what … I am surprised: this has basically no effect on import performance, but it still allows scopes to be reloaded at a later time without issues (https://github.com/beautifulinteractions/node-quadstore/blob/e3362a85fa24d4e93a49b6a3e432ac092dac340e/README.md#quadstoreprototypeloadscope). @gsvarovsky when / if you have a moment, your feedback would be most welcome.
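A guess at why the persistence is cheap, sketched under the assumption (not verified against the actual implementation) that scope mappings are written as extra keys in the very same backend batch as the quads:

```js
// Illustrative only: write quads and the scope's new label mappings in one
// atomic batch against an abstract-leveldown style backend, so persisting the
// scope adds no extra round trips.
function buildBatch(quadEntries, scopeId, newMappings) {
  const ops = quadEntries.map(([key, value]) => ({ type: 'put', key, value }));
  for (const [originalLabel, internalId] of newMappings) {
    ops.push({
      type: 'put',
      key: `SCOPE:${scopeId}:${originalLabel}`, // hypothetical key layout
      value: internalId,
    });
  }
  return ops; // passed to backend.batch(ops, callback)
}
```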
Hi @jacoscaz, great news that the persistent scope is not a significant performance bottleneck. It looks great, and so does the API with the scope in the …

Just for your interest (I should have mentioned it before), m-ld deals with a similar situation. Nothing to do with blank nodes (we skolemise), but in a replicated dataset, operations can be incoming from other clones at any time. We therefore provide an API that holds the current state as immutable, to allow the app to make a consistent set of edits. This 'immutable state' is captured in the API as an interface. In principle this is similar to a scope – a way of bounding operations. The way this is arranged in m-ld is to have the scope-like …

The significant idea is that the clone/store itself implements the data operations too, for simple use-cases. So you have the choice whether to just make individual operations on the mutable clone, or use an immutable state. Just a thought. The current API makes sense and seems very usable.
@gsvarovsky if I understand correctly, what you're describing is similar to LevelDB's snapshotting feature, which some implementations of …

Very cool to see that you've replicated such a feature at the application level and in a distributed manner! Thank you for mentioning it, this might come in useful in the future. For the time being, I'm happy to piggy-back on the …
Published in version
When inserting data containing blank nodes, the blank subject or object is stored verbatim with the same blank node identifier as the input. This breaks the requirement that blank nodes are scoped to the input document.

For example, I tried adding this as a unit test in quadstore.prototype.put.js (a sketch of the test is shown below). The test fails because the two invocations of put are using the same blank node label; instead, they should result in different quads with disjoint subjects.

For more complex examples, such as lists, the accidental re-use of blank node identifiers (for example after a re-start) could badly affect data integrity.
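A minimal sketch of the kind of test described above; the terms, the store setup and the shape of the get() result are assumptions rather than the original test code:

```js
const assert = require('assert');
const { DataFactory } = require('n3');
const { namedNode, blankNode, quad } = DataFactory;

it('should not let blank nodes from separate puts collide', async () => {
  // `store` is assumed to be an already-initialized quadstore instance.
  await store.put(quad(blankNode('b1'), namedNode('http://ex.org/p'), namedNode('http://ex.org/o')));
  await store.put(quad(blankNode('b1'), namedNode('http://ex.org/p'), namedNode('http://ex.org/o')));

  // With verbatim labels the two quads are identical and collapse into one,
  // so this assertion fails; proper scoping should yield two distinct quads
  // with disjoint blank subjects.
  const { items } = await store.get({});
  assert.strictEqual(items.length, 2);
});
```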