Importing long literals fails with cryptic error #158
I'm trying to import my dataset but quadstore fails with a cryptic "too long" error. I found this to be caused by a long literal in my dataset, and it looks like the limit is 9999 bytes. Is it possible to support larger values?

Minimal repro case:
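(The original snippet is not preserved in this thread; the following is a reconstruction of the kind of import that triggers the error, assuming quadstore's documented setup with an abstract-level backend — memory-level here — and an RDF/JS data factory.)

```typescript
import { Quadstore } from 'quadstore';
import { MemoryLevel } from 'memory-level';
import { DataFactory } from 'rdf-data-factory';

const dataFactory = new DataFactory();
const store = new Quadstore({
  backend: new MemoryLevel(),
  dataFactory,
});

await store.open();

// A literal value longer than 9999 characters triggers the "too long" error.
await store.put(dataFactory.quad(
  dataFactory.namedNode('ex://subject'),
  dataFactory.namedNode('ex://predicate'),
  dataFactory.literal('x'.repeat(20_000)),
));
```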
Hello @rasendubi! That error is a bit cryptic indeed; I'll work on making it more explicit.

As for the issue per se, the underlying problem is that the serialization mechanism requires the length of each term to be appended at the end of the serialized quad, so as to be able to efficiently reconstruct terms upon index reads. Supporting unlimited lengths would require scanning for separators, which would make deserialization significantly slower. The current limit of 9999 chars is due to the fact that quadstore reserves 4 characters (digits) for each of these lengths, a tradeoff that covers the vast majority of cases. Also, consider that each quad (and therefore each term in a quad) appears in the database as many times as the number of configured indexes, meaning a 20_000-char term would actually use 20_000 * 6 = 120_000 chars when the store is configured with the six default indexes.

Presently, storing terms of arbitrary length can be done with a little bit of pre-processing at ingestion time and post-processing at query time.
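To make that mechanism concrete, here is a simplified sketch of the scheme described above; it is not quadstore's actual format, and names like `serializeQuad` are illustrative only. Term values are concatenated, and each term's length is appended as a fixed-width 4-digit decimal field, which is what caps individual terms at 9999 characters while keeping deserialization a matter of simple offset arithmetic rather than separator scanning.

```typescript
const LENGTH_WIDTH = 4;

const serializeQuad = (terms: string[]): string => {
  for (const term of terms) {
    if (term.length > 9999) throw new Error('value too long');
  }
  // Concatenated term values followed by their fixed-width lengths.
  const lengths = terms
    .map((term) => String(term.length).padStart(LENGTH_WIDTH, '0'))
    .join('');
  return terms.join('') + lengths;
};

const deserializeQuad = (serialized: string, termCount: number): string[] => {
  // The length fields sit at a known offset from the end, so no scanning.
  const lengthsOffset = serialized.length - termCount * LENGTH_WIDTH;
  const terms: string[] = [];
  let offset = 0;
  for (let i = 0; i < termCount; i++) {
    const field = serialized.slice(
      lengthsOffset + i * LENGTH_WIDTH,
      lengthsOffset + (i + 1) * LENGTH_WIDTH,
    );
    const length = parseInt(field, 10);
    terms.push(serialized.slice(offset, offset + length));
    offset += length;
  }
  return terms;
};
```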
I wonder if it'd be worth shipping this mechanism within quadstore itself, perhaps making it opt-in. I've never had to store terms anywhere near that long, but yours is the third use case for extra-long terms that I have come across.
I believe this has some (un)expected consequences, like range queries no longer working? Not that I expect to range over these long values, but that might be a counter-argument for including this by default.

Another option is changing the length encoding. Decimal encoding is quite inefficient: it only allows 10,000 distinct lengths in 4 bytes. If we encoded lengths in binary, the same 4 bytes would cover lengths up to 4 GB. This has the downside of requiring us to process serialized values as byte strings (which we probably should do anyway). If that is undesirable for some reason, we could use hex (64 KB), base32 (1 MB), or base64 (16 MB) encodings.

It's also possible to use a variable-length encoding (e.g., base-128) to support unlimited lengths and simultaneously decrease the average size: most values (those up to 127 bytes) would only need one byte to encode their length.

These options are obviously not backward-compatible and would mess up all existing databases. One way to fix that is to exploit the fact that the current encoding has a lot of bits that are always set to 0. We could sacrifice one bit to mark the encoding (0 for old, 1 for new). This would halve the available length range but would keep it backward-compatible.
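As a sketch of the variable-length idea (illustrative only, not quadstore code), a base-128 varint carries 7 bits of the length per byte, with the high bit flagging continuation; lengths up to 127 fit in a single byte, and a 4-byte field would reach 2^28 - 1 (~256 MB).

```typescript
// Encode a length as a little-endian base-128 varint.
const encodeVarint = (length: number): Uint8Array => {
  const bytes: number[] = [];
  let n = length;
  do {
    let byte = n & 0x7f;
    n >>>= 7;
    if (n > 0) byte |= 0x80; // continuation bit: more bytes follow
    bytes.push(byte);
  } while (n > 0);
  return Uint8Array.from(bytes);
};

// Decode a varint back into a number.
const decodeVarint = (bytes: Uint8Array): number => {
  let n = 0;
  let shift = 0;
  for (const byte of bytes) {
    n |= (byte & 0x7f) << shift;
    if ((byte & 0x80) === 0) break;
    shift += 7;
  }
  return n;
};

// encodeVarint(9999) -> 2 bytes; decodeVarint(encodeVarint(9999)) -> 9999
```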
@rasendubi I love this - lots of good ideas in your comment.

I've experimented quite a bit with binary encoding of lengths but couldn't find a way to do so without having to resort to processing serialized values as byte strings rather than plain strings.

I had never even thought of using hex, base32 or base64! Do you reckon we could swap the current decimal encoding for one of those?

I have thought in the past about using variable-length length encoding but ultimately put that on the back burner as I had more pressing priorities. So far, fixed-length length encoding has made for a good tradeoff between performance, simplicity and storage space.

All that said, I'd be more than happy to revisit all of the current assumptions and choices and evaluate new ways of doing things.
This is not a big issue. Quadstore is the first project of this kind I have ever worked on, both when it comes to databases and when it comes to RDF/JS. It is bound to evolve quickly. Most major versions have had breaking changes when it comes to the (de)serialization mechanism. We could slot these changes in for the next major release. What I'm not too keen on is major regressions in performance and readability.
Correct, although there might be a way to work around this, as we'd be able to add a dedicated (de)serialization type for long literals in places like https://github.com/belayeng/quadstore/blob/master/src/serialization/terms.ts#L210
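As a purely hypothetical illustration of that idea (none of these names exist in quadstore): dispatch on a type tag so that short literals keep the compact fixed-width encoding, and only long literals take a dedicated branch with a wider length field.

```typescript
const SHORT_LIMIT = 9999;

// 'S' = short literal, existing-style 4-digit decimal length;
// 'L' = dedicated long-literal type with a wider, radix-36 length field.
const serializeLiteralValue = (value: string): string =>
  value.length <= SHORT_LIMIT
    ? 'S' + String(value.length).padStart(4, '0') + value
    : 'L' + value.length.toString(36).padStart(8, '0') + value;

const deserializeLiteralValue = (serialized: string): string =>
  serialized[0] === 'S'
    ? serialized.slice(5, 5 + parseInt(serialized.slice(1, 5), 10))
    : serialized.slice(9, 9 + parseInt(serialized.slice(1, 9), 36));
```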
Worth noting that a very simple tweak would be to use a different radix while keeping the current approach:

```typescript
parseInt('zzzz', 36);        // returns 1_679_615
(1_679_615).toString(36);    // returns 'zzzz'
```
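A padded, fixed-width take on the same tweak might look like this (hypothetical helper names, not the branch's actual code):

```typescript
const LENGTH_WIDTH = 4;
const MAX_TERM_LENGTH = Math.pow(36, LENGTH_WIDTH) - 1; // 1_679_615

// Encode a term length as a zero-padded, 4-character radix-36 field.
const encodeLength = (length: number): string => {
  if (length > MAX_TERM_LENGTH) {
    throw new Error(`term too long: ${length} > ${MAX_TERM_LENGTH}`);
  }
  return length.toString(36).padStart(LENGTH_WIDTH, '0');
};

const decodeLength = (field: string): number => parseInt(field, 36);

// encodeLength(9999) -> '07pr'; decodeLength('07pr') -> 9999
```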
Quickly tested using radix 36 in https://github.com/belayeng/quadstore/tree/encoding-lengths-using-radix-36; performance remains exactly the same.
4-byte radix 36 should give us a ~1.6M-character length limit, certainly good enough for my use case, and it seems like a very quick fix 👍 It would be interesting to experiment with and benchmark other encodings, though (especially the variable-length one, which I feel has good potential), but that's likely a longer project.
@rasendubi I'll release a 13.x alpha version later tonight with this fix. In the meantime, you can check out the branch linked above and build it locally.
@rasendubi just published the 13.x alpha mentioned above.
Just tested with my data and it works fine.
It certainly is! I'm starting work on the integration of a text CRDT into m-ld (m-ld/m-ld-spec#35), so it's very likely I'll be making more use of long literals. Many thanks for keeping me tagged, and sorry to be late to the party; looks like you found a nice solution for your use case.
FYI, just released a new version that includes this change.