Cannot Insert or Update Data into a bit(64) in SQLAlchemy #112

Closed
ivanstepanovftw opened this issue Jan 6, 2025 · 2 comments
ivanstepanovftw commented Jan 6, 2025

I have a table "messages" (model shown below) with a BIT(64) column in PostgreSQL, and I'm trying to insert or update data into that column using SQLAlchemy (via pgvector-python). However, I'm getting an error about the bit string length not matching BIT(64). I've tried all of the following formats for the data:

  • Bit(embedding) (expects bytes-like object)
  • "010101" (expects bytes-like object)
  • b"010101" (bit string length 528 does not match type bit(64))
  • b'B"010101"' (bit string length 536 does not match type bit(64))

But no matter which approach I take, I see an error similar to:

ERROR:__main__:Error processing batch: (sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.StringDataLengthMismatchError'>: bit string length 536 does not match type bit(64)
[SQL: UPDATE messages SET message_hash=$1, updated_at=now() WHERE messages.id = $2::BIGINT]
[parameters: [(b"B'0000100101000100000101001100001100011011010111000111101111111110'", 23189), (b"B'1000100001000100000101001100011100010011001110001101101111011110'", 724950), (b"B'1011011100000100010001111110101111001011111011010110101111011010'", 103289), (b"B'1001101100010100010001100100001101011011001111011100101111111010'", 90)]]
(Background on this error at: https://sqlalche.me/e/20/dbapi)
ERROR:__main__:Fatal error: (sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.StringDataLengthMismatchError'>: bit string length 536 does not match type bit(64)

or

sqlalchemy.exc.DBAPIError: (sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.StringDataLengthMismatchError'>: bit string length 528 does not match type bit(64)
[SQL: UPDATE messages SET message_hash=$1, updated_at=now() WHERE messages.id = $2::BIGINT]
[parameters: [(b"'0000100101000100000101001100001100011011010111000111101111111110'", 23189), (b"'1000100001000100000101001100011100010011001110001101101111011110'", 724950), (b"'1011011100000100010001111110101111001011111011010110101111011010'", 103289), (b"'1001101100010100010001100100001101011011001111011100101111111010'", 90)]]
(Background on this error at: https://sqlalche.me/e/20/dbapi)
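
(A quick sanity check on the two reported lengths, which suggests asyncpg is treating the bytes value as raw packed bit data, 8 bits per byte, rather than as a textual bit literal:)

# 64 bit characters plus the surrounding quotes -> 66 bytes -> 528 "bits"
print(len(b"'" + b"0" * 64 + b"'") * 8)   # 528
# adding the B prefix -> 67 bytes -> 536 "bits"
print(len(b"B'" + b"0" * 64 + b"'") * 8)  # 536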

What I Tried

  1. Just giving a plain string of bits (e.g., "010101").
  2. Supplying the value in a B'...' literal form.
  3. Supplying raw bytes, like b"010101".
  4. Using a Bit object from pgvector.utils.Bit.

None of these work; every attempt fails with either the length-mismatch error or the bytes-like-object error shown above.


Snippet of the Model

import numpy as np
from sqlalchemy import BigInteger
from sqlalchemy.orm import Mapped, mapped_column
from pgvector.sqlalchemy import BIT
from pgvector.utils import Bit

class Messages(Base):
    __tablename__ = "messages"
    id: Mapped[int] = mapped_column(BigInteger, primary_key=True)

    # other columns...

    message_hash: Mapped[np.ndarray | None] = mapped_column(BIT(64), nullable=True)
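
A quick way to double-check that the mapping itself compiles to BIT(64) (a small sketch, assuming the model above is importable):

from sqlalchemy.dialects import postgresql
from sqlalchemy.schema import CreateTable

# Print the DDL SQLAlchemy would emit for this model; message_hash
# should appear as BIT(64).
print(CreateTable(Messages.__table__).compile(dialect=postgresql.dialect()))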

Code Snippet

assert embeddings_hash.ndim == 1, f"{embeddings_hash.shape=}"
assert embeddings_hash.shape[0] == len(batch), f"{embeddings_hash.shape=} {len(batch)=}"
assert embeddings_hash.dtype == np.uint64, f"{embeddings_hash.dtype=}"

embeddings_hash = embeddings_hash.view(np.uint8)  # shape = (batch_size * 8,)
embeddings_hash = embeddings_hash.reshape(configuration["batch_size"], 8)
embeddings_hash = np.unpackbits(embeddings_hash, axis=1)  # shape = (batch_size, 64), one bit per column

updates = [
    {
        "b_id": message.id,
        "message_hash": f"B'{Bit(embedding_hash.reshape(-1)).to_text()}'".encode("ascii")
    }
    for message, embedding_hash in zip(batch, embeddings_hash)
]

await session.execute(
    Messages.__table__.update()
    .where(Messages.id == bindparam("b_id"))
    .values({"message_hash": bindparam("message_hash")})
    .execution_options(synchronize_session=False),
    updates
)

Expected Behavior
I want to store a 64-bit value in the BIT(64) column without errors.


Actual Behavior
PostgreSQL complains that the bit string length does not match BIT(64), or asyncpg complains that it expects a bytes-like object.


Environment

  • Python 3.12.3
  • PostgreSQL 17
  • SQLAlchemy==2.0.36 (using async)
  • asyncpg==0.30.0
  • pgvector==0.3.5

ankane commented Jan 6, 2025

Hi @ivanstepanovftw, see #110 (and please check existing issues before creating new ones).

ankane closed this as completed Jan 6, 2025
ivanstepanovftw (Author) commented

The fix is:

+from asyncpg import BitString
...
updates = [
    {
        "b_id": message.id,
-        "message_hash": f"B'{Bit(embedding_hash.reshape(-1)).to_text()}'".encode("ascii")
+        "message_hash": BitString(Bit(embedding_hash.reshape(-1)).to_text())
    }
    for message, embedding_hash in zip(batch, embeddings_hash)
]
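
For reference, a minimal self-contained sketch of the same conversion (the hash value below is made up; variable names mirror the snippets above):

import numpy as np
from asyncpg import BitString
from pgvector.utils import Bit

# One 64-bit hash per message (made-up value for illustration)
embeddings_hash = np.array([1234567890123456789], dtype=np.uint64)

# View each uint64 as 8 bytes, then unpack into 64 individual bits
bits = np.unpackbits(embeddings_hash.view(np.uint8).reshape(-1, 8), axis=1)

# Bit(...).to_text() yields a plain '0'/'1' string of length 64, and
# asyncpg's BitString accepts that string, so the bound parameter is
# exactly 64 bits -- no quotes and no B prefix
text = Bit(bits[0]).to_text()
assert len(text) == 64
value = BitString(text)

With that, the value bound in the UPDATE carries its own 64-bit length and matches the BIT(64) column.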
