
strategies to prevent OOM when retrieving large field values #3405

Open

avallete opened this issue Mar 17, 2025 · 2 comments

avallete commented Mar 17, 2025

First, thank you for maintaining the pg library. I'm having an issue dealing with large field values that I'd like to get community guidance on.

Problem Description

When working with PostgreSQL tables that contain very large field values (text columns with hundreds of MB or more), we face a risk of out-of-memory (OOM) errors. Even when using pagination techniques like cursors, the library loads entire rows into memory, which becomes problematic with extremely large fields.
Our ideal solution would be a way to set a maximum threshold for query result size and automatically abort when exceeded, preventing memory issues.

Reproduction Case

Here's a minimal script demonstrating the issue:

const pg = require('pg')

async function demonstrateMemoryIssue() {
  // Create pool
  const pool = new pg.Pool({
    connectionString: 'postgresql://postgres:postgres@localhost:5432/postgres',
  })

  try {
    // Create test table and insert a large row
    await pool.query('DROP TABLE IF EXISTS memory_test')
    await pool.query(`
      CREATE TABLE memory_test (
        id SERIAL PRIMARY KEY,
        name TEXT,
        large_data1 TEXT,
        large_data2 TEXT,
        large_data3 TEXT,
        large_data4 TEXT,
        large_data5 TEXT,
        large_data6 TEXT,
        large_data7 TEXT,
        large_data8 TEXT,
        large_data9 TEXT,
        large_data10 TEXT,
        large_data11 TEXT,
        large_data12 TEXT,
        large_data13 TEXT,
        large_data14 TEXT,
        large_data15 TEXT,
        large_data16 TEXT,
        large_data17 TEXT,
        large_data18 TEXT,
        large_data19 TEXT,
        large_data20 TEXT
      )
    `)

    // Insert a single row with 20 columns of 500MB each (total 10GB)
    console.log('Inserting large rows...')
    await pool.query(`
      INSERT INTO memory_test (
        name, 
        large_data1, large_data2, large_data3, large_data4, large_data5,
        large_data6, large_data7, large_data8, large_data9, large_data10,
        large_data11, large_data12, large_data13, large_data14, large_data15,
        large_data16, large_data17, large_data18, large_data19, large_data20
      ) 
      VALUES (
        'Ultra Large Row',
        repeat('A', 500 * 1024 * 1024), repeat('B', 500 * 1024 * 1024),
        repeat('C', 500 * 1024 * 1024), repeat('D', 500 * 1024 * 1024),
        repeat('E', 500 * 1024 * 1024), repeat('F', 500 * 1024 * 1024),
        repeat('G', 500 * 1024 * 1024), repeat('H', 500 * 1024 * 1024),
        repeat('I', 500 * 1024 * 1024), repeat('J', 500 * 1024 * 1024),
        repeat('K', 500 * 1024 * 1024), repeat('L', 500 * 1024 * 1024),
        repeat('M', 500 * 1024 * 1024), repeat('N', 500 * 1024 * 1024),
        repeat('O', 500 * 1024 * 1024), repeat('P', 500 * 1024 * 1024),
        repeat('Q', 500 * 1024 * 1024), repeat('R', 500 * 1024 * 1024),
        repeat('S', 500 * 1024 * 1024), repeat('T', 500 * 1024 * 1024)
      )
    `)

    console.log('Fetching all rows...')
    // Even with a cursor, this will cause memory issues
    const client = await pool.connect()

    // Try with cursor approach
    const Cursor = require('pg-cursor')
    const cursor = client.query(new Cursor('SELECT * FROM memory_test'))

    // Read in small batches (still causes memory issues)
    cursor.read(10, (err, rows) => {
      if (err) {
        console.error('Error reading rows:', err)
      } else {
        console.log(`Retrieved ${rows.length} rows`)
        console.log(rows)
        process.exit(0)
      }
    })
  } catch (error) {
    console.error('Error:', error)
  }
}

demonstrateMemoryIssue()
// run with: /usr/bin/time -l node script.js

Attempted Solutions

I've tried:

  • Using cursors with small batch sizes, but as the reproduction shows, a cursor still works row by row, and a single row can be gigabytes in size, so this does not prevent the program from running out of memory
  • Plugging into pg events to keep track of the current query's buffer size, but I couldn't find a supported way to do this (the closest approximation I can think of is the unsupported sketch below)
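
For reference, here is a rough user-land sketch of that buffer-tracking idea. It relies on client.connection.stream, an internal and undocumented property of the pure-JS client (the underlying socket), and on destroying that socket once a budget is exceeded, so treat it as a best-effort hack rather than a supported approach:

const pg = require('pg')

// ASSUMPTION: client.connection.stream (the underlying net.Socket) is internal
// to the pure-JS client and may change or be absent (e.g. with native bindings).
const MAX_RESULT_BYTES = 100 * 1024 * 1024 // arbitrary 100MB budget

async function queryWithByteCap(pool, text, values) {
  const client = await pool.connect()
  const socket = client.connection && client.connection.stream
  let received = 0
  let tripped = false

  const onData = (chunk) => {
    received += chunk.length
    if (received > MAX_RESULT_BYTES && !tripped) {
      tripped = true
      // Brutal: kills the connection mid-query. Whether and how the pending
      // promise rejects depends on pg internals, so this is best-effort only.
      socket.destroy(new Error('query result exceeded ' + MAX_RESULT_BYTES + ' bytes'))
    }
  }

  client.once('error', () => {}) // avoid a crash if the client re-emits the socket error
  if (socket) socket.on('data', onData)
  try {
    return await client.query(text, values)
  } finally {
    if (socket) socket.removeListener('data', onData)
    client.release(tripped) // discard the connection if we destroyed its socket
  }
}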

Questions

  • Is there a recommended approach to cap the maximum amount of data returned by a query, so that any query result stays below a specific memory budget and the query is aborted gracefully before it can OOM the calling program? (A SQL-side workaround sketch follows below.)
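
The only SQL-side workaround I can see so far is to do the capping in the query itself: ask the server how big the field is, then pull it down in slices with substr() so no single row ever has to fit in memory at once. A sketch against the reproduction table above (it only helps when you already know which columns can get large):

const CHUNK = 10 * 1024 * 1024 // 10MB slices, arbitrary

async function fetchLargeFieldInChunks(pool, id) {
  // 1. Ask the server how big the value is before pulling it over the wire.
  //    length() counts characters, matching substr()'s character-based offsets.
  const { rows } = await pool.query(
    'SELECT length(large_data1) AS len FROM memory_test WHERE id = $1',
    [id]
  )
  const len = rows[0].len

  // 2. Fetch the field slice by slice; each round trip holds at most ~CHUNK
  //    characters in memory (substr() is 1-indexed).
  for (let offset = 0; offset < len; offset += CHUNK) {
    const { rows: slice } = await pool.query(
      'SELECT substr(large_data1, $2, $3) AS part FROM memory_test WHERE id = $1',
      [id, offset + 1, CHUNK]
    )
    // ...append slice[0].part to a file or stream instead of accumulating it...
  }
}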

Thank you for taking the time to answer my question and point me in the right direction.

Edit: After searching older issues, I discovered that this issue is actually a duplicate of #2336.

In addition to what was mentioned there, my use case involves an application where the server running the query and the queried database belong to different actors.

If no new way of handling this has come up since then, and if the community is still open to reviewing contributions along those lines, I could try to develop a solution for this use case by integrating it at the pool connection level, as mentioned in the comments there.

avallete (Author) commented

I've poked at this a bit. While implementing a check at the "connection" level for each message might help mitigate these issues, I think there is still one uncatchable error, related to how the underlying libpq works.

Tell me if I'm wrong, but it seems that libpq always retrieves the full row into memory before it even reaches node-postgres. That means that in the case I mention (one row with a very large field), we would end up with an OOM caused not by node-postgres but by libpq itself.

The global limit might still help, though, as long as a single row doesn't by itself OOM the program.

With a 100MB limit and a row like this, I get the expected error:

      INSERT INTO memory_test (
        name, 
        large_data1, large_data2, large_data3, large_data4
      ) 
      VALUES (
        'Ultra Large Row',
        repeat('A', 50 * 1024 * 1024), repeat('B', 50 * 1024 * 1024),
        repeat('O', 50 * 1024 * 1024), repeat('P', 50 * 1024 * 1024)
      )

node:events:495
      throw er; // Unhandled 'error' event
      ^

Error: Query result size exceeded the configured limit
...
    at TCP.onStreamRead (node:internal/stream_base_commons:190:23) {
  code: 'RESULT_SIZE_EXCEEDED',
  resultSize: 209715310,
  maxResultSize: 104857600
}

With the same limit and data like this, I get:

      INSERT INTO memory_test (
        name, 
        large_data1, large_data2, large_data3, large_data4
      ) 
      VALUES (
        'Ultra Large Row',
        repeat('A', 1000 * 1024 * 1024), repeat('B', 1000 * 1024 * 1024),
        repeat('O', 1000 * 1024 * 1024), repeat('P', 1000 * 1024 * 1024)
      )
Inserting large rows...
Error: error: invalid memory alloc request size 4194304092
    at /<redacted>/test-node-postgres/node_modules/.pnpm/pg@file+..+..+node-postgres+packages+pg/node_modules/pg/lib/client.js:546:17
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async demonstrateMemoryIssue (/<redacted>/test-node-postgres/index.js:45:5) {
  length: 95,
  severity: 'ERROR',
  code: 'XX000',
  detail: undefined,
  hint: undefined,
  position: undefined,
  internalPosition: undefined,
  internalQuery: undefined,
  where: undefined,
  schema: undefined,
  table: undefined,
  column: undefined,
  dataType: undefined,
  constraint: undefined,
  file: 'mcxt.c',
  line: '1103',
  routine: 'palloc0'
}

I wonder if there might be a way to handle this gracefully in libpq as well, though that would greatly increase the scope of a PR.
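
As an aside, the crash in the first case above is just Node's default behaviour for an unhandled 'error' event. With the experimental maxResultSize option from my work-in-progress branch (not part of any released pg version), the error can at least be observed instead of killing the process, roughly like this:

const pg = require('pg')

// Sketch only: the maxResultSize option and the RESULT_SIZE_EXCEEDED /
// resultSize fields on the error come from my experimental patch, not from a
// released pg version, and exactly which emitter carries the 'error' event
// may differ.
const pool = new pg.Pool({
  connectionString: 'postgresql://postgres:postgres@localhost:5432/postgres',
  maxResultSize: 100 * 1024 * 1024, // experimental option from the patch
})

async function guardedQuery(text) {
  const client = await pool.connect()
  // An unhandled 'error' event crashes the process (node:events:495 above),
  // so attach a handler on the checked-out client before running the query.
  client.on('error', (err) => {
    if (err.code === 'RESULT_SIZE_EXCEEDED') {
      console.error('aborted: got ' + err.resultSize + ' bytes, limit ' + err.maxResultSize)
    }
  })
  try {
    return await client.query(text)
  } finally {
    client.release(true) // drop the connection; it may be mid-result
  }
}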

sehrope (Contributor) commented Mar 18, 2025

Unless you're explicitly using the native bindings, you wouldn't be using libpq with this module. The default is the pure-JS driver, which does not call out to libpq, so any allocations would be handled by Node and the allocation error would come from the Node runtime.

If you are using the native bindings, I don't think there's any option in libpq we could enable to set a max or gracefully handle it.

The idea itself is doable, though, in both pure-JS (probably easier) and libpq. The PostgreSQL wire protocol ("FEBE") sends each row in its own data message: https://www.postgresql.org/docs/current/protocol-message-formats.html#PROTOCOL-MESSAGE-FORMATS-DATAROW

And every FEBE message in PostgreSQL has a type and message length at the start of the message, so it should be possible to have a connection-level max message size. Anything bigger could be discarded. The error handling might be tricky (we'd have to create an error for the client, finish reading or close the result, then sync) but it's definitely doable.
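
For illustration only (this is the documented wire format, not pg internals), the check could key off the five-byte header that precedes every backend message body:

// Every backend message starts with a 1-byte type and a 4-byte big-endian
// length; the length counts itself but not the type byte, so a DataRow's size
// is known before its body has been buffered.
function peekMessageHeader(buf) {
  const type = String.fromCharCode(buf[0]) // 'D' = DataRow
  const length = buf.readInt32BE(1)
  return { type, length }
}

// A hypothetical connection-level guard could then do something like:
//   const { type, length } = peekMessageHeader(chunk)
//   if (type === 'D' && length > maxMessageSize) { /* error out, drain, sync */ }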
