Exploring the limits of Postgres: when does it break?

Andrew Atkinson

In this post, we explore the limits of single-node Postgres and what optimizations can extend its usefulness as a primary application database before requiring more advanced scaling solutions.

Monolithic ecommerce app on RDS PostgreSQL

Let's use a real-world scenario: a monolithic web application that uses PostgreSQL as its application database, running on AWS RDS. The RDS instance is a 2xlarge.

The resources are:

  • 8 vCPUs
  • 64 GB Memory
  • 20,000 Max I/O operations per second (IOPS)

In our ecommerce app, we display products for sale, and shoppers can add them to their carts and place orders. We want to support a large number of concurrent shoppers and avoid high request latency that might reduce our sales.

Imagine we're doing architectural planning, and want to estimate what a 10x increase in concurrent users would look like with our current PostgreSQL instance. Let’s factor in the cost of server instances with a low and high value, so we can place our current usage in between.

We’ll use instances.vantage.sh to explore instances, gathering resource and cost information. We'll choose on-demand pricing, per month, in the US East region. The 2XL instance costs $650/month.

Let’s choose a huge instance as an example of a high value. We'll use a 32XL instance (db.x2iedn.32xlarge) which has these resources:

  • 128 vCPUs
  • 4096 GB (4 TB) Memory
  • 260,000 IOPS

Using the same cost calculation inputs above, Vantage shows us this instance costs $52,000 per month! Let's recap what going from the small to large instance would look like.

  • $650/month to $52,000/month, 80x more expensive
  • 20K to 260K IOPS, a 13x IOPS availability increase
  • 8 to 128 vCPUs, 16x more vCPUs
  • 64 GB to 4096 GB, 64x more memory

For 80x more monthly cost, we can purchase a lot more server resources. The resources we're most concerned with are CPU, memory, and IO operations.

Let's focus on IO operations, as we can attempt to translate application-level interactions into SQL operations, and SQL operations into IO.

When would we run out of IO operations?

Let's think about some of the ways our ecommerce application writes and reads data. Write operations could come from behind-the-scenes work, like interactions with suppliers that add products for sale. Writes would also come from customer activity, as customers add products to their wishlists or shopping carts.

Update operations could reflect product inventory changes: when products sell, their available quantities must be reduced. Updates or deletes would also happen when customers change the quantities of products they intend to purchase.

Between supplier and customer actions, we'd expect many more from customers. All of these application interactions are eventually turned into SQL operations like SELECT, INSERT, UPDATE, and DELETE against the application database, and the reads and writes they perform count toward the available IOPS quota.

The average duration of our IO operations is also a factor. In an equivalent time period, a slow app completes fewer operations while a fast app completes more.

Let's estimate an average duration. If every operation took 1ms, we'd expect 1000 operations in 1 second. However, 1ms is unrealistically fast. Let's choose an average duration that's more realistic. How can we do that?

Estimating request proportions and durations

Web apps usually have many more read operations than write operations. Let's use a ratio of 4 read operations to 1 write operation, or 4:1. That means that over 1000 IO operations, we'd expect 800 reads and 200 writes.

Let's now pick those more "realistic" database durations. For writes, let's go with 10ms, and for reads, let's go with 100ms. More realistic than 1ms, but still pretty fast!

Let's combine the write and read durations. At 10ms per write, we could perform 100 write operations per second, and at 100ms per read, 10 read operations per second. That's 110 total ops/second.

At 1 concurrent user, we'd expect 110 ops/second. At 10 concurrent users, we'd expect around 1100 ops/second.

  • 1 user, 110 IOPS
  • 10 concurrent users, 1100 IOPS
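
To make the arithmetic concrete, here's a minimal Ruby sketch of these back-of-the-envelope numbers; the durations are the assumptions above, not measurements from a real system.

```ruby
# Back-of-the-envelope IO estimate per concurrent user.
# The 10ms write and 100ms read durations are assumptions, not measurements.
WRITE_DURATION_MS = 10
READ_DURATION_MS  = 100

writes_per_second = 1000 / WRITE_DURATION_MS  # => 100
reads_per_second  = 1000 / READ_DURATION_MS   # => 10
iops_per_user     = writes_per_second + reads_per_second  # => 110

[1, 10].each do |users|
  puts "#{users} concurrent user(s) ~= #{users * iops_per_user} IOPS"
end
# 1 concurrent user(s) ~= 110 IOPS
# 10 concurrent user(s) ~= 1100 IOPS
```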

Let's keep working with these numbers to project concurrent user growth.

Concurrent users, IOPS consumption, instance costs

Our 2XL instance supports 20,000 IOPS. How many concurrent users might that support? Dividing 20,000 by 110 ops/second per user, we could expect to support around 180 concurrent users.

Let's imagine our ecommerce app has around 100 concurrent users today. Using our earlier projections, they'd generate around 11,000 IOPS, which is within the budget of 20,000 IOPS.

We're planning for 10x concurrent user growth, in other words moving from 100 concurrent users to 1000.

At that level, we'd expect around 110,000 IOPS, well past the IOPS quota for the 2XL instance. Here's where the numbers stand:

  • Current concurrent users: 100, generating 11,000 IOPS, current quota: 20,000 IOPS, inside quota
  • 10x growth, 1000 concurrent users, 110,000 IOPS, well past the quota for the 2XL instance
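
Continuing the sketch in Ruby, we can compare that rough per-user estimate against the 2XL instance's 20,000 IOPS quota:

```ruby
IOPS_PER_USER = 110      # rough per-user estimate from above
QUOTA_2XL     = 20_000   # 2XL instance IOPS quota

puts "Max users on 2XL: ~#{QUOTA_2XL / IOPS_PER_USER}"
# Max users on 2XL: ~181

[100, 1_000].each do |users|
  demand = users * IOPS_PER_USER
  status = demand <= QUOTA_2XL ? "inside quota" : "past quota"
  puts "#{users} users ~= #{demand} IOPS (#{status})"
end
# 100 users ~= 11000 IOPS (inside quota)
# 1000 users ~= 110000 IOPS (past quota)
```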

How about that huge 32XL instance? The 32XL allows for 260,000 IOPS. Using our estimate of 110,000 IOPS, we'd be well inside the quota. But at what cost? We'd be spending 80x more, going from $650/month to $52,000/month.

Instead of that kind of increase, maybe we could perform optimizations so that our application queries consume fewer resources, extending the life of our current instance and avoiding such a costly upgrade.

Which operations would we start to optimize? How many IOPS are consumed by different types of SQL operations?

Translating IOPS to SQL Operations

Our optimization efforts would focus on eliminating and optimizing SQL queries, which indirectly reduces IOPS.

How do SQL operations map to IO operations? Let's explore each operation type and create a rough mapping.

For SELECT queries, let's assume a worst-case scenario where each one causes a disk read. Given that, we'd estimate 1 read IO per SELECT.

For INSERT operations, more than 1 write IO operation is needed. That's because, in addition to the logical table row, there are indexes to maintain and write-ahead logging (WAL) responsible for data durability.

Let's imagine a products table our app inserts into. If that table has 5 indexes, each index receives a write operation to stay maintained. Totaling it up, we'll count 1 IO operation for storing the table row, 1 each for the 5 indexes, and 1 for WAL, for a total of 7.

For UPDATEs, we'll imagine there's a read as well, plus all the IO from an INSERT. We'll go with 8 IO operations for an UPDATE.

For DELETE operations, let's go with 5 IO operations.
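
To see how an operation mix adds up, here's a rough Ruby sketch using this mapping; the per-operation costs are the assumptions above, and the per-second mix is hypothetical.

```ruby
# Rough IO cost per SQL operation, using the assumptions above
# (a 5-index table, plus a WAL write on each change).
IO_PER_OPERATION = {
  select: 1,  # worst case: one disk read
  insert: 7,  # 1 table row + 5 indexes + 1 WAL write
  update: 8,  # read the row, then the same writes as an INSERT
  delete: 5
}.freeze

# Hypothetical per-second operation mix for the app.
mix = { select: 80, insert: 10, update: 5, delete: 5 }

total_io = mix.sum { |op, count| count * IO_PER_OPERATION.fetch(op) }
puts "Estimated IO operations per second: #{total_io}"
# => 80*1 + 10*7 + 5*8 + 5*5 = 215
```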

With that mapping as a rough guideline, we can begin targeting each operation type and looking for ways to eliminate and optimize.

Once we've done that, we'll still want to plan for the concurrent user activity growth. What would that look like?

Scaling into thousands of queries per second (QPS)

Postgres works best with high frequency, short duration queries, usually classified as online transaction processing (OLTP).

One of the stress points for Postgres can be concurrent connections. While frameworks like Ruby on Rails use connection pooling to achieve good reuse of connections, they still tend to hold a lot of open database connections in an idle state.

To go beyond that, software like PgBouncer can serve more transactions over fewer connections, or "multiplex" connections. PgBouncer can be run standalone or may be integrated into your Postgres provider's platform.

To scale into thousands of queries per second, we'd likely add a solution like PgBouncer.

Now that we've looked at IOPS and database connections, what about maximizing the use of memory on our instance?

Efficient use of memory

When data is accessed from Postgres's internal buffer cache, the lowest latencies can be achieved. That's the ideal state. We want that to be the case for queries like products for sale or product details, since that data generally doesn't change much.

However, the buffer cache has a limited capacity, so we want to use it optimally.

Let’s imagine Postgres normally operates with 90% of system memory in use. Commonly, 25% of the total system memory, for example 16GB of our total 64GB available on our 2XL instance, is configured for buffer cache (via the shared_buffers parameter).

The remaining memory is used for settings like work_mem, which is allocated per operation (for sorts, hashes, and similar steps). We want to make sure there's enough per-operation memory available for our queries.
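
To see how the memory budget is currently split, one option is to query pg_settings. A minimal sketch, assuming an ActiveRecord connection to the database (on RDS these settings are typically managed through a parameter group rather than set directly):

```ruby
# Inspect the memory-related settings currently in effect.
rows = ActiveRecord::Base.connection.select_rows(<<~SQL)
  SELECT name, setting, unit
  FROM pg_settings
  WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size')
SQL

rows.each { |name, setting, unit| puts "#{name} = #{setting} #{unit}" }
```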

Let's keep exploring limits and scalability options for read IOPS.

Hitting limits on read operations

Using our IOPS proportions and SQL query mapping from earlier, let's discuss our options if we're reaching the limits.

While scaling up to the 32XL instance is one option, a less costly option would be to add a second 2XL instance. By leveraging PostgreSQL physical replication, we could keep the second 2XL instance in sync and use it for a couple of purposes. Let's look at costs:

  • Second 2XL instance at $650/month, 2x spend, total of $1300/month

One purpose is to provide "high availability", keeping the secondary instance fully in sync to take over the role of primary if needed.

A second purpose is to distribute our read operations. With the new instance, we have a fresh quota available for IOPS. We could run most SELECT queries on the second instance. Queries like viewing products for sale or listing product attributes could run there, as they don't change much, and aren't as sensitive to replication lag.

Application frameworks like Ruby on Rails natively support multiple databases, with writer and reader roles. When read-only HTTP requests are processed, they can automatically use the reader role, helping distribute requests while minimizing manual application code changes.
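
As a sketch of what that can look like in Rails (the database configuration keys and the Product model are hypothetical), writer and reader roles are declared on the abstract base model, and reads that tolerate some replication lag can be routed to the reader:

```ruby
# app/models/application_record.rb
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  # "primary" and "primary_replica" are hypothetical entries in
  # config/database.yml; the replica entry points at the second 2XL instance.
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Route a read that tolerates replication lag to the reader role:
ActiveRecord::Base.connected_to(role: :reading) do
  Product.where(available: true).limit(50).to_a
end
```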

An alternative to framework-level write and read request distribution would be to use an advanced connection proxy like pgcat.

That covers some basics with read operations. What about write operations?

Hitting limits on write operations

Unlike read operations, which can be scaled horizontally with additional instances, write operations cannot be scaled this way. Considering our write SQL statements from earlier (INSERT, UPDATE, and DELETE) and the extra IO each incurs for indexes and WAL, they could make up around half of our total IOPS consumption.

What do we do? One costly option was to scale up to a larger instance like the 32XL one.

Outside of vertically scaling, and without moving to distributed architectures, we've got some options to reduce writes on our instance.

Reducing write operations

Indexes are meant to reduce the latency of read operations, but the trade-off is added latency and IO on writes. Any indexes that aren't needed for our SELECT queries will show no scans, and should be removed to reduce write IOPS. We can determine whether our indexes are being scanned using Postgres's internal statistics.
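
One way to find those is Postgres's statistics views. A minimal sketch using pg_stat_user_indexes, assuming an ActiveRecord connection (statistics accumulate since the last reset, so check over a representative period, and exclude indexes that back unique constraints before dropping anything):

```ruby
# Indexes with zero scans since stats were last reset are removal candidates.
unused = ActiveRecord::Base.connection.select_rows(<<~SQL)
  SELECT relname AS table_name, indexrelname AS index_name
  FROM pg_stat_user_indexes
  WHERE idx_scan = 0
  ORDER BY relname, indexrelname
SQL

unused.each { |table, index| puts "#{table}.#{index} has never been scanned" }
```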

Next, we can look at our INSERT operations and whether they can be made more efficient by being combined into fewer statements. Frameworks like Ruby on Rails support bulk inserts and updates, and we can leverage those to reduce the number of operations, with the trade-off that each operation is larger.
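
For example, Rails' insert_all issues a single multi-row INSERT statement; the model, columns, and unique index below are hypothetical:

```ruby
# One multi-row INSERT statement instead of many single-row INSERTs.
Product.insert_all([
  { name: "Widget", sku: "W-100", price_cents: 1299 },
  { name: "Gadget", sku: "G-200", price_cents: 2499 },
  { name: "Gizmo",  sku: "Z-300", price_cents: 999 }
])

# upsert_all batches inserts and updates together when rows may already exist
# (assumes a unique index on sku).
Product.upsert_all(
  [{ name: "Widget", sku: "W-100", price_cents: 1399 }],
  unique_by: :sku
)
```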

Finally, on the DBA side, we can study our write-ahead logging (WAL) configuration and look for opportunities to reduce it. We can skip WAL entirely for individual tables by marking them UNLOGGED, and we can set wal_level to minimal when we're not using replication, which reduces IOPS.
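
As a sketch of the UNLOGGED option, a table can be switched with plain SQL, for example from a Rails migration; the table name here is hypothetical, and the data-loss risk applies:

```ruby
class MarkImportStagingUnlogged < ActiveRecord::Migration[7.1]
  def up
    # Skip WAL for this table: faster writes and fewer IO operations,
    # but the data is not crash-safe and is not replicated to standbys.
    execute "ALTER TABLE import_staging SET UNLOGGED"
  end

  def down
    execute "ALTER TABLE import_staging SET LOGGED"
  end
end
```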

To recap:

  • Remove unused indexes.
  • Batch up insert and update operations.
  • Consider UNLOGGED tables and reduced WAL configuration when appropriate (Risk: data loss)

That wraps up our coverage of OLTP uses. However, Postgres can be used for additional purposes. When might those break?

Beyond OLTP: Near-zero RTO and specialized usages

There are uses of Postgres beyond OLTP. Let's step outside our ecommerce example for a moment.

Some applications ingest data intensively and frequently, such as sensor, event, or metrics data, and then need to report analytics in real time. These workloads might be classified as time-series data or continuous aggregation, both of which are challenging for Postgres when working with millions or billions of data points. Specialized extensions like Timescale or Citus address these challenges.

Another usage is searching through large volumes of text in an unstructured way, commonly categorized as full-text search (FTS). While PostgreSQL has FTS capabilities built in, at very high volumes of data and search queries, Postgres may struggle to offer acceptable performance. Again, the extensibility of Postgres is an asset here, with various open source options, as well as alternative data stores.

Finally, let's briefly cover how Postgres might be used at very large organizations, with extremely demanding uptime requirements. Any amount of downtime is very costly and not acceptable.

An industry term for recovery time is the recovery time objective (RTO), and large financial institutions might have a near-zero RTO. Single-instance Postgres, even in a high-availability configuration, may not offer the RTO needed, in which case a distributed architecture that continually replicates all content to all nodes may be required.

Wrapping Up

In this post, we worked with a hypothetical ecommerce web application that uses Postgres as its application database, and we looked at when it might reach its limits and what we'd do about it.

To estimate that, we considered concurrent user activity against the resources available on an AWS RDS 2XL instance, estimated the resource consumption of our current load, and projected what a 10x growth in load might look like.

Besides the ecommerce application, we discussed some additional more specialized use cases, exploring when single node Postgres may no longer be acceptable.

As we saw, vertical scaling can bring in a very large amount of server resources, albeit at a much higher cost. Against that much higher price point, the business value of performance optimization becomes clear: eliminating and optimizing IO operations for an application can save a lot of money on database spend.


If you're facing performance issues with your Postgres database and want to build a more scalable, efficient solution, here are a few ways you can learn more:

  • Book a call to learn how we can best help you unlock Postgres performance & scalability
  • Engage our team in an extensive Performance and Scalability review of your database and receive a customized optimization strategy
