<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Tech Unpack: Software]]></title><description><![CDATA[Programming, system design, and software engineering — broken down so anyone curious can follow along.]]></description><link>https://technunpack.substack.com/s/software</link><image><url>https://substackcdn.com/image/fetch/$s_!Cmun!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db6a7f8-f869-4fd3-99d2-93c7e3134eea_800x800.png</url><title>Tech Unpack: Software</title><link>https://technunpack.substack.com/s/software</link></image><generator>Substack</generator><lastBuildDate>Thu, 07 May 2026 10:24:09 GMT</lastBuildDate><atom:link href="https://technunpack.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Jack Do]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[technunpack@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[technunpack@substack.com]]></itunes:email><itunes:name><![CDATA[Jack Do]]></itunes:name></itunes:owner><itunes:author><![CDATA[Jack Do]]></itunes:author><googleplay:owner><![CDATA[technunpack@substack.com]]></googleplay:owner><googleplay:email><![CDATA[technunpack@substack.com]]></googleplay:email><googleplay:author><![CDATA[Jack Do]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Why Your System Crashes Under Load — and How Kafka and SQS Push Back]]></title><description><![CDATA[Part 3 of "Building Software That Doesn't Fall Over"]]></description><link>https://technunpack.substack.com/p/why-your-system-crashes-under-load</link><guid isPermaLink="false">https://technunpack.substack.com/p/why-your-system-crashes-under-load</guid><dc:creator><![CDATA[Jack Do]]></dc:creator><pubDate>Wed, 06 May 2026 12:09:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/763369a0-82ad-4bfa-85e4-eb4f331e1301_1356x746.png" length="0" type="image/png"/><content:encoded><![CDATA[<h3>Intro</h3><p>Welcome back. In <a href="https://technunpack.substack.com/p/the-shared-state-problem-how-real">Part 2</a> we looked at what happens when multiple workers fight over the same data &#8212; race conditions, stale reads, and why a single Redis thread is actually a feature. Today we move one layer up the stack.</p><p>The failure mode this time isn&#8217;t a data conflict. It&#8217;s volume. Specifically: what happens when data arrives faster than you can process it, and nothing in your system knows how to push back.</p><p>That&#8217;s backpressure &#8212; or rather, the lack of it.</p><h3>Backpressure in plain terms</h3><p>Every data pipeline has two roles: a <strong>producer</strong> generating work, and a <strong>consumer</strong> processing it. When the producer is faster than the consumer, the difference has to go somewhere &#8212; usually a buffer or queue.</p><p>Backpressure is the mechanism that lets the consumer tell the producer &#8220;slow down, I&#8217;m full.&#8221; Without it, the buffer just grows until something crashes.</p>
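<p>To see the mechanism in miniature, here&#8217;s a toy bounded buffer (an illustrative sketch, not any real library): the producer <code>await</code>s <code>push()</code>, so once the buffer is full it is forced down to the consumer&#8217;s pace.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Toy sketch &#8212; a bounded buffer that makes the producer wait
class BoundedBuffer&lt;T&gt; {
  private items: T[] = []
  private waiters: Array&lt;() =&gt; void&gt; = []

  constructor(private capacity: number) {}

  // Producer side: blocks (asynchronously) while the buffer is full
  async push(item: T): Promise&lt;void&gt; {
    while (this.items.length &gt;= this.capacity) {
      await new Promise&lt;void&gt;((resolve) =&gt; this.waiters.push(resolve))
    }
    this.items.push(item)
  }

  // Consumer side: taking an item frees a slot and wakes one producer
  take(): T | undefined {
    const item = this.items.shift()
    this.waiters.shift()?.()
    return item
  }
}</code></pre></div>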
<h3>Three problems, three different solutions</h3><p>In the real world, backpressure shows up in different shapes. In this article I&#8217;ll go through three examples with three different solutions; hopefully you&#8217;ll find a real use case in at least one of them.</p><h4>Problem 1: Streaming large data through a single process</h4><p><strong>Scenario: </strong> <em>Let&#8217;s imagine a user requests a CSV export of their entire transaction history &#8212; 2 million rows. Your service streams rows from Postgres straight into the HTTP response. In testing with 100 rows, fine. In production, Postgres sends rows faster than the HTTP socket can flush them to the client. Node.js buffers the difference in memory. Within 30 seconds, that single request has consumed 800MB of heap. A few concurrent exports later, the process is killed by the OS.</em></p><p>This is a single-process problem: all the data flows through one Node.js process. The only way out is to slow down the source until the destination catches up.</p><p><strong>The solution: Node.js streams with </strong><code>highWaterMark</code><strong> and </strong><code>.pipe()</code><strong>.</strong></p><p>Every readable and writable stream in Node has a property called <strong>highWaterMark</strong> &#8212; the maximum amount the internal buffer holds before signalling it&#8217;s full. The default is 16KB for byte streams and 16 objects for object-mode streams.</p><p>Here&#8217;s how the signal works.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:&quot;5599632f-7715-4f36-accf-3b372e4469a2&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">// The easy way &#8212; .pipe() handles backpressure for you
postgresStream.pipe(httpResponse)

// What .pipe() actually does under the hood:
postgresStream.on('data', (chunk) =&gt; {
  // write() returns false when buffer exceeds highWaterMark
  const canContinue = httpResponse.write(chunk)
  
  if (!canContinue) {
    // Buffer is full &#8212; stop reading from Postgres
    postgresStream.pause()
  }
})

// When the buffer drains below highWaterMark, resume reading
httpResponse.on('drain', () =&gt; {
  postgresStream.resume()
})</code></pre></div><p><strong>Rule of thumb</strong>: always use <code>.pipe()</code> instead of hand-rolling the <code>write()</code>/<code>drain</code> loop, and let it handle backpressure for you. (In modern Node, <code>stream.pipeline()</code> is even better: it does the same job and also propagates errors and cleans up both streams.)</p><h4>Problem 2: High-throughput event pipeline between services</h4><p><strong>Scenario:</strong> <em>Your app emits thousands of events per second &#8212; user clicks, payment attempts, feature usage. A downstream analytics service consumes these to update dashboards and write audit trails. During peak hours, events arrive 10x faster than the consumer can process. If nothing slows the producers down, events pile up and crash the consumer, or get dropped on the floor.</em></p><p><strong>The solution:</strong><em> </em>Kafka has a very elegant way to resolve this.</p><p><strong>Quick intro</strong>: <a href="https://www.confluent.io/what-is-apache-kafka/#:~:text=Apache%20Kafka%20is%20an%20open,pipelines%20and%20event%2Ddriven%20applications.">Apache Kafka</a> (or Kafka for short) is an open-source platform, written in Java and Scala, optimized for ingesting and processing high-volume streaming data.</p><p><strong>The key idea</strong>: Kafka stores events in an ordered log called a <strong>topic</strong>. Producers append to the end. Consumers read from wherever they left off, at their own pace. The producer never waits for the consumer &#8212; the log absorbs the difference.</p><p>Each consumer tracks its position with an <strong>offset</strong> &#8212; a number pointing to the last event it processed. The gap between the latest event and the consumer&#8217;s offset is called <strong>consumer lag</strong> &#8212; the early warning signal that the consumer is falling behind.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:&quot;9446a54c-6b7b-42d0-9411-c394f6481291&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Fetch a batch of events to process
const analyticsConsumer = kafka.consumer({ groupId: 'analytics' });
await analyticsConsumer.connect()
await analyticsConsumer.subscribe({ topics: ['user-events'] })

await analyticsConsumer.run({
  // max.poll.records (Java-client config): how many events per fetch
  // (default 500). Lower it if processing is slow.
  eachBatch: async ({ batch }) =&gt; {

    if (downstreamIsOverwhelmed()) {
      // Tell Kafka to stop sending us events.
      // Events stay safe in the log &#8212; nothing is lost.
      analyticsConsumer.pause([{ topic: 'user-events' }])

      // Resume after a cooldown
      setTimeout(() =&gt; analyticsConsumer.resume([{ topic: 'user-events' }]), 30_000)
      return
    }

    await processBatch(batch)
    // With the default eachBatchAutoResolve, the new offset is
    // committed for us &#8212; we've processed up to here
  },
})

const auditConsumer = kafka.consumer({ groupId: 'audit' });
// ....</code></pre></div><p><strong>Key gotcha</strong>: if your processing exceeds <code>max.poll.interval.ms</code> (default 5 minutes), Kafka assumes your consumer is dead and reassigns its work to another consumer. Same events get re-processed when you rejoin. For slow consumers, drop <code>max.poll.records</code> to 10&#8211;50.</p><h4>Problem 3: Massive parallel async tasks</h4><p><strong>Scenario.</strong> <em>You run a serverless image processing pipeline. Users upload images via API, the API drops a message into an SQS queue, and Lambda functions process each image &#8212; resize, watermark, store. Normal traffic is 50 messages/min. A marketing campaign goes live &#8212; suddenly it's 50,000 messages/min. Lambda scales fast and spawns 800 concurrent executions. Each one opens a Postgres connection. Your database has a 100-connection pool. 700 connections get refused, Lambda retries, SQS re-delivers, and you now have a thundering herd pointed straight at your database.</em></p><p>Why doesn't Kafka solve this? Kafka is great for streaming events, but each message here is a <em>task</em> &#8212; one image to process, then delete. You don't need replay, you don't need multiple consumers reading the same data. You need a queue where each task is picked up once, plus an autoscaling consumer that doesn't bury your database.</p><p><strong>The solution</strong>: SQS visibility timeout + Lambda reserved concurrency.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:&quot;d947732e-1685-4cca-854b-5e0e179257d7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">const imageJobs = new Queue(this, 'ImageJobs', {
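  // How long a received message stays hidden from other consumers;
  // if the Lambda fails, the message becomes visible again for a retry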
  visibilityTimeout: Duration.seconds(60),
})

const imageProcessor = new Function(this, 'ImageProcessor', {
  runtime: Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: Code.fromAsset('lambda'),
  timeout: Duration.seconds(30),

  // Cap parallel executions &#8212; the actual backpressure mechanism.
  // No matter how many messages pile up in SQS, only 50 Lambdas
  // run at once. Messages wait in the queue for a free slot.
  reservedConcurrentExecutions: 50,
})

imageProcessor.addEventSource(new SqsEventSource(imageJobs, {
  batchSize: 10,
  maxConcurrency: 50,
}))</code></pre></div><p>Without <code>reservedConcurrentExecutions</code>, Lambda will spawn hundreds of parallel functions during a spike &#8212; each opening a database connection, each adding load. The cap is what turns Lambda from "fan out infinitely" into a controlled consumer.</p><h4><strong>Overlaps between AWS SQS/SNS and Apache Kafka</strong></h4><p>From the simple example above, one might wonder whether SNS/SQS (a topic fan-out design) and Kafka are interchangeable. In reality, there are big differences between the two. Here is a quick overview of both.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!rz1m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbef33f83-711e-44b5-be33-1451905fd84d_2336x834.png" width="1456" height="520" alt="" loading="lazy"></figure></div>
<h3>Conclusion</h3><p>If you&#8217;ve made it this far, congrats: it has been a long article, briefly walking through two very popular technologies for designing distributed systems. That&#8217;s all for now; see you in the next part.</p><p><em>Next up &#8212; Part 4: Stateless Scaling. How load balancers decide where your request goes, why sticky sessions are a trap, and what "stateless" actually requires from your application layer.</em></p>]]></content:encoded></item><item><title><![CDATA[The Shared State Problem — How Real-World Systems Actually Solve It]]></title><description><![CDATA[Part 2 of "Building Software That Doesn't Fall Over"]]></description><link>https://technunpack.substack.com/p/the-shared-state-problem-how-real</link><guid isPermaLink="false">https://technunpack.substack.com/p/the-shared-state-problem-how-real</guid><dc:creator><![CDATA[Jack Do]]></dc:creator><pubDate>Sun, 03 May 2026 06:47:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b67a7c67-8da0-479e-a12c-eb137b49b3ce_1200x630.png" length="0" type="image/png"/><content:encoded><![CDATA[<h3>Intro</h3><p>Welcome back to my deep-dive system design series. If you read <a href="https://technunpack.substack.com/p/concurrency-vs-parallelism-vs-async">part 1</a> and found it useful, this article is the next step, answering some of the questions you may have been left with. In the previous part, we talked about concurrency, parallelism, and async.</p><p>Here's the thing nobody tells you after you understand async: <em>running things concurrently creates a new class of bug</em>. Two requests touch the same balance. A counter drifts. 
A user gets charged twice. The fix sounds obvious &#8212; "just add a lock" &#8212; until you realize that's the beginning of the problem, not the end.</p><p>In this article I&#8217;ll dig into the implementations of two famous technologies, <strong><a href="https://www.postgresql.org/">Postgres</a></strong> and <strong><a href="https://redis.io/">Redis</a></strong>, so we can understand how each of them tackles the shared-state problem. The third example comes from <strong><a href="https://stripe.com/au">Stripe</a></strong>, a well-known payment processor, where we&#8217;ll see how &#8216;shared state issues&#8217; get addressed in a distributed system.</p><h3>The technical detour</h3><p>Before we dive into practical implementations, there are a few common terms we should get our heads around.</p><p><em><strong>Race condition</strong></em> &#8212; what happens when the correctness of your program depends on the <em>order</em> in which concurrent operations run &#8212; and that order isn&#8217;t guaranteed. It&#8217;s not that the logic is wrong; it&#8217;s that the ordering isn&#8217;t what you assumed.</p><p><em><strong>Mutex</strong></em> &#8212; (mutual exclusion lock) a flag that says &#8220;only one thing at a time can be in this section of code.&#8221; Threads or async tasks line up, take turns, release. Some good examples here are <code>threading.Lock()</code> in Python, or <code>sync.Mutex</code> in Go.</p><p><em><strong>Atomic operation</strong></em> &#8212; a single instruction the CPU promises to complete without interruption. &#8220;Add 1 to this number&#8221; sounds like one step but is actually three (read, add, write). Atomics make it genuinely one step, so no other thread can sneak in between. Atomicity is also a fundamental part of the ACID transaction properties that lead engineers to choose an RDBMS when prioritizing consistency over availability.</p><p><em><strong>Optimistic concurrency</strong></em> &#8212; flips the whole model. Instead of locking before you read, you read freely, do your work, and at write time check &#8220;is the data still what I saw?&#8221; If yes, commit. If no, retry. No queueing, no waiting &#8212; but you pay for it in retries when contention is high.</p>
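<p>Here&#8217;s what that looks like in practice. A minimal sketch with node-postgres, assuming a hypothetical <code>accounts</code> table that carries a <code>version</code> column: read without locking, then make the write conditional on the version you read.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">import { Pool } from 'pg'

const db = new Pool()

// Sketch only: table name and retry budget are illustrative
async function withdraw(accountId: number, amount: number) {
  for (let attempt = 0; attempt &lt; 3; attempt++) {
    // Read freely &#8212; no lock taken
    const { rows } = await db.query(
      'SELECT balance, version FROM accounts WHERE id = $1',
      [accountId],
    )

    // At write time: "is the data still what I saw?"
    const result = await db.query(
      `UPDATE accounts
          SET balance = $1, version = version + 1
        WHERE id = $2 AND version = $3`,
      [rows[0].balance - amount, accountId, rows[0].version],
    )

    if (result.rowCount === 1) return // version matched, our write wins
    // Someone committed first: loop and retry with fresh data
  }
  throw new Error('gave up after repeated conflicts')
}</code></pre></div>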
<h3>How the smart systems actually handle it</h3><h4><em>Postgres &#8212; versioning instead of locking</em></h4><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!s2Dx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64c6bc70-804c-45a9-9091-9fe7a34377e6_1326x1088.png" width="1326" height="1088" alt="" loading="lazy"></figure></div>
<p>When txn 101 updates <code>acc_42</code>, Postgres doesn't overwrite the row. It writes a <em>new version</em> of it &#8212; tagged with the transaction ID that created it (<code>xmin=101</code>). The old row stays on disk, tagged as visible to transactions that started before 101. That's why txn 100 &#8212; a long-running analytics query &#8212; still reads <code>$1,000</code> while txn 101 is writing <code>$300</code>. They're reading <em>different physical rows</em>, both legitimate, neither blocking the other. The cost is storage: old row versions accumulate until vacuum reclaims them, which is why long-running transactions are dangerous &#8212; they pin old versions that vacuum can't touch. <em>Deep dive on MVCC, vacuum, and the connection pool problem in Part 5.</em></p>
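<p>A minimal sketch of that dance with node-postgres (the <code>accounts</code> table is hypothetical): the reader pins its snapshot first, the writer commits a new row version, and the reader still sees the old value without ever blocking the writer.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">import { Pool } from 'pg'

const pool = new Pool()

async function demo() {
  const reader = await pool.connect() // plays txn 100
  const writer = await pool.connect() // plays txn 101

  // REPEATABLE READ pins the reader's snapshot at its first query
  await reader.query('BEGIN ISOLATION LEVEL REPEATABLE READ')
  await reader.query('SELECT balance FROM accounts WHERE id = 42')

  // The writer creates a new row version and commits (auto-commit)
  await writer.query('UPDATE accounts SET balance = 300 WHERE id = 42')

  // The reader still sees the old version: no blocking, no waiting
  const { rows } = await reader.query('SELECT balance FROM accounts WHERE id = 42')
  console.log(rows[0].balance) // the pre-update balance

  await reader.query('COMMIT')
  reader.release()
  writer.release()
}</code></pre></div>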
<h4><em>Redis &#8212; sidestep the problem entirely</em></h4><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!mCQa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d822541-2c16-410d-ad0f-158b6180527b_1368x1028.png" width="1368" height="1028" alt="" loading="lazy"></figure></div>
<p>Redis doesn't solve concurrent shared state &#8212; it makes it structurally impossible. Every command from every client lands in a single queue. The event loop pops one command, runs it to completion, moves to the next. <code>INCR counter</code> is a read, increment, and write &#8212; but because nothing else can run between those steps, it's <em>atomic by design</em>, not by locking. The cost is that one slow command (<code>KEYS *</code>, a blocking <code>LRANGE</code> on a 10M-item list) stalls every other client. Redis is fast precisely because its operations are tiny. If you're running big operations, you're using it wrong. <em>Cache stampedes and why &#8220;just put Redis in front of it&#8221; has its own failure modes: Part 6.</em></p>
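<p>You can see the difference from the client side. A minimal sketch with the <code>ioredis</code> client (the key name is made up): the first version does read-modify-write in application code and can lose updates; the second lets Redis run all three steps inside its single-threaded loop.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">import Redis from 'ioredis'

const redis = new Redis()

// Racy: two workers can both read 5 and both write 6, losing an increment
async function incrementRacy() {
  const current = Number(await redis.get('page:views')) || 0
  await redis.set('page:views', current + 1)
}

// Atomic: the read-increment-write happens inside Redis's event loop,
// so no other command can interleave
async function incrementAtomic() {
  return redis.incr('page:views')
}</code></pre></div>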
<h4><em>Stripe &#8212; idempotency keys for distributed shared state</em></h4><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ghir!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe886eda-1c25-46f2-9f8e-b11d776c79fd_1440x1130.png" width="1440" height="1130" alt="" loading="lazy"></figure></div>
<p>Once state lives across machines, in-process locks and MVCC stop helping &#8212; a network retry doesn't care what's in your mutex. Stripe's answer is a lookup table. Before executing anything, the server checks: have I seen this key before? First time &#8212; insert a row, run the charge, store the result, mark complete. Duplicate &#8212; return the stored result, never touch the payment processor. The elegance is in the <code>processing</code> state: if the first request is still in-flight when the retry arrives, the server returns a <code>409 Conflict</code> and the client backs off. The distributed coordination problem reduces to a single database row. <em>Distributed locks, Redlock, and the CAP theorem trade-offs in a later post.</em></p>
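<p>Server-side, the whole trick fits in a handful of lines. A minimal sketch with node-postgres &#8212; the <code>idempotency_keys</code> table and the <code>chargeCard</code> call are hypothetical stand-ins, not Stripe&#8217;s actual implementation:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">import { Pool } from 'pg'

const db = new Pool()

// Stand-in for the real call to the payment processor
async function chargeCard(amount: number) {
  return { charged: amount }
}

async function charge(key: string, amount: number) {
  // First arrival wins the insert; a retry hits the conflict and inserts nothing
  const inserted = await db.query(
    `INSERT INTO idempotency_keys (key, state)
     VALUES ($1, 'processing')
     ON CONFLICT (key) DO NOTHING`,
    [key],
  )

  if (inserted.rowCount === 0) {
    const { rows } = await db.query(
      'SELECT state, result FROM idempotency_keys WHERE key = $1',
      [key],
    )
    if (rows[0].state === 'processing') return { status: 409 } // still in flight, back off
    return { status: 200, body: rows[0].result } // replay the stored result
  }

  const result = await chargeCard(amount) // runs at most once per key
  await db.query(
    `UPDATE idempotency_keys SET state = 'complete', result = $1 WHERE key = $2`,
    [JSON.stringify(result), key],
  )
  return { status: 200, body: result }
}</code></pre></div>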
<h3>I know the concepts, what now?</h3><p>The following are some common mistakes that I&#8217;ve seen even seasoned engineers make.</p><h4>Mistake 1: Wrapping everything in a giant lock</h4><p>A team hits a race condition, panics, and wraps the entire request handler in a global mutex. Bug fixed. Throughput drops 90%. Lock the smallest unit you can &#8212; a row, a key, a single object. Better yet, ask whether the problem can move to the database where MVCC handles it for you.</p><p><strong>Principle.</strong> <em>Locks are a tax on concurrency.</em> The bigger the lock, the higher the tax. Correct-but-slow under load eventually becomes just wrong, as timeouts cascade into retries into outages.</p><h4>Mistake 2: Forgetting state lives outside your process too</h4><p>A service has perfect in-memory locking. It scales to two replicas. The locks now do nothing &#8212; they only protect within one process. For state shared across machines, you need a coordination layer: a database transaction, a Redis lock, an idempotency key, or a queue with single-consumer semantics.</p><p><strong>Principle.</strong> <em>In-process locks don&#8217;t survive horizontal scaling.</em> The day you add a second replica is the day your single-process locking assumptions quietly break.</p><h4>Decision tree</h4><p>Before reaching for a lock, walk this:</p><ol><li><p><strong>Is the shared state actually shared?</strong> Can each request own its own copy? If yes &#8212; do that, you&#8217;re done.</p></li><li><p><strong>Is the operation a single read-modify-write on one value?</strong> Use an atomic. Done.</p></li><li><p><strong>Are conflicts rare?</strong> Optimistic concurrency (version check, compare-and-swap) &#8212; fast in the happy path.</p></li><li><p><strong>Are conflicts common, but the critical section small?</strong> A scoped mutex on the smallest unit you can.</p></li><li><p><strong>Does the state live across machines?</strong> You&#8217;re not in lock territory anymore &#8212; you&#8217;re in distributed coordination territory. Reach for a database transaction, a queue, or an idempotency key.</p></li></ol><h3><strong>Conclusion</strong></h3><p>And that&#8217;s all I have for you today. Once you&#8217;ve got concurrency working and shared state under control, the next thing to break you is <strong>load itself</strong>. What happens when traffic exceeds capacity? Do requests queue forever? Do you drop them? Do you make the upstream slow down?</p><p>Next post: <strong><a href="https://technunpack.substack.com/p/why-your-system-crashes-under-load">Why Your System Crashes Under Load &#8212; and How Kafka and SQS Push Back</a></strong></p>]]></content:encoded></item><item><title><![CDATA[Concurrency vs Parallelism — Stop Mixing Them Up]]></title><description><![CDATA[Part 1 of "Building Software That Doesn't Fall Over"]]></description><link>https://technunpack.substack.com/p/concurrency-vs-parallelism-vs-async</link><guid isPermaLink="false">https://technunpack.substack.com/p/concurrency-vs-parallelism-vs-async</guid><dc:creator><![CDATA[Jack Do]]></dc:creator><pubDate>Sat, 02 May 2026 01:26:28 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/55f7e0c3-5446-471f-9177-a0824786a241_1200x630.png" length="0" type="image/png"/><content:encoded><![CDATA[<h3>Intro</h3><p>Welcome back to <strong>Tech Unpack</strong>. This is the first deep dive in a seven-part series called <strong>&#8220;Building Software That Doesn&#8217;t Fall Over&#8221;</strong> &#8212; a tour through the engineering decisions that separate systems surviving real production load from systems that buckle the moment traffic spikes.</p><p>A production-grade service has to solve a whole stack of problems: handling huge volumes of concurrent requests, sharing state safely between them, absorbing traffic spikes without crashing, scaling out across machines, surviving database bottlenecks, caching without going stale, and staying observable when things break. 
Each of these gets its own post in this series.</p><p>Today we&#8217;re starting with the foundation everything else sits on: <strong>how do you serve thousands of concurrent requests on a single machine without it grinding to a halt?</strong></p><p>The answer comes down to picking the right tool for the workload &#8212; and that means getting clear on three concepts that get tossed around interchangeably but actually mean very different things.</p><h3>The three words, explained simply</h3><p>Picture a typical workday at the office.</p><ul><li><p><strong>Concurrency</strong> is <em>managing multiple tasks at once</em>. You&#8217;re on a Zoom call, but while someone else is talking, you reply to a Slack message. You&#8217;re not doing two things at the exact same instant &#8212; you&#8217;re juggling them. One person, multiple things in flight.</p></li><li><p><strong>Parallelism</strong> is <em>actually doing multiple tasks at the same instant</em>. You and a colleague each handle your own meeting, at the same time. Two people, two meetings, real wall-clock speedup.</p></li><li><p><strong>Async</strong> is <em>a way to handle concurrent work without blocking</em>. Instead of sitting there refreshing your inbox waiting for a reply, you send the email and move on. When the reply lands, your laptop pings. It&#8217;s a way of working, not a separate concept.</p></li></ul><p><em>The trap most devs fall into: <strong>async code is concurrent, but not parallel.</strong> A single Node.js event loop juggles thousands of async tasks &#8212; but only one runs at a time. If that one task hogs the CPU, everything else waits.</em></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!f59N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1544d21a-af11-4838-bff7-8ba69bac3463_1272x826.png" width="1272" height="826" alt="" loading="lazy"></figure></div>
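<p>The trap is easy to reproduce. A minimal sketch with plain Node (the <code>/slow</code> route is made up): while the busy-wait runs, the event loop can&#8217;t serve anyone else.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">import { createServer } from 'node:http'

// A CPU hog: while this loop spins, the single-threaded event loop
// cannot run anything else, async or not
function hogCpu(ms: number) {
  const end = Date.now() + ms
  while (Date.now() &lt; end) { /* busy-wait */ }
}

createServer((req, res) =&gt; {
  if (req.url === '/slow') hogCpu(800) // stalls EVERY request for 800ms
  res.end('ok\n')
}).listen(3000)</code></pre></div>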
<h3>Concepts in action</h3><p>The best way to understand and apply these concepts is to see them in a demo. But before we get to the fun part, there are two more concepts to get our heads around 
(yes, explaining something by defining something else &#128557;).</p><p><strong>I/O-bound work</strong> is anything where your code is <em>waiting</em> on something external &#8212; a network response, a database query, a file read, an API call. The CPU sits idle during the wait. If you timed the work, most of the milliseconds would be &#8220;doing nothing, waiting for someone else.&#8221;</p><p><strong>CPU-bound work</strong> is anything where your code is <em>actively computing</em> &#8212; hashing, parsing, image resizing, regex matching, JSON-serializing a huge object. The CPU is pegged the whole time. There&#8217;s no waiting; there&#8217;s just work.</p><p>A simple test: <strong>if a faster network would speed up the task, it&#8217;s I/O-bound. If a faster CPU would speed it up, it&#8217;s CPU-bound.</strong></p><h4>Demo: same task, three approaches</h4><p>I built a small TypeScript demo that runs two tasks &#8212; one I/O-bound, one CPU-bound &#8212; three different ways: <strong>sequential</strong>, <strong>async-concurrent (</strong><code>Promise.all</code><strong>)</strong>, and <strong>parallel (worker threads)</strong>. The full code is on GitHub if you want to clone and run it yourself:</p><p>&#128073; <strong><a href="https://github.com/jackdo68/concurrency-demo">github.com/jackdo68/concurrency-demo</a></strong></p><p>There are two files in the code base (as you can probably guess, one for I/O tasks and one for CPU-intensive tasks): <code>io-demo.ts</code> and <code>cpu-demo.ts</code>.</p><p>In the I/O demo, I run real HTTP requests against a server in two manners, <em>sequential</em> and <em>concurrent</em>. As you can probably tell, the concurrent version finishes first: while waiting for one response, the machine can start the job of another. The log below shows the results in milliseconds:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">&gt; concurrency-demo@1.0.0 io
&gt; tsx io-demo.ts

Sequential: 15162ms
Concurrent: 3172ms</code></pre></div><p>The CPU demo runs multiple CPU-intensive tasks in three manners: <em>sequential</em>, <em>Promise.all/async</em>, and <em>parallel</em> (spawning worker threads). Since the CPU has to actually do the work here, the sequential and async manners have roughly the same latency (in fact async adds some overhead and takes slightly longer), while <em>parallel</em> really shines (with the extra help of the worker threads, of course):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">&gt; concurrency-demo@1.0.0 cpu
&gt; tsx cpu-demo.ts

Sequential: 4300ms
Promise.all: 4325ms
Parallel: 2236ms</code></pre></div><h3>I know the concepts, what now?</h3><p>Here&#8217;s where the three concepts become a toolkit. Most production outages I&#8217;ve seen come from one of three mistakes &#8212; and each one maps to a misuse of these concepts.</p><h4>Mistake 1: Treating I/O work like CPU work</h4><p><strong>Example.</strong> A service makes 50 calls to downstream APIs, one after another. Each takes 200ms. Total: 10 seconds. Under load, requests pile up, the server hits its connection limit, and everything starts 504-ing.</p><p><strong>The fix:</strong> async concurrency. Fire the calls together with <code>Promise.all</code> (or <code>asyncio.gather</code>, or a Go <code>errgroup</code>) and total time drops to ~200ms. Same server, same hardware, 50x throughput on that endpoint &#8212; for free.</p><p><strong>Resilience principle:</strong> <em>Never wait sequentially for things that don&#8217;t depend on each other.</em> This is the cheapest performance win in software, and the easiest one to miss. Companies like Netflix, Uber, and PayPal lean heavily on Node for I/O-heavy services &#8212; API gateways, real-time feeds, backends-for-frontends &#8212; where holding thousands of open connections cheaply matters more than raw compute. <a href="https://nodejs.org/learn/asynchronous-work/dont-block-the-event-loop">Node JS - Event loop</a></p><h4>Mistake 2: Treating CPU work like I/O work</h4><p><strong>Example.</strong> A team adds image resizing to their Node API. They wrap it in <code>async/await</code> and assume the event loop will handle it. Under load, every request waits for the resize to finish &#8212; because the event loop is <em>one thread</em>. P99 latency goes through the roof. Health checks start failing because the loop can&#8217;t even respond to <code>/health</code> for 800ms at a time. The load balancer marks the instance unhealthy, traffic shifts to the survivors, and they fall over too.</p><p><strong>The fix:</strong> offload CPU work to a worker pool so the event loop stays free. Or better, push it to a separate service entirely (a queue plus dedicated workers).</p><p><strong>Resilience principle:</strong> <em>CPU work needs its own lane.</em> If a single slow operation can block your health check, your load balancer will eventually kill the instance and your traffic will cascade onto whoever&#8217;s left. Instagram runs a Django monolith on Python, which has the Global Interpreter Lock (GIL) &#8212; only one thread executes Python bytecode at a time &#8212; so they sidestep it with a multiprocess model behind Gunicorn, running many worker processes per machine. <a href="https://instagram-engineering.com/what-powers-instagram-hundreds-of-instances-dozens-of-technologies-adf2e22da2ad">Instagram - Engineering</a></p><h4>Mistake 3: Unbounded concurrency</h4><p><strong>Example.</strong> The opposite of mistake 1. A dev hears &#8220;use <code>Promise.all</code>&#8221; and fires 10,000 DB queries at once. The connection pool is exhausted in milliseconds. The DB starts queuing, then timing out. Suddenly <em>every</em> service sharing that DB is degraded &#8212; including the ones that had nothing to do with this request.</p><p><strong>The fix:</strong> bounded concurrency. Use a semaphore or a library like <code>p-limit</code> to cap how many things run at once. Twenty in flight at a time is usually plenty.</p>
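<p>A minimal sketch of that budget with <code>p-limit</code> and node-postgres (pool size and query are illustrative):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">import pLimit from 'p-limit'
import { Pool } from 'pg'

const db = new Pool({ max: 100 }) // the shared 100-connection pool
const limit = pLimit(20)          // our budget: at most 20 queries in flight

async function loadUsers(userIds: number[]) {
  // Without limit(), 10,000 ids would mean 10,000 simultaneous queries.
  // With it, calls queue client-side and trickle through 20 at a time.
  return Promise.all(
    userIds.map((id) =&gt;
      limit(() =&gt; db.query('SELECT * FROM users WHERE id = $1', [id])),
    ),
  )
}</code></pre></div>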
<p><strong>Resilience principle:</strong> <em>Concurrency is a resource &#8212; budget it.</em> &#8220;As fast as possible&#8221; usually means &#8220;fast enough to break something downstream.&#8221; We&#8217;ll go much deeper on this in Part 3 (backpressure).</p><h3>Conclusion</h3><p>Thanks for being patient and reading all the way to this point, but this is only the beginning. Once you have multiple things running at once &#8212; async or parallel &#8212; they start stepping on each other. Two requests update the same counter. A cache write races a cache read. Welcome to <strong>shared state</strong>, the bug factory at the heart of every concurrent system.</p><p>Next post: race conditions, mutexes, and why &#8220;just add a lock&#8221; is the start of a much longer conversation.</p>]]></content:encoded></item></channel></rss>