Portfolio

Hey!
I’m Tam, a software engineer with over five years of experience, primarily in the AdTech industry, building low-latency, real-time (sub-100ms) systems.

Here are some of the challenges I’m proud to have solved:

Tech Stack

  • Languages: Golang, C++, Rust, Python, Java, Bash
  • Databases & Messaging: Redis, MongoDB, Kafka, Aerospike
  • Cloud & Infra: AWS (S3, SQS), GCP (GCS, BigTable, BigQuery), Kubernetes, Docker, Nginx
  • Observability & Networking: Grafana, Prometheus, TCPDump, Wireshark, HTTP, DNS, Protobuf, WebSocket
  • Domains: Microservices, Real-time Systems
  • Others: Uber H3, Bloom filters, Matrix-based filtering, Golang profiling

Personal Projects

  • Geoplot: A lightweight tool to visualize CSV files containing latitude and longitude data on a world map.
  • Geodata: A tool to generate Uber H3 cell IDs, latitudes, and longitudes for all countries.

Contact

GitHub · LinkedIn

More Threads Can Harm Your Performance

TL;DR: By reducing the number of worker threads and unblocking the main event loop in a C++ Reactor-pattern system, we improved overall performance and downsized our compute nodes from 36 cores to 16. Across a deployment of 50 instances, this shrank our cluster from 1,800 cores down to just 800.

Introduction

In a real-time bidding (RTB) demand-side platform (DSP), things move fast: the server usually must respond within 100ms. If it takes too long, it loses the auction.

I used to work on an RTB DSP system built with RTBKit, a fast C++ bidding engine. RTBKit is powerful, but it has some tricky parts. Once, to handle more traffic, I simply added more worker threads, thinking more CPU cores would mean more speed. The opposite happened: our system became much slower.

This post explains why spinning up too many threads in a Reactor-pattern system can be bad, and how fixing it allowed us to improve performance significantly.

RTBKit Architecture

RTBKit uses a mix of thread pools and an event loop to process requests. The Event Loop sits in the center and routes messages between the different components. Here is a simple view of the system:

                          +------------------+
                          |                  |
   +----------------+     |  Augmenter       |     +----------------+
   |                |     |  Threads         |     |                |
   | Exchange       |     |                  |     | Bidder         |
   | Worker Threads |     +------------------+     | Threads        |
   |                |          ^        |          |                |
   +----------------+          |        |          +----------------+
      ^        |               |        v               ^        |
      |        |          +------------------+          |        |
      |        +--------> |                  | <--------+        |
      |                   |    EVENT LOOP    |                   |
      +------------------ |    (Reactor)     | ------------------+
                          |                  |
                          +------------------+
  • Exchange workers: The main workers. They receive the HTTP bid request, parse the JSON, and match the request against a large number of campaigns. This is very CPU-intensive work.
  • Event Loop: The exchange worker thread finishes and sends a message to a queue. The event loop picks it up and routes it to the augmenter threads.
  • Augmenters: Add extra data to the request. IO-bound work.
  • Bidders: Decide if they want to bid or not. IO-bound work.
  • Send Response: The response routes back through the event loop to the exchange worker threads, which send it back to the ad exchange.

The Reactor Pattern and the Event Loop

In this system, components don’t call each other directly. They talk using queues. To tell another component that a message is ready, they use a single event loop. This is the Reactor Pattern. When a worker pushes a message to the queue, it wakes up the single event loop thread. The event loop reads the message and routes it to the right place.

This is a good design, but there is one weak spot: the single event loop thread. All messages must go through this one thread.
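The pattern itself is easy to sketch. Below is a minimal illustration in Go (not RTBKit code; the names are ours): workers push messages onto a channel, and a single reactor goroutine does nothing but route each message to its target.

```go
package main

import "fmt"

// Message is a toy event; Target names the component it should go to.
type Message struct {
	Target  string
	Payload string
}

// routeAll feeds msgs through a single reactor goroutine and returns
// how many messages each target received. The reactor only routes —
// it never does any real work itself.
func routeAll(msgs []Message) map[string]int {
	in := make(chan Message)
	counts := make(map[string]int)
	done := make(chan struct{})

	// The single event-loop thread of the Reactor pattern.
	go func() {
		for msg := range in {
			counts[msg.Target]++ // stand-in for pushing to that component's queue
		}
		close(done)
	}()

	for _, m := range msgs {
		in <- m // workers wake the reactor by pushing messages
	}
	close(in)
	<-done
	return counts
}

func main() {
	counts := routeAll([]Message{
		{Target: "augmenter", Payload: "req-1"},
		{Target: "bidder", Payload: "req-1"},
		{Target: "augmenter", Payload: "req-2"},
	})
	fmt.Println(counts["augmenter"], counts["bidder"]) // 2 1
}
```

Every message funnels through that one goroutine, which is exactly the weak spot described above.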

The Problem: Adding Too Many Threads

Our traffic was growing and requests started to timeout. Our server had 36 CPU cores per instance, so we thought: “We should add more exchange worker threads to utilize all these cores.” We increased our worker threads to match the available cores. We hoped the system would handle more requests.

Results: CPU usage got higher (which was expected, more cores joined the party), but request timeouts increased.

How could adding more workers make the system slower?

Debugging the Issue

We checked what the CPU was doing (with htop). Because threads can be given names, we could see that the event loop thread was pegged at maximum CPU. We had added too many worker threads, so they were all parsing requests, matching campaigns, and sending messages to the event loop at the same time. The single event loop could not read and route messages fast enough.

How We Fixed It

Fix 1: Remove work from the event loop

The event loop was running hot. In a Reactor pattern, the event loop should only route events, not do any actual work. So we hunted for work hiding inside the event loop and found some “slightly slow” tasks, such as logging and data formatting:

void EventLoop::dispatch(Message msg) {
    std::string logMsg = formatMetric(msg); // formatting on the event loop thread
    logger.log(logMsg);                     // logging on the event loop thread
    targetQueue.push(msg);                  // the only work that belongs here
}

Each of these tasks only takes a little time, but the worker threads push the event loop hard, so it adds up. We moved this non-critical work out of the event loop and into the worker threads. After this change, performance improved slightly, which confirmed our theory and built some confidence.

Fix 2: Reduce the number of worker threads

Then we scaled the thread pool down to what the event loop could handle: more threads are not always better, and the thread count must match the system’s architectural bottleneck. We tested multiple configurations. With fewer workers, the event loop had less pressure, and the system became noticeably faster. The sweet spot was 16 cores per instance instead of 36. Across our deployment of 50 instances, this reduced our total footprint from 1,800 cores down to 800, while maintaining better latency and higher throughput.

Takeaways

  • Thread tuning: More is not always better. Always test and find the best configuration for your system.
  • Event loop design: Keep it fast. The event loop exists only to route events; never put even slightly slow code inside it.

AI was used to help refine and polish this article based on factual information

Real-time Weather Augmentation

Our HTTP server receives global traffic and must respond in under 100ms. For requests from specific regions (like the US or EU), we need to add local weather data (temperature, rain, etc.) before passing them to upstream processing. There is a lot of work upstream, so weather augmentation must finish in under 3ms.

The requirements are strict:

  • Input: The request has geographic coordinates (latitude and longitude).
  • External API: Weather data comes from a third-party API, which can take seconds to respond.
  • Latency: Must finish in under 3ms.
  • Targeting: We only add weather data to requests from targeted regions.

We have two main challenges on the hot path:

  • Checking if a request needs weather data.
  • Augmenting the request with weather data within the latency budget.

Challenge 1: Check if a request needs weather data

This happens for every HTTP request, so it must be very fast and have predictable execution time.

The Naive Approach: Ray Casting

We can save the borders of the US and EU as complex polygons. When a request comes in, we run a “point-in-polygon” algorithm like ray casting. This approach is simple, but:

  • Heavy CPU load: Country borders are complex. The US polygon alone has thousands of edges. Ray casting is expensive per request.
  • Unpredictable latency: Computation time depends on polygon shape and point location. A point inside a complex border takes much longer than a point outside. This breaks p99 latency limits.

+-------------+
| HTTP Request|
| (lat, lng)  |
+------+------+
       |
       v
+--------------+
| Ray Casting  |
+------+-------+
       |
       v
+--------------+
|   Yes / No   |
+--------------+

The Optimized Approach: Uber H3

To remove heavy computation from the hot path, we do the work offline. We use Uber H3, a grid system that divides the globe into hexagons. Each hexagon covers a geographic area and has a unique 64-bit ID.

Offline Preprocessing: We map our region polygons (US and EU) onto the H3 grid at a chosen resolution. We find which H3 hexagons are inside our polygons. The result is a set of Target Cell IDs. We load this set into memory as a Go map[uint64]bool.

+------------------+
| Region Polygons  |
+---------+--------+
          |
          v
+------------------+
| Polygon -> H3    |
| Cell Conversion  |
+---------+--------+
          | (Offline)
          v
+------------------+
| Set of Cell IDs  |
+------------------+

Hot Path Implementation: When a request arrives, we convert its coordinates to an H3 cell ID using the H3 library. Then we check if this ID exists in our in-memory set.

+-------------+
| HTTP Request|
| (lat, lng)  |
+------+------+
       |
       v
+--------------+
| lat/lng ->   |
| H3 Cell ID   |
+------+-------+
       | O(1) Lookup
       v
+--------------+
| Cell ID in   |
| Target Set?  |
+------+-------+
       |
       v
+--------------+
|  Yes / No    |
+--------------+

Pseudocode:

// Pre-loaded in memory at startup
var targetH3Cells map[uint64]bool // populated from offline preprocessing

func ShouldAugmentWeather(lat, lng float64) bool {
    // Convert lat/lng to an H3 index at resolution 5.
    // This is purely mathematical and takes nanoseconds.
    cellID := h3.LatLngToCell(h3.LatLng{Lat: lat, Lng: lng}, 5)

    // O(1) lookup
    return targetH3Cells[uint64(cellID)]
}

Building the Cell ID Set

We use geodata, a small open-source tool that generates H3 cell IDs for every country in the world. It reads country boundary polygons from Natural Earth GeoJSON data, fills each polygon with H3 cells at a given resolution, and writes the results to CSV files.

Performance Tuning: H3 Resolution

Choosing H3 resolution is a space-time trade-off.

  • Higher resolution (smaller hexagons): Better accuracy, more memory
  • Lower resolution (larger hexagons): Less accurate, less memory

For us, resolution 5 (each hexagon is ~252 sq km, with ~8.5 km edges) is a good starting point. Weather targeting doesn’t need high accuracy, and we can easily adjust if the requirements change. Another optimization is to use different resolutions for different regions, but that is not covered here.


Challenge 2: Augmenting Requests Without Blocking

Now we know the request is in a targeted region. We must fetch and attach weather data to the request within 3ms.

The Naive Approach: Direct API Calls

The simple way is to call the third-party weather API in the request handler. This immediately breaks the 3ms budget, as the external API can take seconds to respond.

The Optimized Approach: Hit-Miss Cache with Refresher

Weather does not change every minute. If two requests come from the same H3 hexagon within 15 minutes, the weather data is probably the same. Even if the weather changes, the vendor may not update very frequently anyway (I know, I just can’t prove it 😏).

We decouple the hot path from the slow API. We use Redis as a cache, and a background worker to call the external API.

The Hot Path Flow

In the request handler, when a request needs weather, we get its H3 Cell ID and check Redis for cached data.

  • Cache Hit: Attach the cached weather data to the request.
  • Cache Miss: Add the missing cell ID to a Redis Set and forward the request upstream without weather data.

+------------------+
| HTTP Request     |
| (lat, lng)       |
+--------+---------+
         |
         v
+------------------+
| lat/lng -> H3 ID |
+--------+---------+
         |
         v
+------------------+
| Redis Lookup     |
+----+--------+----+
     |        |
   Hit      Miss
     |        |
     v        v
+--------+   +----------------------+
| Attach |   | Add cell ID to Redis |
| Weather|   | set: cells_to_fetch  |
+---+----+   +----------+-----------+
    |                   |
    +--------+----------+
             |
             v
    +------------------+
    | Forward to       |
    | Upstream         |
    +------------------+
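The hot-path logic fits in a few lines. Here is a sketch using a tiny in-memory stand-in for Redis (so it runs anywhere); in production the lookup would be a Redis GET and the miss path a Redis SADD.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeRedis stands in for Redis in this sketch: a key-value cache plus
// a set that collects cell IDs needing a fetch ("cells_to_fetch").
type fakeRedis struct {
	mu           sync.Mutex
	weather      map[uint64]string
	cellsToFetch map[uint64]bool
}

func newFakeRedis() *fakeRedis {
	return &fakeRedis{weather: map[uint64]string{}, cellsToFetch: map[uint64]bool{}}
}

// augment implements the hot path: on a hit, return the cached weather;
// on a miss, enqueue the cell for the background refresher and move on.
func (r *fakeRedis) augment(cellID uint64) (weather string, hit bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if w, ok := r.weather[cellID]; ok {
		return w, true // cache hit: attach and forward
	}
	r.cellsToFetch[cellID] = true // SADD: set semantics dedup for free
	return "", false              // cache miss: forward without weather
}

func main() {
	r := newFakeRedis()
	r.weather[0x85283473fffffff] = `{"temp_c":21}`

	w, hit := r.augment(0x85283473fffffff)
	fmt.Println(hit, w) // true {"temp_c":21}

	_, hit = r.augment(0x852834737ffffff) // unknown cell
	fmt.Println(hit, len(r.cellsToFetch)) // false 1
}
```

Note the miss path never waits on anything slow: it records the cell ID and forwards the request immediately.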

The Background Refresher

We run a background worker alongside our HTTP server. We call it the Weather Refresher.

Every few seconds, the refresher reads the cells_to_fetch Set in Redis. It pops the cell IDs, converts them to latitude/longitude points, and calls the slow API. When it gets the data, it saves it in Redis with a TTL of 30 to 60 minutes.

(runs every 10s)

+----------------------+
| Weather Refresher    |
+----------+-----------+
           |
           v
+----------------------+
| Pop cells_to_fetch   |
+----------+-----------+
           |
           v
+----------------------+
| Cell -> lat/lng      |
+----------+-----------+
           |
           v
+----------------------+
| Call Weather API     |
| (slow, ~seconds)     |
+----------+-----------+
           |
           v
+----------------------+
| Update Redis Cache   |
| (Set key + TTL)      |
+----------------------+
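One pass of the refresher can be sketched the same way, with maps standing in for the Redis set and cache; fetchWeather is a placeholder for the slow third-party API call, and the real code would use SPOP and SET with a TTL.

```go
package main

import "fmt"

// refreshOnce pops every pending cell ID, fetches its weather, and
// writes the result into the cache. In production the set and cache
// live in Redis (SPOP / SET with a 30-60 minute TTL).
func refreshOnce(
	cellsToFetch map[uint64]bool,
	cache map[uint64]string,
	fetchWeather func(cellID uint64) string,
) int {
	fetched := 0
	for cellID := range cellsToFetch {
		cache[cellID] = fetchWeather(cellID) // seconds-slow, but off the hot path
		delete(cellsToFetch, cellID)         // popped: each cell fetched once
		fetched++
	}
	return fetched
}

func main() {
	pending := map[uint64]bool{101: true, 202: true}
	cache := map[uint64]string{}

	// Placeholder API: real code converts the cell back to lat/lng
	// and calls the weather vendor here.
	fake := func(cellID uint64) string { return fmt.Sprintf(`{"cell":%d}`, cellID) }

	n := refreshOnce(pending, cache, fake)
	fmt.Println(n, len(pending), len(cache)) // 2 0 2
}
```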

In our cache miss path, we use Redis SADD. If a popular location’s cache expires, we might get thousands of requests from that cell instantly. Using a Redis Set, items are automatically deduplicated. The worker fetches data for each cell only once.

Putting It All Together

                   OFFLINE (build time)                            RUNTIME
          ┌──────────────────────────────┐
          │  Region Polygons (GeoJSON)   │
          └──────────────┬───────────────┘
                         │
                         v
          ┌──────────────────────────────┐
          │  geodata: Polygon -> H3      │
          │  Cell IDs (resolution 5)     │
          └──────────────┬───────────────┘
                         │
                         v
          ┌──────────────────────────────┐
          │  CSV files per country       │        ┌───────────────────────────────┐
          │  (e.g. h3_res_5_usa.csv)     │───────>│  Startup: load CSV into       │
          └──────────────────────────────┘        │  map[uint64]bool (Target Set) │
                                                  └──────────────┬────────────────┘
                                                                 │
            ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
                                                                 │
                            HOT PATH (per request, < 3ms)        │
                                                                 v
                                                  ┌───────────────────────────────┐
                                                  │  HTTP Request (lat, lng)      │
                                                  └──────────────┬────────────────┘
                                                                 │
                                                                 v
                                                  ┌───────────────────────────────┐
                                                  │  lat/lng -> H3 Cell ID        │
                                                  └──────────────┬────────────────┘
                                                                 │
                                                                 v
                                                  ┌───────────────────────────────┐
                                                  │  Cell ID in Target Set?       │
                                                  └──────┬────────────────────────┘
                                                         │                  │
                                                        YES                 NO
                                                         │                  │
                                                         v                  │
                                                  ┌──────────────────┐      │
                                                  │  Redis Lookup    │      │
                                                  │  (cache key =    │      │
                                                  │   H3 Cell ID)    │      │
                                                  └──┬───────────┬───┘      │
                                                     │           │          │
                                                   HIT          MISS        │
                                                     │           │          │
                                                     v           v          │
                                              ┌──────────┐ ┌────────────┐   │
                                              │  Attach  │ │ SADD cell  │   │
                                              │  weather │ │ to Redis   │   │
                                              │  data    │ │ set:       │   │
                                              │          │ │ cells_to_  │   │
                                              │          │ │ fetch      │   │
                                              └────┬─────┘ └─────┬──────┘   │
                                                   │             │          │
                                                   │      ┌──────┘          │
                                                   │      │  Do nothing     │
                                                   │      │                 │
                                                   │      │                 │
                                                   │      │                 │
                                                   └──┬───┘                 │
                                                      │                     │
                                                      v                     │
                                                  ┌──────────────────┐      │
                                                  │  Forward to      │<─────┘
                                                  │  Upstream        │
                                                  └──────────────────┘

            ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─

                            BACKGROUND LOOP (every 10s)

                        ┌─────────────────────────────────────┐
                        │  Weather Refresher                  │
                        └──────────────────┬──────────────────┘
                                           │
                                           v
                        ┌─────────────────────────────────────┐
                        │  SPOP cells_to_fetch (up to 50)     │
                        └──────────────────┬──────────────────┘
                                           │
                                           v
                        ┌─────────────────────────────────────┐
                        │  Cell ID -> lat/lng                 │
                        └──────────────────┬──────────────────┘
                                           │
                                           v
                        ┌─────────────────────────────────────┐
                        │  Call Weather API (slow, ~seconds)  │
                        └──────────────────┬──────────────────┘
                                           │
                                           v
                        ┌─────────────────────────────────────┐
                        │  Update Redis Cache (key + TTL)     │
                        └─────────────────────────────────────┘

Takeaways

  • Move heavy math offline
  • Remove uncontrollable latency from the hot path
  • Decouple with background workers

AI was used to help refine and polish this article based on factual information

Custom Routing

Disclaimer: All numbers and examples in this article describe abstract ideas. They are not exact facts about any real system.

An RTB system handles 10,000 active campaigns and is horizontally scalable: we can add more compute nodes to handle more load. But scaling campaign capacity revealed a real problem: the more we scaled, the worse our match rate got.


The Problem: Campaigns Are Spread Thin

Matching a bid request to a campaign is CPU-heavy. For each incoming bid request, a node must check eligibility across every campaign: targeting rules, budget limits, frequency caps, and more. A single node cannot handle 10,000 campaigns at full traffic, so we distribute campaigns across nodes.

Each node only holds a subset of campaigns — around 1,000. That means at any moment:

Total campaigns: 10,000
Nodes: 10
Campaigns per node: ~1,000

Incoming bid requests are routed to one node. That node only “sees” its 1,000 campaigns. If the right campaign for this request lives on a different node, we miss the match entirely.

As we added more nodes to scale capacity, campaigns were spread even thinner across them. The match rate did not improve — it got worse.

   Bid Request
       |
       v
+------+------+
|    Node 1   |  <- sees only 1,000 of 10,000 campaigns
| 1,000 camps |
+-------------+
+-------------+
|    Node 2   |
| 1,000 camps |
+-------------+
   ... x10

This is a fundamental issue with random (round-robin) routing. Every node is equally likely to receive a bid request, but no node has visibility of all campaigns. Horizontal scaling adds capacity but hurts utilization.


The Solution: A Custom Reverse Proxy

We built a reverse proxy that sits in front of the compute nodes. Instead of routing requests randomly, it routes each bid request to the node most likely to have a matching campaign.

The proxy does not run the full filtering logic — that is too CPU-heavy. Instead, it applies a small number of fast, lightweight filter checks. The goal is to eliminate clearly wrong nodes and increase the probability that a bid request lands on a node with a matching campaign.

   Bid Request
       |
       v
+------------------+
|  Reverse Proxy   |  <- applies lightweight filter logic
|  (Go + Redis)    |
+--+---+---+---+---+
   |   |   |   |
   v       v
+-------+  +-------+
| Node1 |  | Node3 |  <- best candidates for this request
| 1,000 |  | 1,000 |
| camps |  | camps |
+-------+  +-------+

We built this in Go using the standard net/http/httputil.ReverseProxy package with custom routing logic. Redis stores per-node metadata — the campaigns each node holds and which targeting dimensions they cover.

We store this metadata in Redis instead of loading it into Go’s memory as a large map. The reason: keeping millions of entries in a map[string]string inside a Go process puts heavy pressure on the garbage collector. Go’s GC must scan every pointer in the map on every cycle, which causes periodic CPU spikes and unpredictable tail latency. Moving the data to Redis removes it from Go’s heap entirely. We covered this issue in detail in Large Maps Are Bad for Go GC, which we discovered while building this proxy.


How the Routing Works

When a bid request arrives at the proxy:

  1. Read key attributes from the request (e.g. country, device type, ad format).
  2. For each compute node, check Redis to see if that node has any active campaigns that match those attributes.
  3. Route the request to the node with the highest number of potential matches.

We only apply a few fast filter checks — the ones with high selectivity and low compute cost. The full filtering (budget checks, frequency caps, etc.) still happens on the compute node.

Bid Request
    |
    v
+---------------------+
| Parse Request Attrs |  <- country, device, format, etc.
+----------+----------+
           |
           v
+---------------------+
| Query Redis:        |
| Which nodes have    |
| matching campaigns? |
+----------+----------+
           |
           v
+---------------------+
| Pick best node      |
| (highest match      |
|  potential)         |
+----------+----------+
           |
           v
+---------------------+
| Forward to Node     |
+---------------------+

Pseudocode for the routing decision:

func pickBestNode(req BidRequest, nodes []Node) Node {
    bestNode := nodes[0]
    bestScore := 0

    for _, node := range nodes {
        // Lightweight Redis check: how many campaigns on this node
        // match the request's key attributes?
        score := redis.GetMatchScore(node.ID, req.Country, req.DeviceType, req.AdFormat)

        if score > bestScore {
            bestScore = score
            bestNode = node
        }
    }

    return bestNode
}

Upstream Health Checks

A naive reverse proxy forwards traffic to any node that is alive. We need two additional checks.

Check 1: Node is Up

This is standard. If a node does not respond to health checks, we stop sending requests to it. We poll each node’s health endpoint and remove failed nodes from the routing pool.

Check 2: Node Has Active Campaigns

This one is less obvious. When a node crashes and restarts, it comes back up healthy but empty — it has no active campaigns yet because it has not finished loading its campaign data from the data store.

This is a real problem. An empty node responds to requests very fast (it has no campaigns to check, so it does nothing). Without this check, the proxy would see a fast, healthy node and flood it with traffic. The node would respond quickly with zero bids, wasting every single request.

We require each compute node to expose an API endpoint that returns its current active campaign count:

GET /status
{
  "active_campaigns": 1024
}

The proxy checks this count. If it is zero (or below a minimum threshold), the node is skipped from the routing pool until it is ready.

+------------------+
| Health Check     |
| per node         |
+------------------+
        |
        v
  Is node alive?
     /       \
   No         Yes
    |           |
  Remove      Does it have
  from pool   campaigns?
              /       \
            No         Yes
             |           |
           Skip        Include
           node        in pool

Results and Trade-offs

After deploying the custom routing proxy, our match rate increased by around 30%. This came entirely from routing bid requests to nodes that have relevant campaigns, instead of routing randomly.

The trade-off is added latency. The proxy needs to query Redis and run routing logic before forwarding. This adds around 5ms to each request. For our system, that is acceptable — bid requests have a strict deadline (usually 100ms total), and 5ms leaves enough room for the compute node to do full filtering and ML scoring.

The proxy also adds one more component to the system. We run it with multiple instances behind a load balancer to avoid it becoming a single point of failure.


Key Takeaways

  1. Random routing is not always right. In systems where data is partitioned across nodes, routing requests randomly means lower utilization. Custom routing improves match rate without adding hardware.
  2. Do lightweight filtering at the proxy layer. Running full filtering at the proxy is too expensive. A few fast checks are enough to make better routing decisions.
  3. Healthy does not mean ready. A node that just restarted can appear healthy but have no data. Check application-level readiness, not just network-level liveness.
  4. Fast nodes attract traffic. An empty node responds fast. Without readiness checks, a proxy will send it all the traffic. Always check for readiness before routing.

AI was used to help refine and polish this article based on factual information

Large Maps Are Bad for Go GC

How Go GC Works

Go uses a concurrent mark-and-sweep garbage collector. During every GC cycle, the runtime has to figure out which objects in memory are still being used (live) and which can be safely thrown away. It does this in two main phases:

  1. Mark phase: The GC starts from “roots” (like global variables and stack variables) and traces every pointer it can find. If it finds a pointer to an object, it marks that object as “alive”. It keeps following pointers from object to object until everything reachable is marked.
  2. Sweep phase: The GC goes through memory and reclaims any space occupied by objects that were not marked as alive.

The critical thing to understand here is that the cost of the mark phase scales with the number of pointers it has to scan, not the raw number of bytes in memory.

If you have a 1 Gigabyte array of pure bytes ([]byte), the GC looks at it, sees there are zero pointers inside, and moves on immediately. It takes almost zero time. But if you allocate 1 Gigabyte of small objects linked together by millions of pointers, the GC has to chase down every single one of those pointers. That takes a lot of CPU cycles and pauses your application.
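You can see this difference directly. The sketch below times runtime.GC() with data stored first as a single pointer-free byte slice, then as millions of small strings. Exact numbers vary by machine; the gap between the two is the point.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// gcPause forces a collection and returns how long it took.
func gcPause() time.Duration {
	start := time.Now()
	runtime.GC()
	return time.Since(start)
}

func main() {
	// Pointer-free: one big []byte. The GC sees no pointers inside
	// and skips the contents entirely.
	flat := make([]byte, 1<<27) // 128 MB
	flatPause := gcPause()
	runtime.KeepAlive(flat)

	// Pointer-heavy: millions of small strings the GC must trace.
	heavy := make([]string, 0, 2_000_000)
	for i := 0; i < 2_000_000; i++ {
		heavy = append(heavy, fmt.Sprintf("value-%d", i))
	}
	heavyPause := gcPause()
	runtime.KeepAlive(heavy)

	fmt.Printf("flat []byte:  %v\n", flatPause)
	fmt.Printf("many strings: %v\n", heavyPause)
}
```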

This is a well-known issue in the Go community (runtime: Large maps cause significant GC pauses).

Large Map in Go

Let’s look at map[string]string. When you create a map with millions of entries, you might think you are just storing keys and values. But under the hood, a Go map is a complex hash table.

A map[string]string contains:

  • A pointer to the internal hmap struct (the header).
  • Pointers to an array of buckets. Each bucket holds up to 8 key-value pairs.
  • Overflow buckets, which are linked lists of extra buckets if there are collisions.
  • For each entry in the map, there is a key and a value.

Wait, it gets worse. A string in Go is not just a blob of text. Under the hood, a string is a struct containing two things: a pointer to the actual underlying byte array, and an integer for the length.

So, if you have a map[string]string with 10 million entries, you do not just have 10 million items. You have:

  • Millions of internal bucket pointers.
  • 10 million pointers for the keys (the string headers).
  • 10 million pointers for the values.

That is over 20 million individual pointers! During every single GC cycle, Go’s garbage collector must scan all of them. Even if you never modify the map, the GC does not know that; it has to scan the whole thing every time to make sure the memory is still reachable.

The case: The high-traffic HTTP reverse proxy

I once worked on an HTTP reverse proxy service written in Go. Its job was to apply custom routing logic, mapping incoming requests to correct upstream services. I describe the routing system in detail in Custom Routing. This GC issue is one of the things I ran into while building it:

  • We ran 10 nodes, each with 4 CPU cores.
  • Each node processed ~3,000 queries per second (QPS); payloads were ~3KB of JSON.
  • Latency: ~10ms

To make routing fast, we decided to load all the routing rules into a map[string]string when the process started. It worked beautifully at first.

But over time, as the routing rule set grew, things got weird. We noticed periodic CPU spikes across the nodes. At first, the Prometheus charts looked fine because metrics were averaged over a 5-minute interval, which smoothed out the spikes. Eventually, we saw increased tail latency (P99).

Then I used pprof to see where the CPU time was going. I expected to see JSON parsing taking up the time. Instead, runtime.gcDrainMarkWorker and runtime.gcDrainMarkWorkerIdle dominated the CPU profile. The GC was working overtime just to check a large map that would never go out of scope.

Here is the workflow of the old system: Incoming Request -> Go HTTP server -> Lookup rule in Large Map -> Forward to Upstream

Reproducing the Issue

To prove this was the root cause, I wrote a simple script to benchmark the GC pause time with different map sizes.

package main

import (
	"fmt"
	"runtime"
	"time"
)

func run(n int) {
	// Pre-allocate the map to avoid resizing cost during setup
	routes := make(map[string]string, n)

	// Populate the map with n items
	for i := range n {
		routes[fmt.Sprintf("key-%d", i)] = fmt.Sprintf("value-%d", i)
	}

	const runs = 10
	var totalPause time.Duration
	
	// Trigger GC manually and measure how long it takes
	for range runs {
		start := time.Now()
		runtime.GC()
		pause := time.Since(start)
		totalPause += pause
	}

	avgMs := float64(totalPause.Milliseconds()) / float64(runs)
	fmt.Printf("n=%d | avg GC pause=%.3fms\n", n, avgMs)

	// Keep the map alive until after the GC measurements
	runtime.KeepAlive(routes)
}

func main() {
	run(1_000_000)
	run(10_000_000)
	run(20_000_000)
}

Running this script gave very clear results:

% go run ./...
n=1000000 | avg GC pause=10.200ms
n=10000000 | avg GC pause=103.300ms
n=20000000 | avg GC pause=342.600ms

As you can see, the GC pause time grows roughly linearly with the number of items in the map. A 342ms pause in a system that requires sub-100ms latency is an absolute disaster.

And here’s the real flame chart: GC CPU usage before optimization

Finding a Solution

Approach 1: Off-Heap Caching Libraries

I considered using libraries like BigCache or FreeCache. These libraries avoid GC overhead by allocating large byte arrays (which contain no pointers) and managing the memory layout themselves.
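A hand-rolled sketch of the same idea (this is not the BigCache or FreeCache API, just an illustration of why the trick works): hash each key to an integer and store values in one contiguous byte slab, so the index map holds only integers and the GC has almost nothing to scan. For brevity this sketch ignores hash collisions, which a real implementation must handle.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pointerFreeStore illustrates the off-heap trick: keys are hashed to
// uint64 and values live in one contiguous byte slab, so the index map
// contains only integers for the GC to look at.
type pointerFreeStore struct {
	index map[uint64][2]uint32 // key hash -> (offset, length) in slab
	slab  []byte
}

func newStore() *pointerFreeStore {
	return &pointerFreeStore{index: make(map[uint64][2]uint32)}
}

func hashKey(k string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(k))
	return h.Sum64()
}

// Set appends the value to the slab and records its location.
func (s *pointerFreeStore) Set(k, v string) {
	off := uint32(len(s.slab))
	s.slab = append(s.slab, v...)
	s.index[hashKey(k)] = [2]uint32{off, uint32(len(v))}
}

// Get recovers the value bytes from the slab.
func (s *pointerFreeStore) Get(k string) (string, bool) {
	loc, ok := s.index[hashKey(k)]
	if !ok {
		return "", false
	}
	return string(s.slab[loc[0] : loc[0]+loc[1]]), true
}

func main() {
	s := newStore()
	s.Set("/api/users", "upstream-a:8080")
	v, _ := s.Get("/api/users")
	fmt.Println(v) // upstream-a:8080
}
```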

Approach 2: External Store (Redis)

Instead of keeping the data in memory, why not move it out of the Go process entirely? Redis is built exactly for this use case. By moving the routing rules to Redis, we would completely remove the data from Go’s memory space, freeing the GC.

We went with Redis

Both solutions sound good, but we decided to go with Redis, because it also reduces the whole system's memory footprint.

The new workflow looks like this: Incoming Request -> Go HTTP server -> Lookup rule in Redis -> Forward to Upstream
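The lookup step can be sketched behind a small interface (shown here with an in-memory stand-in; in production the implementation would issue a Redis GET via a client library, which is not reproduced here, and the names RuleStore, memStore, and route are my own, not from the real system):

```go
package main

import "fmt"

// RuleStore abstracts the routing-rule lookup so the proxy does not care
// whether rules live in a local map or in Redis.
type RuleStore interface {
	Upstream(host string) (string, bool)
}

// memStore is an in-memory stand-in used for illustration; the production
// implementation would query Redis instead.
type memStore map[string]string

func (m memStore) Upstream(host string) (string, bool) {
	v, ok := m[host]
	return v, ok
}

// route resolves a request host to an upstream, falling back to a default.
func route(store RuleStore, host string) string {
	if up, ok := store.Upstream(host); ok {
		return up
	}
	return "default-upstream"
}

func main() {
	store := memStore{"api.example.com": "users-svc:8080"}
	fmt.Println(route(store, "api.example.com"))     // users-svc:8080
	fmt.Println(route(store, "unknown.example.com")) // default-upstream
}
```

Keeping the lookup behind an interface also made it easy to benchmark the map-backed and Redis-backed versions against each other.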

GC CPU usage after optimization

The results:

  • GC pauses dropped significantly
  • Overall RAM usage decreased: instead of duplicating the data across 10 Go nodes, we kept a single copy in Redis. This saved us a lot of infra cost
  • Latency trade-off: Redis adds ~1-2 milliseconds, which is acceptable

Takeaways:

  • Profile early and often
  • Avoid large maps with pointers
  • Consider external stores

AI was used to help refine and polish this article based on factual information

We Must Discard Unread Response Bodies in Golang

Out of the box, Go’s http.Client is built for high performance. It uses a component called http.Transport to manage a pool of underlying TCP connections.

When a service makes a request to an external API, the Transport checks the pool to see if there is an idle connection waiting to be reused. If there is, it sends the request over that existing connection. This is HTTP Keep-Alive in action. Reusing connections saves the heavy cost of DNS resolution, TCP handshakes, and TLS setup.

When a service finishes reading a response and calls resp.Body.Close(), the Transport takes that connection and puts it back into the pool.

But there is a catch. The connection can only be reused if the server has finished sending the response and the client has finished reading it. If there is leftover data on the wire, the connection is considered “dirty.”

Sometimes, we only care about the HTTP status code. For example, maybe we are pinging a health check endpoint or sending a fire-and-forget webhook. We write something like this:

resp, err := client.Post("https://api.example.com/webhook", "application/json", body)
if err != nil {
    return err
}
defer resp.Body.Close()

// The response body is not needed here; only success status matters.
if resp.StatusCode != http.StatusOK {
    return fmt.Errorf("unexpected status: %d", resp.StatusCode)
}

return nil

Because we did not read the response body to the end, the Go standard library does not know what is still sitting on the incoming network buffer. To prevent corrupted reads for the next request that might try to use this connection, the http.Transport permanently closes the underlying TCP connection and throws it away. As a result, the connection pool becomes useless. For every single request, it establishes a brand new connection.

Proving it with Benchmarks

This can be proven with a simple benchmark. A local HTTP server is set up to return a payload. Then, the httptrace package is used to hook into the GotConn lifecycle event, which tells us exactly whether a connection was freshly created or reused from the pool.

(Note: Source code for this benchmark is available here)

func doBench(b *testing.B, discard bool) {
	server := setupServer() // Returns an HTTP server sending 32KB of data
	defer server.Close()

	client := &http.Client{}
	var created, reused int

	for b.Loop() {
		trace := &httptrace.ClientTrace{
			GotConn: func(connInfo httptrace.GotConnInfo) {
				if connInfo.Reused {
					reused++
				} else {
					created++
				}
			},
		}

		req, _ := http.NewRequest(http.MethodGet, server.URL, nil)
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

		resp, err := client.Do(req)
		if err != nil {
			b.Fatal(err)
		}

		// The critical difference
		if discard {
			io.Copy(io.Discard, resp.Body)
		}
		resp.Body.Close()
	}
	b.Logf("Connections Created: %d, Connections Reused: %d", created, reused)
}

Result:

BenchmarkHTTPDiscard
    main_test.go:72: Connections Created: 1, Connections Reused: 25208
BenchmarkHTTPDiscard-8             25209             46355 ns/op           38582 B/op         69 allocs/op
BenchmarkHTTPNoDiscard
    main_test.go:72: Connections Created: 11319, Connections Reused: 0
BenchmarkHTTPNoDiscard-8           11319            106628 ns/op           51070 B/op        131 allocs/op

Summary

When working with net/http in Go, we should never assume defer resp.Body.Close() is a complete solution:

  • We must read the body to EOF
  • Use io.Copy(io.Discard, resp.Body) to safely flush unwanted bytes

AI was used to help refine and polish this article based on factual information

Building a Simple Distributed Task Scheduling System

In this post, I will describe a simple, horizontally scalable distributed task scheduling system using just Go and MongoDB. For what purpose, I can’t say. But it works.

System requirements:

  1. Unpredictable Workloads: All tasks must be executed. A task can take anywhere from 2 seconds to 10 minutes.
  2. Horizontal Scaling: The system scales simply by adding more worker nodes.
  3. Simplicity: Adding new nodes or tasks must be straightforward.

The system has three components:

Task Handler:
This acts as the controller and performs the following tasks:

  • Fetches new tasks from external sources.
  • Inserts them into MongoDB with a pending status.
  • Scans MongoDB for tasks with a done status and sends the results to external services.
  • Scans MongoDB for tasks with a working status to check if they are overdue (say, 15 minutes). If they are, it resets their status to pending.

Worker:
This component does the heavy lifting and performs the following tasks:

  • Polls MongoDB for tasks with a pending status.
  • Claims a task by atomically changing its status to working using MongoDB’s findOneAndUpdate.
  • Finishes the work and saves the result to the document, updating its status to done.

MongoDB:
It’s MongoDB 😏. It handles atomic operations and storage.

AI was used to help refine and polish this article based on factual information

Matrix-Based Filtering

The Matching Problem

In a Real-Time Bidding (RTB) system, a Demand-Side Platform (DSP) handles thousands of bid requests every second. For each request, we must quickly find which active campaigns want to bid on it.

Each bid request has multiple attributes, like country (“us”, “jp”, “sg”), language (“en”, “fr”, “vn”), device type, and website. At the same time, each campaign has strict targeting rules. One campaign might only target iOS users from the “us” who speak “en”. Another might accept users from anywhere except the “eu”.

We only have ~100ms to find all matching campaigns. A simple approach is to loop through every campaign and check its rules one by one. But this is too slow: checking 50 rules for thousands of campaigns on every request takes too much time.

A better approach is matrix-based filtering. Instead of checking a request against each campaign, we use an inverted index: we map request attributes directly to the campaigns looking for them.

The Matrix-Based System

The matching logic uses bitsets. A bitset is an array of bits (0s and 1s). Every campaign gets a fixed ID, and in a bitset the bit at index i always stands for campaign i. We do not store rules inside campaign objects. Instead, each attribute has its own filter, like a Language Filter or a Location Filter. Each filter manages its own include and exclude rules using maps: the key is the attribute value (like “en” or “us”), and the value is a bitset. We evaluate these rules based on simple logic:

  • No include, no exclude: accept all values.
  • No include, has exclude: accept all values except those in the exclude list.
  • Has include, no exclude: accept only values in the include list.
  • Has include, has exclude: accept only values in the include list that are not in the exclude list.

To do this fast, each filter keeps three things:

  1. include_map: Maps a value to a bitset of campaigns that want it.
  2. exclude_map: Maps a value to a bitset of campaigns that block it.
  3. empty_include_bitset: A single bitset of campaigns that have no include rules (so they accept anything by default).

Evaluating a Filter

When a bid request comes in with language=“en”, the Language Filter does a fast look-up.

First, it gets include_map["en"] and combines it with empty_include_bitset using a bitwise OR. This gives us a bitset of all campaigns that accept “en”.

Second, it gets exclude_map["en"].

Finally, it removes the excluded campaigns using a bitwise AND NOT. We can write this as one simple formula to find the valid campaigns: (include_map["en"] OR empty_include_bitset) AND NOT exclude_map["en"]

This simple math lets us check the rules for all campaigns at once.
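The per-filter evaluation can be sketched in a few lines of Go. Here, Bitset is a toy single-word bitset limited to 64 campaigns, and the type and field names are my own, not from any particular library (a real system would use a multi-word bitset):

```go
package main

import "fmt"

// Bitset is a toy 64-campaign bitset: bit i stands for campaign i.
type Bitset uint64

// Filter holds one attribute's rules, as described above.
type Filter struct {
	include      map[string]Bitset // value -> campaigns that require it
	exclude      map[string]Bitset // value -> campaigns that block it
	emptyInclude Bitset            // campaigns with no include rules
}

// Eval returns the campaigns that accept value v:
// (include[v] OR emptyInclude) AND NOT exclude[v]
func (f Filter) Eval(v string) Bitset {
	return (f.include[v] | f.emptyInclude) &^ f.exclude[v]
}

func main() {
	lang := Filter{
		include:      map[string]Bitset{"en": 0b0011}, // campaigns 0,1 target "en"
		exclude:      map[string]Bitset{"en": 0b0100}, // campaign 2 blocks "en"
		emptyInclude: 0b1100,                          // campaigns 2,3 accept any language
	}
	fmt.Printf("%04b\n", lang.Eval("en")) // 1011: campaigns 0, 1, and 3
}
```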

Combining the Rules

When a bid request starts, we create a main bitset called active_campaigns. We set all bits to 1 because all campaigns start as valid choices. Then, we check each filter. After a filter gives us its result bitset, we use a bitwise AND on active_campaigns to update the state.

Request: loc="jp", lang="en"

[active_campaigns]  1111... (All bits start as 1)
       |
       v
Language Filter -> (include_map["en"] | empty_include) & ~exclude_map["en"]
       |
       v
[active_campaigns]  active & language_result
       |
       v
Location Filter -> (include_map["jp"] | empty_include) & ~exclude_map["jp"]
       |
       v
[active_campaigns]  active & location_result
       |
       v
Final Match -> Campaign Result
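The pipeline above reduces to one AND per filter. A minimal sketch (again with a toy single-word Bitset and names of my own choosing):

```go
package main

import "fmt"

// Bitset is a toy 64-campaign bitset: bit i stands for campaign i.
type Bitset uint64

// match ANDs each filter's result into the active set: a campaign
// survives only if every filter accepts it.
func match(active Bitset, results ...Bitset) Bitset {
	for _, r := range results {
		active &= r
	}
	return active
}

func main() {
	active := Bitset(0b1111)         // all four campaigns start as valid
	languageResult := Bitset(0b1011) // campaigns 0, 1, 3 accept the language
	locationResult := Bitset(0b0111) // campaigns 0, 1, 2 accept the location
	fmt.Printf("%04b\n", match(active, languageResult, locationResult)) // 0011
}
```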

Performance Benefits

First, we skip the slow loop over all campaigns. The time it takes now depends only on how many attributes the bid request has, not how many campaigns we run. This keeps the system fast even when we add many more campaigns.

Second, looking up items in a map is very fast, and a single bitwise operation updates 64 campaigns at a time (one machine word).

Finally, bitsets use very little memory: one bit per campaign. Because the data is so compact, it fits comfortably inside the fast CPU cache.

Reference:

  • https://github.com/rtbkit/rtbkit/wiki/Filter

AI was used to help refine and polish this article based on factual information