Skip to content

Event Ingestion Architecture: Executive Summary

Event Ingestion Architecture: Executive Summary

Section titled “Event Ingestion Architecture: Executive Summary”

bxb is a usage-based billing platform that must ingest, store, and aggregate 10,000 events per second (864 million events/day) with zero silent data loss. Every event directly affects revenue — lost or miscounted events mean incorrect invoices. The existing PostgreSQL-only architecture is approaching its write ceiling, and aggregation queries over hundreds of millions of rows take minutes instead of milliseconds. A new ingestion architecture is needed that handles current volume with headroom to grow, while keeping infrastructure costs reasonable.

Chosen Solution: API → Kafka → ClickHouse

Section titled “Chosen Solution: API → Kafka → ClickHouse”

We selected a Kafka-mediated batch pipeline into ClickHouse for analytical storage: the API publishes events to a 3-broker Kafka cluster, a Python batch consumer accumulates events into batches of 5,000–10,000 rows, and bulk-inserts them into ClickHouse every 1–5 seconds. PostgreSQL remains the transactional source of truth; ClickHouse powers all billing aggregation queries with 40–100x speedup over PostgreSQL.

API Server ──▶ Kafka (3 brokers, 12 partitions) ──▶ Batch Consumer ──▶ ClickHouse
│ │
└──▶ PostgreSQL (source of truth) Aggregation Queries ◀──┘
MetricValue
Throughput10,000 events/sec sustained (headroom to 50–100k/sec)
Ingestion-to-query latency10–30 seconds typical; <2 minutes worst case
Infrastructure cost~$1,000/month at 10k/sec
Aggregation query speed10–500 ms (40–100x faster than PostgreSQL)
Data durabilityAt-least-once delivery (Kafka acks=all + ClickHouse ReplacingMergeTree dedup)
Event replayFull replay from any Kafka offset (7-day retention)

We evaluated five ingestion patterns and chose Kafka → ClickHouse (P3) because:

  1. Event replay is non-negotiable for billing. Kafka retains events for replay — essential for rebuilding materialized views and fixing aggregation bugs. Direct ClickHouse writes (P4) scored equally on throughput and cost but cannot replay events.
  2. Sufficient throughput with clear scaling path. P3 handles 50–100k/sec by adding Kafka partitions and consumer instances. PostgreSQL-based patterns (P1, P2) hit their write ceiling at our current target rate.
  3. Right-sized complexity and cost. Apache Flink (P5) adds ~$700/month and JVM operational burden with no benefit for our simple write-through use case. Flink can be added later if real-time stream processing requirements emerge.
GoalStatus
Sustain 10,000 events/sec with <5% CPU utilization on ClickHouseAchieved (5–15% CPU)
Aggregation queries under 500ms at 100M rowsAchieved (10–500ms depending on query type)
No silent event loss; at-least-once delivery end-to-endAchieved (Kafka replication + ReplacingMergeTree dedup)
PostgreSQL fallback for billing queries during ClickHouse downtimeAchieved (automatic fallback path)
Infrastructure cost under $1,500/month at 10k/secAchieved (~$1,000/month)
RiskImpactMitigation
Kafka operational complexity — Small team has limited Kafka experience; broker failures or partition rebalancing could cause ingestion delaysMediumKafka’s well-documented ecosystem reduces learning curve. Consumer lag monitoring with automated alerts at >100k messages. Single consumer group keeps the initial deployment simple. Managed Kafka (e.g., Confluent Cloud, AWS MSK) is a fallback if self-hosting proves burdensome.
ClickHouse eventual dedupReplacingMergeTree deduplicates during background merges, not at insert time; billing queries before merge completion may see duplicate eventsHigh (billing accuracy)Use FINAL modifier on billing-critical queries (invoice generation) to force dedup at read time. Daily reconciliation job compares PostgreSQL and ClickHouse event counts, alerting on >1% drift.
Scaling trigger timing — Delayed response to growing event volume could cause consumer lag buildup and increased ingestion latencyMediumMonitoring alerts on consumer lag (>100k messages), ClickHouse merge backlog (>300 parts), and throughput approaching 2x current capacity. Scaling playbook documented with clear triggers: add consumers at 20k/sec, add ClickHouse nodes at 50k/sec.
  • [[Event-Ingestion-Architecture]] — Full technical blog post with implementation details, code examples, performance benchmarks, and troubleshooting guide
  • [[Ingestion-Pattern-Comparison]] — Detailed comparison matrix and TCO analysis across all five ingestion patterns
  • [[Capacity-Planning]] — Cost vs. throughput projections and resource requirements at 10k, 50k, and 100k events/sec