
Streaming Ingestion Patterns (Flink, Kafka Streams, Kinesis, Pulsar)

This document investigates stream processing frameworks and managed streaming services as alternatives or complements to bxb’s chosen Kafka → ClickHouse ingestion architecture. It evaluates Apache Flink, Kafka Streams, AWS Kinesis, and Apache Pulsar — covering their trade-offs for real-time event processing at 10k–100k events/sec.


Streaming ingestion patterns add a real-time processing layer between event producers and the analytical data store. Instead of simply buffering events (as Kafka does in bxb’s current architecture), a stream processor applies transformations, enrichment, aggregation, and filtering in flight.

Pattern (Stream Processing):

API Server → Kafka → Stream Processor (Flink / Kafka Streams) → ClickHouse

Pattern (Managed Streaming):

API Server → Kinesis Data Streams → Kinesis Firehose → ClickHouse / S3

Pattern (Pulsar-based):

API Server → Pulsar → Pulsar Functions → ClickHouse

Compared to bxb’s chosen pattern:

API Server → Kafka → Batch Consumer → ClickHouse

The key distinction: bxb’s current batch consumer is a simple process that reads from Kafka and writes to ClickHouse. A stream processor adds stateful computation — windowed aggregations, joins, deduplication, and complex event processing — between the broker and the sink.
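To make the comparison concrete, the batch consumer's core loop is little more than buffer-and-flush. A minimal Python sketch of that logic, with a `flush` callback standing in for the bulk ClickHouse insert and the Kafka polling loop omitted (names here are illustrative, not bxb's actual code):

```python
import time

class BatchBuffer:
    """Accumulates events and flushes when either the size or the
    age threshold is reached -- the essence of a batch consumer."""

    def __init__(self, flush, max_events=10_000, max_age_sec=5.0):
        self.flush = flush            # e.g. a bulk INSERT into ClickHouse
        self.max_events = max_events
        self.max_age_sec = max_age_sec
        self.events = []
        self.first_event_at = None

    def add(self, event):
        if self.first_event_at is None:
            self.first_event_at = time.monotonic()
        self.events.append(event)
        if self._should_flush():
            self._do_flush()

    def _should_flush(self):
        age = time.monotonic() - self.first_event_at
        return len(self.events) >= self.max_events or age >= self.max_age_sec

    def _do_flush(self):
        self.flush(self.events)
        self.events = []
        self.first_event_at = None
```

The real consumer would poll Kafka, call `add()` per record, and commit offsets only after a successful flush, giving at-least-once delivery into ClickHouse.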


Apache Flink

Apache Flink is a distributed stream processing framework designed for stateful computations over unbounded data streams. It is the de facto standard for high-throughput, low-latency stream processing in production deployments.

┌─────────────────────────┐
│ JobManager │
│ (coordinator/scheduler) │
└─────┬───────┬───────┬──┘
│ │ │
┌─────▼┐ ┌───▼──┐ ┌──▼────┐
│ Task │ │ Task │ │ Task │
│Mgr 1 │ │Mgr 2 │ │Mgr 3 │
│ │ │ │ │ │
│[slot]│ │[slot]│ │[slot] │
│[slot]│ │[slot]│ │[slot] │
└──────┘ └──────┘ └───────┘
  • JobManager: Coordinates job execution, manages checkpoints, handles failover.
  • TaskManagers: Worker processes that execute stream operators. Each has a fixed number of task slots.
  • Task Slots: Units of resource isolation within a TaskManager (memory + CPU fraction).

Flink’s core strength is managing large-scale distributed state:

// Stateful deduplication using Flink's keyed state
public class EventDeduplicator extends KeyedProcessFunction<String, Event, Event> {
    private ValueState<Boolean> seenState;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Boolean> descriptor =
            new ValueStateDescriptor<>("seen", Boolean.class);
        // State TTL: auto-expire after 24 hours to bound memory
        StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.hours(24))
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .cleanupFullSnapshot()
            .build();
        descriptor.enableTimeToLive(ttlConfig);
        seenState = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out)
            throws Exception {
        if (seenState.value() == null) {
            seenState.update(true);
            out.collect(event); // First occurrence — emit
        }
        // Duplicate — drop silently
    }
}

State backends:

| Backend | Storage | Use Case | State Size |
| --- | --- | --- | --- |
| HashMapStateBackend | JVM heap | Low-latency, small state | < ~5 GB per TaskManager |
| EmbeddedRocksDBStateBackend | RocksDB (disk + memory) | Large state, incremental checkpoints | Terabytes per TaskManager |

For bxb’s deduplication use case (tracking transaction_id for 24 hours at 10k/sec = ~864M keys/day), RocksDB is required — heap-based state would cause OOM.
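The sizing claim is easy to sanity-check. A back-of-envelope sketch in Python; the per-entry byte cost is an assumed figure, not a measurement:

```python
def dedup_state_size(events_per_sec, ttl_hours=24, bytes_per_entry=100):
    """Upper bound on dedup state: one entry per unique transaction_id
    retained for the TTL window; bytes_per_entry is an assumed cost for
    key + flag + backend overhead."""
    keys = events_per_sec * ttl_hours * 3600
    return keys, keys * bytes_per_entry

keys, size_bytes = dedup_state_size(10_000)
# keys = 864,000,000 entries; ~86 GB even at 100 bytes each,
# which is why heap-backed state is off the table.
```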

Flink provides exactly-once processing guarantees through a combination of checkpointing and two-phase commit:

Checkpointing:

  1. JobManager injects a checkpoint barrier into the source streams.
  2. Barriers flow through the DAG. When an operator receives barriers from all inputs, it snapshots its state.
  3. State snapshots are persisted to a durable store (S3, HDFS, or GCS).
  4. On failure, Flink restores state from the last completed checkpoint and replays events from the source offset.

Two-Phase Commit (for sinks):

  • Flink’s TwoPhaseCommitSinkFunction ensures that sink writes are committed atomically with checkpoint completion.
  • For ClickHouse: no native two-phase commit support. The practical approach is idempotent writes with ReplacingMergeTree deduplication.
  • For Kafka sinks: Flink supports Kafka transactions, providing true exactly-once end-to-end.

Checkpoint configuration:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000); // Checkpoint every 60 seconds
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
env.getCheckpointConfig().setCheckpointTimeout(120_000);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// Persist checkpoints to S3
env.getCheckpointConfig().setCheckpointStorage("s3://flink-checkpoints/");

Checkpoint overhead:

  • At 10k events/sec with 60-second intervals: ~600k events buffered per checkpoint cycle.
  • RocksDB incremental checkpoints: typically <1 second for moderate state sizes.
  • Full checkpoints on large state (>100 GB): can take minutes and cause backpressure.

Flink provides rich windowing semantics for time-based aggregations:

// Tumbling window: aggregate events every 5 minutes
DataStream<UsageAggregate> aggregated = events
    .keyBy(Event::getOrganizationId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new UsageAggregator());

// Sliding window: compute rolling 1-hour sum, updated every minute
DataStream<UsageAggregate> rolling = events
    .keyBy(Event::getOrganizationId)
    .window(SlidingEventTimeWindows.of(Time.hours(1), Time.minutes(1)))
    .aggregate(new UsageAggregator());

// Session window: group events with gaps < 30 minutes
DataStream<UsageAggregate> sessions = events
    .keyBy(Event::getExternalCustomerId)
    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
    .aggregate(new UsageAggregator());

Window types:

| Type | Behavior | Use Case |
| --- | --- | --- |
| Tumbling | Fixed, non-overlapping intervals | Hourly billing aggregation |
| Sliding | Overlapping intervals | Rolling averages, rate limiting |
| Session | Activity-based, dynamic gaps | User session analytics |
| Global | Single window per key, custom triggers | Accumulate until explicit flush |

Late-arrival handling:

// Allow events up to 5 minutes late
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.allowedLateness(Time.minutes(5))
.sideOutputLateData(lateOutputTag) // Route very late events to side output
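
The allowed-lateness semantics can be illustrated without Flink. A simplified Python simulation, assuming the watermark simply tracks the maximum event time seen (Flink's watermark generation is pluggable and more nuanced):

```python
WINDOW = 300      # 5-minute tumbling windows, in seconds
LATENESS = 300    # allowed lateness, in seconds

def assign(events):
    """Assign (event_time, value) pairs to tumbling windows, accepting
    late events while a window is within allowed lateness, and routing
    anything later to a side output."""
    windows, side_output, watermark = {}, [], 0
    for t, v in events:
        watermark = max(watermark, t)     # simplistic watermark
        start = t - t % WINDOW
        if watermark >= start + WINDOW + LATENESS:
            side_output.append((t, v))    # beyond lateness: side output
        else:
            windows.setdefault(start, []).append(v)
    return windows, side_output
```

An event arriving while its window is still within the lateness bound is merged in (triggering a recomputed aggregate in Flink); anything older goes to the side output for manual handling.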

For teams that prefer SQL over Java/Python APIs:

-- Real-time aggregation using Flink SQL
CREATE TABLE events (
    organization_id STRING,
    code STRING,
    value DOUBLE,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' MINUTE
) WITH (
    'connector' = 'kafka',
    'topic' = 'billing-events',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);

CREATE TABLE hourly_usage (
    window_start TIMESTAMP(3),
    organization_id STRING,
    code STRING,
    event_count BIGINT,
    total_value DOUBLE
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:clickhouse://clickhouse:8123/default',
    'table-name' = 'hourly_usage'
);

INSERT INTO hourly_usage
SELECT
    TUMBLE_START(event_time, INTERVAL '1' HOUR) AS window_start,
    organization_id,
    code,
    COUNT(*) AS event_count,
    SUM(value) AS total_value
FROM events
GROUP BY
    TUMBLE(event_time, INTERVAL '1' HOUR),
    organization_id,
    code;

Performance characteristics:

| Metric | Value | Conditions |
| --- | --- | --- |
| Throughput (single node) | 1–5M events/sec | Stateless transformations |
| Throughput (stateful, RocksDB) | 100k–1M events/sec | Keyed state, checkpointing enabled |
| Latency (event-to-output) | 10–100 ms | With checkpointing; lower without |
| Checkpoint duration | 1–10 sec | Incremental, moderate state |
| State size supported | Terabytes | RocksDB backend, S3 checkpoints |
| Recovery time | 10–60 sec | From latest checkpoint |

Deployment options:

| Option | Ops Complexity | Cost | Notes |
| --- | --- | --- | --- |
| Self-managed (Kubernetes) | High | Low (compute only) | Full control, requires Flink expertise |
| Amazon Managed Flink (formerly Kinesis Data Analytics) | Low | Medium–High | Serverless, auto-scaling |
| Confluent Cloud (Flink) | Low | High | Integrated with Confluent Kafka |
| Ververica Platform | Medium | Medium | Commercial Flink management |

Kafka Streams

Kafka Streams is a client library for building stream processing applications on top of Apache Kafka. Unlike Flink (which is a separate distributed system), Kafka Streams runs as part of the application process — no separate cluster to deploy.

┌──────────────────────────────────────────┐
│ Application Process │
│ │
│ ┌────────────┐ ┌──────────────────┐ │
│ │ Kafka │ │ State Store │ │
│ │ Consumer │───▶│ (RocksDB/ │ │
│ │ │ │ In-Memory) │ │
│ └────────────┘ └──────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────┐ ┌──────────────────┐ │
│ │ Stream │ │ Kafka Producer │ │
│ │ Topology │───▶│ (output topics) │ │
│ └────────────┘ └──────────────────┘ │
└──────────────────────────────────────────┘
  • No separate cluster: Kafka Streams is a library embedded in the application JVM.
  • Scaling: Deploy multiple application instances. Kafka Streams uses Kafka’s consumer group protocol to distribute partitions across instances.
  • State stores: Backed by RocksDB (default) with changelog topics in Kafka for fault tolerance.

StreamsBuilder builder = new StreamsBuilder();

// Read from input topic
KStream<String, Event> events = builder.stream("billing-events");

// Stateless transformation: enrich events
KStream<String, EnrichedEvent> enriched = events.mapValues(event ->
    new EnrichedEvent(event, lookupCustomerTier(event.getCustomerId())));

// Stateful aggregation: count events per org per hour
KTable<Windowed<String>, Long> hourlyCounts = enriched
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
    .count(Materialized.as("hourly-counts"));

// Write aggregations to output topic
hourlyCounts.toStream()
    .map((windowedKey, count) -> KeyValue.pair(
        windowedKey.key(),
        new UsageAggregate(windowedKey.window().start(), windowedKey.key(), count)))
    .to("usage-aggregates");

| Feature | Kafka Streams | Flink |
| --- | --- | --- |
| State backend | RocksDB or in-memory | RocksDB or HashMapState |
| State persistence | Changelog topics in Kafka | Checkpoints to S3/HDFS |
| Recovery mechanism | Replay changelog topic | Restore from checkpoint + replay |
| Recovery time | Minutes (rebuilds from changelog) | Seconds (restores snapshot) |
| Standby replicas | Yes (num.standby.replicas) | Yes (via checkpoint) |
| State size | Limited by local disk | Terabytes (RocksDB + S3) |

Changelog topics: Every state mutation is logged to a Kafka topic (e.g., app-hourly-counts-changelog). On failure, the new instance replays the changelog to rebuild state. This can take minutes for large state stores.

Standby replicas: Configure num.standby.replicas=1 to maintain hot standby copies of state stores, reducing recovery time to seconds.
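
The recovery model is easy to picture: the state store is a key-value map and the changelog is a compacted log of mutations. A minimal sketch of the replay idea (not the Kafka Streams API):

```python
def rebuild_state(changelog):
    """Replay a changelog of (key, value) records to rebuild a state
    store; a None value is a tombstone that deletes the key, mirroring
    compacted-topic semantics."""
    state = {}
    for key, value in changelog:
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state
```

A standby replica keeps this replay continuously up to date on another instance, so failover can skip the potentially minutes-long rebuild.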

Kafka Streams supports exactly-once processing via Kafka’s transactional protocol:

Properties props = new Properties();
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG,
          StreamsConfig.EXACTLY_ONCE_V2); // Requires Kafka brokers 2.5+
  • How it works: Kafka Streams wraps each processing step (read → transform → state update → write) in a Kafka transaction. Either all changes commit atomically, or none do.
  • Throughput impact: ~10–30% reduction vs. at-least-once.
  • Limitation: Only guarantees exactly-once within the Kafka ecosystem. Writing to an external system (ClickHouse) is at-least-once unless the sink is idempotent.

| Dimension | Kafka Streams | Flink |
| --- | --- | --- |
| Deployment | Library in your app (no cluster) | Separate distributed system |
| Ops overhead | Low (just Kafka) | High (JobManager + TaskManagers) |
| Throughput | 10k–500k events/sec per instance | 100k–5M events/sec per node |
| State size | GBs (limited by local disk) | TBs (RocksDB + remote storage) |
| Windowing | Basic (tumbling, hopping, session) | Rich (custom triggers, late data) |
| Multi-source joins | Kafka topics only | Kafka, files, databases, sockets |
| SQL support | ksqlDB (separate component) | Flink SQL (built-in) |
| Best for | Simple enrichment, filtering, small aggregations | Complex event processing, large state |

For bxb: If the stream processing needs are limited to deduplication, enrichment, and basic aggregations, Kafka Streams is the simpler choice. If complex windowed joins, large-state computations, or multi-source processing is needed, Flink is the better fit.


AWS Kinesis Data Streams and Kinesis Firehose


AWS Kinesis provides a fully managed streaming platform as an alternative to self-managed Kafka. It consists of two primary services:

  • Kinesis Data Streams (KDS): A real-time data streaming service (analogous to Kafka topics).
  • Kinesis Data Firehose: A managed ETL/delivery service that loads streaming data into destinations (S3, Redshift, OpenSearch, HTTP endpoints).

Architecture:

Producers → Kinesis Data Stream (shards) → Consumers
├── Lambda
├── KCL Application
└── Firehose

Key concepts:

| Concept | Description |
| --- | --- |
| Shard | Unit of throughput capacity. Each shard: 1 MB/sec write, 2 MB/sec read. |
| Partition key | Determines shard assignment (hash-based). Analogous to Kafka partition key. |
| Retention | 24 hours default, up to 365 days (extended retention). |
| Enhanced fan-out | Dedicated 2 MB/sec per consumer per shard (vs. shared 2 MB/sec). |
| On-demand mode | Auto-scales shards based on throughput (no capacity planning). |

Capacity planning for bxb (10k events/sec):

| Mode | Shards Needed | Monthly Cost (estimate) |
| --- | --- | --- |
| Provisioned | 10 shards (at ~1k events/sec per shard, 1 KB avg) | ~$365/month |
| On-demand | Auto-scaled | ~$450/month (pay per GB ingested) |
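
The provisioned shard count falls out of two independent per-shard write limits, 1 MB/sec and 1,000 records/sec. A sketch of that calculation:

```python
import math

def shards_needed(events_per_sec, avg_event_bytes):
    """Provisioned Kinesis shards: the max of the byte-rate limit
    (1 MB/sec/shard) and the record-rate limit (1,000 records/sec/shard)."""
    by_bytes = math.ceil(events_per_sec * avg_event_bytes / 1_000_000)
    by_records = math.ceil(events_per_sec / 1_000)
    return max(by_bytes, by_records)

shards_needed(10_000, 1_000)  # 10 shards at 10k events/sec, 1 KB avg
```

Note that larger events hit the byte limit first: at 2 KB average, the same 10k events/sec would need 20 shards even though the record-rate limit needs only 10.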

vs. Kafka (self-managed):

| Dimension | Kinesis Data Streams | Kafka (self-managed) |
| --- | --- | --- |
| Ops overhead | Zero (fully managed) | High (broker management, ZooKeeper/KRaft) |
| Throughput per shard/partition | 1 MB/sec write | 10+ MB/sec per partition |
| Consumer model | KCL, Lambda, enhanced fan-out | Consumer groups, flexible |
| Retention | Up to 365 days | Unlimited (disk-bound) |
| Cost at 10k/sec | ~$400–600/month | ~$600–900/month (3 brokers) |
| Replay | Yes (by timestamp or sequence number) | Yes (by offset) |
| Ecosystem | AWS-native (Lambda, Firehose, Analytics) | Broader ecosystem (Connect, Streams, Flink) |

Firehose provides managed batch delivery from streaming sources to destinations — no code required for the delivery pipeline.

Architecture:

Kinesis Data Stream ──┐
├──▶ Firehose ──▶ S3 / Redshift / OpenSearch / HTTP
Direct PUT ──────────┘ │
├── Optional: Lambda transformation
├── Buffering (size or time)
└── Compression (gzip, snappy, zip)

Key features:

| Feature | Detail |
| --- | --- |
| Buffering | Buffer by size (1–128 MB) or time (60–900 seconds) before delivery |
| Transformation | Invoke Lambda for record transformation (enrich, filter, format) |
| Compression | gzip, Snappy, Zip, Hadoop-compatible Snappy |
| Format conversion | JSON → Parquet/ORC (using Glue Data Catalog schema) |
| Error handling | Failed records to S3 error bucket |
| Delivery guarantee | At-least-once |

ClickHouse delivery via Firehose:

Firehose does not have a native ClickHouse destination. Options:

  1. HTTP endpoint destination: Firehose delivers batches to an HTTP endpoint → write a small service that receives batches and inserts into ClickHouse.
  2. S3 + ClickHouse s3() table function: Firehose writes Parquet to S3 → ClickHouse reads the files on demand or on a schedule via the s3() table function.
  3. S3 + ClickHouse S3Queue engine: ClickHouse continuously polls an S3 prefix for new files and ingests them automatically.

-- ClickHouse S3Queue engine: auto-ingest from S3
CREATE TABLE events_s3_queue (
    organization_id String,
    transaction_id String,
    external_customer_id String,
    code String,
    timestamp DateTime64(3),
    properties String,
    value Nullable(Float64),
    decimal_value Nullable(Decimal128(18))
) ENGINE = S3Queue(
    'https://s3.amazonaws.com/billing-events/firehose/*',
    'JSONEachRow'
) SETTINGS
    mode = 'unordered',
    s3queue_polling_min_timeout_ms = 5000,
    s3queue_polling_max_timeout_ms = 30000;

-- Materialized view to move data into MergeTree
CREATE MATERIALIZED VIEW events_from_s3 TO events_raw AS
SELECT * FROM events_s3_queue;

Latency considerations:

| Path | End-to-End Latency |
| --- | --- |
| Firehose → S3 (60s buffer) → S3Queue (30s poll) | ~90–120 seconds |
| Firehose → HTTP endpoint (60s buffer) | ~60–90 seconds |
| Kafka → Consumer → ClickHouse (batch every 5s) | ~5–10 seconds |

Firehose adds significant latency due to its buffering model. For bxb’s 1–2 minute latency target, Firehose → S3 → S3Queue is marginal; Kafka → Consumer is well within budget.
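
These end-to-end figures are sums of worst-case stage delays. A sketch of the budget arithmetic; the final merge/insert delays are assumptions for illustration:

```python
def e2e_latency_sec(*stage_delays):
    """End-to-end latency as the sum of worst-case stage delays (seconds)."""
    return sum(stage_delays)

# Firehose buffer (60) + S3Queue poll (30) + assumed merge/insert (15)
firehose_path = e2e_latency_sec(60, 30, 15)
# Kafka consumer batch window (5) + assumed insert (5)
kafka_path = e2e_latency_sec(5, 5)
# firehose_path = 105, kafka_path = 10: one path consumes most of a
# 120-second budget, the other leaves ample headroom.
```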


Apache Pulsar

Apache Pulsar is a distributed messaging and streaming platform originally developed at Yahoo. It separates the serving layer (brokers) from the storage layer (Apache BookKeeper), enabling independent scaling of compute and storage.

┌──────────────────┐
│ Pulsar Broker │
│ (stateless) │
└──┬──────────┬───┘
│ │
┌────────▼┐ ┌───▼───────┐
│BookKeeper│ │BookKeeper │
│ Bookie 1 │ │ Bookie 2 │
└──────────┘ └───────────┘
│ │
┌──▼──────────▼──┐
│ Tiered Storage │
│ (S3 / GCS) │
└────────────────┘
  • Brokers (stateless): Handle produce/consume requests, topic lookup, load balancing. Easily scaled horizontally.
  • BookKeeper Bookies (stateful): Persist messages in a write-ahead log. Provide low-latency durable writes.
  • Tiered storage: Offload older segments to S3/GCS for cost-efficient long-term retention.

Pulsar has native multi-tenancy built into its topic hierarchy:

persistent://tenant/namespace/topic

| Level | Purpose | Example |
| --- | --- | --- |
| Tenant | Organizational boundary | billing-platform |
| Namespace | Logical grouping with shared policies | billing-platform/production |
| Topic | Individual message stream | billing-platform/production/events |

Isolation policies:

  • Per-namespace: retention, TTL, replication, backlog quota, schema enforcement.
  • Per-tenant: authentication, authorization, resource quotas.
  • Kafka comparison: Kafka has no native multi-tenancy. Isolation requires separate clusters or complex ACL configurations.

Pulsar’s tiered storage offloads older topic segments to object storage:

Hot data (recent): BookKeeper (SSD) → low latency, higher cost
Warm data (older): S3 / GCS → higher latency, much lower cost

Configuration:

  • Offload policy: by size (e.g., offload when topic exceeds 10 GB on BookKeeper) or by time (e.g., offload segments older than 24 hours).
  • Transparent reads: Consumers reading offloaded data see no API difference — Pulsar fetches from object storage transparently.
  • Cost benefit: Retain months/years of event data at S3 prices ($0.023/GB/month) instead of SSD prices ($0.10–0.20/GB/month).
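
The cost benefit is straightforward arithmetic. A sketch using the per-GB prices quoted above (the 10 TB retained volume is illustrative):

```python
def monthly_storage_cost(gb, price_per_gb_month):
    """Monthly cost of retaining `gb` of event data at a given price."""
    return gb * price_per_gb_month

hot = monthly_storage_cost(10_000, 0.15)    # 10 TB on SSD, mid-range price
warm = monthly_storage_cost(10_000, 0.023)  # same data offloaded to S3
# roughly $1,500/month hot vs. roughly $230/month warm
```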

Kafka comparison: Kafka added tiered storage via KIP-405 (early access in Kafka 3.6, production-ready in 3.9), but it is less mature than Pulsar’s implementation. Confluent Cloud offers it as a managed feature.

Pulsar supports synchronous and asynchronous geo-replication at the namespace level:

┌─────────────┐   async   ┌─────────────┐
│  Cluster A  │◄─────────▶│  Cluster B  │
│  (us-east)  │replication│  (eu-west)  │
└─────────────┘           └─────────────┘

Modes:

  • Async replication: Messages produced in Cluster A are asynchronously replicated to Cluster B. Sub-second replication lag under normal conditions.
  • Sync replication: Acknowledge produce only after replicated. Higher latency but stronger durability.
  • Active-active: Both clusters accept produces and replicate to each other. Requires conflict resolution for ordering.

Kafka comparison: Kafka geo-replication options:

  • MirrorMaker 2: Async replication between Kafka clusters. Works but limited (topic-level, no active-active).
  • Confluent Cluster Linking: Managed, lower latency, but Confluent-only.

Pulsar Functions (Lightweight Stream Processing)


Pulsar Functions provide serverless-style stream processing:

# Pulsar Function: enrich billing events
from datetime import datetime

from pulsar import Function

class EnrichEvent(Function):
    def process(self, event, context):
        # lookup_tier is an external helper (e.g. a cached CRM lookup)
        enriched = {
            **event,
            'customer_tier': lookup_tier(event['external_customer_id']),
            'enriched_at': datetime.utcnow().isoformat(),
        }
        return enriched
  • Deployment: Run as threads (in broker process), processes, or Kubernetes pods.
  • Comparison to Kafka Streams: Simpler (single-function model) but less powerful (no multi-step topologies, limited windowing).
  • Comparison to Flink: Much simpler but no support for complex stateful processing, large state, or sophisticated windowing.

| Dimension | Apache Pulsar | Apache Kafka |
| --- | --- | --- |
| Architecture | Separate compute (brokers) and storage (BookKeeper) | Brokers handle both compute and storage |
| Scaling | Scale brokers and storage independently | Scale brokers (both together) |
| Multi-tenancy | Native (tenant/namespace/topic) | Not native (requires ACLs, separate clusters) |
| Tiered storage | Mature, built-in | KIP-405 (Kafka 3.6+), less mature |
| Geo-replication | Built-in, namespace-level | MirrorMaker 2 or Confluent Cluster Linking |
| Consumer model | Exclusive, shared, failover, key-shared | Consumer groups only |
| Message ordering | Per-key (key-shared), per-subscription (exclusive) | Per-partition |
| Max throughput | ~1–2M messages/sec per topic | ~2–5M messages/sec per broker |
| Latency (p99) | 5–10 ms (publish) | 2–5 ms (publish) |
| Ecosystem maturity | Smaller, growing | Very large, battle-tested |
| Managed offerings | StreamNative Cloud, limited | Confluent, MSK, Redpanda, many |
| Community | Smaller, ASF-governed | Very large, industry-standard |

| Scenario | Pulsar Advantage |
| --- | --- |
| Multi-tenant SaaS | Native tenant isolation without cluster-per-tenant |
| Long retention (months/years) | Tiered storage to S3 at low cost |
| Geo-distributed workloads | Built-in async/sync geo-replication |
| Independent compute/storage scaling | Scale brokers without moving data |
| Mixed workloads (streaming + queuing) | Shared subscriptions for queue semantics |

For bxb: Kafka is the stronger choice today. bxb is single-tenant, doesn’t require geo-replication, and benefits from Kafka’s larger ecosystem (Kafka Connect, schema registry, Flink integration). Pulsar would be worth reconsidering if bxb needs to serve multiple tenants with isolated event streams or requires very long-term event retention at low cost.


  • Sub-second transformations: Events can be enriched, validated, and routed within milliseconds.
  • Streaming aggregations: Compute running totals, rates, and summaries without batch delays.
  • Immediate alerting: Detect anomalies (usage spikes, fraud patterns) as they happen, not minutes later.
  • Stream-stream joins: Correlate events across multiple streams (e.g., join API calls with billing events by customer ID within a time window).
  • Pattern detection: CEP (Complex Event Processing) detects sequences like “3 failed payments in 5 minutes” using Flink CEP or stateful operators.
  • Sessionization: Group related events into sessions based on activity gaps.
  • Watermarks: Flink’s watermark mechanism tracks event-time progress, allowing the system to handle out-of-order events gracefully.
  • Allowed lateness: Windows can accept late events up to a configurable threshold, recomputing aggregates as needed.
  • Side outputs: Very late events (beyond the allowed lateness) are routed to a side output for manual handling or separate processing.
  • For billing: Late-arriving usage events (e.g., from offline devices) are correctly attributed to the right billing period.
  • Horizontal scaling: Both Flink and Kafka Streams scale horizontally by adding instances.
  • Flink auto-scaling: Reactive mode adjusts parallelism based on backpressure (Kubernetes-based).
  • Partition-level parallelism: Processing parallelism is bounded by the number of Kafka partitions, allowing fine-grained control.
  • State partitioning: Stateful computations are distributed across nodes based on key, enabling linear scaling for most workloads.
  • Separation of concerns: The ingestion layer (Kafka), processing layer (Flink/Kafka Streams), and storage layer (ClickHouse) are independently scalable and replaceable.
  • Multiple outputs: A single stream processor can write to multiple sinks (ClickHouse for analytics, PostgreSQL for transactional records, S3 for archival).
  • Pipeline evolution: Add new processing steps (fraud detection, rate limiting) without modifying the ingestion API.

  • Flink cluster management: JobManager + TaskManagers + state checkpoints + monitoring. Requires dedicated expertise.
  • Failure modes: Checkpoint failures, state corruption, rebalancing delays, backpressure propagation.
  • Kafka Streams complexity: Simpler than Flink but still requires understanding of topology, state store recovery, and repartitioning.
  • Debugging: Distributed stateful stream processing is inherently harder to debug than batch processing. State inspection tools are immature.
  • Flink: Requires a dedicated cluster (or managed service). Adds JobManager, TaskManagers, ZooKeeper (or standalone HA), and checkpoint storage.
  • Infrastructure at 10k/sec:

| Component | Instances | Estimated Cost |
| --- | --- | --- |
| Flink JobManager | 1 (HA: 2) | ~$100–200/month |
| Flink TaskManagers | 2–4 | ~$400–800/month |
| Checkpoint storage (S3) | N/A | ~$5–10/month |
| Total Flink overhead |  | ~$500–1,000/month |

  • vs. simple Kafka consumer: bxb’s current batch consumer runs as a single lightweight process. Adding Flink increases infrastructure cost by ~$500–1,000/month with no throughput benefit for simple ingestion.
  • Flink: Java/Scala API, DataStream/Table API, checkpoint tuning, watermarks, state TTL, RocksDB tuning.
  • Kafka Streams: Topology DSL, state stores, GlobalKTable vs. KTable, timestamp extractors.
  • Operational knowledge: Understanding backpressure, checkpoint alignment, event-time vs. processing-time semantics.
  • Team impact: bxb’s backend is Python-based. Introducing Flink requires JVM expertise or using PyFlink (which has limitations — no support for certain state backends, fewer connectors).
  • At bxb’s current scale (10k/sec): The stream processing overhead is not justified for simple write-through ingestion.
  • Break-even point: Stream processing becomes cost-effective when:
    • Multiple downstream consumers need different transformations (avoid duplicate work).
    • Aggregations that would be expensive in ClickHouse can be pre-computed.
    • Real-time alerting requirements cannot be met by polling ClickHouse.

| Scale | Simple Consumer Cost | Flink Cost | Justification |
| --- | --- | --- | --- |
| 10k/sec | ~$50/month | ~$700/month | Not justified |
| 50k/sec | ~$100/month | ~$1,000/month | Only if processing needed |
| 100k/sec | ~$200/month | ~$1,500/month | Justified if 3+ consumers |

  • Within Kafka ecosystem: Exactly-once is well-supported (Kafka transactions + Flink checkpoints).
  • To external systems (ClickHouse): No native two-phase commit. Must rely on idempotent writes + ReplacingMergeTree deduplication.
  • Practical reality: Most production systems settle for “effectively-once” (at-least-once delivery + idempotent processing) rather than true exactly-once.
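
That "effectively-once" pattern is simple to sketch: at-least-once delivery plus an idempotent sink. A minimal Python illustration, with a last-write-wins map standing in for ReplacingMergeTree's key-based deduplication:

```python
class IdempotentSink:
    """Absorbs redelivered events: writes keyed by transaction_id are
    applied once, so at-least-once delivery behaves like exactly-once
    from the reader's point of view."""

    def __init__(self):
        self.rows = {}

    def write(self, event):
        # Last write wins per key, like ReplacingMergeTree after merges.
        self.rows[event["transaction_id"]] = event

sink = IdempotentSink()
for e in [{"transaction_id": "t1", "value": 5},
          {"transaction_id": "t1", "value": 5},   # redelivery after a retry
          {"transaction_id": "t2", "value": 7}]:
    sink.write(e)
# the duplicate t1 write collapses; two rows remain
```

The caveat with ReplacingMergeTree is that deduplication happens at merge time, so queries must tolerate (or FINAL-ize away) transient duplicates.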

When to Use Streaming

| Use Case | Why Streaming | Example |
| --- | --- | --- |
| Real-time dashboards | Sub-second metric updates | Live API usage monitoring |
| Fraud detection | Pattern matching on live events | Detect anomalous usage spikes |
| Rate limiting | Per-key rate computation in flight | Enforce API quotas in real-time |
| Complex event processing | Multi-stream joins, sessionization | Correlate API calls with billing events |
| Real-time alerting | Immediate response to thresholds | Alert when customer exceeds budget |
| Event enrichment (multi-source) | Join events with reference data streams | Add customer tier from CRM stream |

When to Use Batch (bxb’s Current Approach)

| Use Case | Why Batch | Example |
| --- | --- | --- |
| Billing aggregation | Hourly/daily billing cycles tolerate latency | Monthly invoice generation |
| Analytics and reporting | Dashboards refresh every 1–5 minutes | Usage analytics for customer portal |
| Cost-effective bulk processing | Simple consumer is 10x cheaper than Flink | Write-through to ClickHouse |
| Historical reprocessing | Replay Kafka topic, rebatch into ClickHouse | Fix a bug in aggregation logic |
| Simple data pipeline | No transformations needed between source and sink | Events pass through unchanged |
| Small team / limited JVM expertise | Kafka Streams / Flink require JVM knowledge | Python-based team |

Need real-time event processing (< 1 sec)?
├── YES: Need complex stateful processing (joins, windows, CEP)?
│   ├── YES: → Apache Flink
│   └── NO: Need to stay within Kafka ecosystem?
│       ├── YES: → Kafka Streams
│       └── NO: → Pulsar Functions or Lambda
└── NO: Latency 1–2 minutes acceptable?
    ├── YES: Need event replay capability?
    │   ├── YES: → Kafka → Simple Consumer → ClickHouse (bxb's choice)
    │   └── NO: → Direct ClickHouse ingestion or PostgreSQL
    └── NO: (1 sec – 2 min range)
        → Kafka → Consumer with smaller batch windows

For systems that need both real-time and batch capabilities:

                              ┌──────────────────────┐
                         ┌───▶│  Flink (real-time)   │──▶ Alerts / Dashboards
                         │    │  - fraud detection   │
                         │    │  - rate limiting     │
API Server ──▶ Kafka ────┤    └──────────────────────┘
                         │    ┌──────────────────────┐
                         └───▶│  Batch Consumer      │──▶ ClickHouse
                              │  (bulk write)        │    (analytics / billing)
                              └──────────────────────┘

This “lambda-lite” architecture uses Kafka as the single source of truth, with multiple consumers serving different latency requirements. bxb could adopt this incrementally: start with the batch consumer (current plan), add Flink later for specific real-time use cases.


bxb’s chosen architecture is API → Kafka → Simple Consumer → ClickHouse, which is a batch ingestion pattern with a message broker for durability and decoupling. This is the right choice for the current requirements:

  • 10k events/sec target: A simple Kafka consumer can handle this easily.
  • 1–2 minute latency acceptable: No need for sub-second processing.
  • Cost-effectiveness: A batch consumer costs ~$50/month vs. ~$700+/month for Flink.
  • Team skill set: Python-based team; no JVM expertise required for simple consumer.

Streaming ingestion should be reconsidered when:

  1. Real-time requirements emerge: If bxb needs sub-second event processing (e.g., real-time fraud detection, live rate limiting, instant budget alerts), a stream processor becomes necessary.
  2. Multiple downstream consumers: If 3+ consumers need different transformations of the same event stream, a stream processor centralizes the logic and avoids duplicating work.
  3. Complex event correlation: If billing requires joining events across multiple streams or detecting multi-event patterns, Kafka Streams or Flink is needed.
  4. Scale beyond 50k/sec with processing: At very high scale, pre-aggregating in Flink reduces the write load on ClickHouse and the cost of aggregation queries.

| Phase | Scale | Architecture | Stream Processing |
| --- | --- | --- | --- |
| Current | 10k/sec | Kafka → Consumer → ClickHouse | None (batch consumer) |
| Phase 2 | 10–50k/sec | Same + more partitions/consumers | Kafka Streams for simple enrichment |
| Phase 3 | 50–100k/sec | Same + Flink for pre-aggregation | Flink for windowed aggregation |
| Phase 4 | 100k+/sec | Multi-cluster Kafka + Flink | Flink for complex processing + routing |

Don’t add stream processing until you need it. The operational and cost overhead of Flink or Kafka Streams is not justified for bxb’s current use case (simple write-through ingestion). Kafka’s consumer API with batch writes to ClickHouse is sufficient, cost-effective, and operationally simple. When real-time processing needs arise, Kafka Streams (for simple cases) or Flink (for complex cases) can be added incrementally without changing the upstream architecture.