AI & Data Infrastructure Architect · Distributed Systems · Cloud Platforms · Security Analytics · Scalable ML

Make AI, cloud, and data infrastructure ROI visible.

I help teams reduce infrastructure waste, control observability and data-platform costs, and build systems that scale without costs becoming unpredictable.

Daniel Brener
AI & Data Infrastructure Architect

The real cost of invisible infrastructure

Most waste is invisible until it becomes a budget problem.

Five infrastructure layers, all flowing into the same bill. Width shows current cost share. Animation speed reflects growth rate. Most teams know the total. Few know which layer is accelerating.

Infrastructure cost breakdown (width = cost share, speed = growth rate):

Observability: 38% · ↑ 280%
AI Runtime: 24% · ↑ 410%
Data Platform: 18% · ↑ 85%
Kubernetes: 12% · stable
Storage: 8% · ↑ 190%

Telemetry — mostly noise

~85% of ingested telemetry is never operationally queried.

Gauge scale: waste · noisy · marginal · efficient · optimal
Typical observability setup: 18%, in the waste zone

AI runtime — the model isn't the cost

Context, retrieval, traces, and storage usually exceed model inference spend.

model inference (actual compute): 12%
context / prompt (token accumulation): 29%
retrieval / RAG (vector + rerank): 21%
traces / spans (observability): 20%
embedding storage (growing silently): 18%

model inference = 12% of total AI cost

Cloud bill — can't attribute most of it

Most teams can't map infrastructure spend to workloads, features, or teams.

Attribution grid: 10 cells attributed (22%), 35 unknown
Labeled cells: k8s/prod, kafka, ai-train, clickhs, api-gw

Data pipelines — low-value event load

Retries, heartbeats, and debug events compound storage and compute cost.

actual product events: 22%
retries / duplicates: 31%
debug / internal state: 28%
heartbeats / misc: 19%

<25% of pipeline volume is actual product data

System scale — cost visibility lags behind

By the time the bill is a problem, the architecture is expensive to change.

Focus areas

Where systems get expensive, noisy, or hard to control.

Observability cost & ingestion design

Sampling, cardinality reduction, retention tiers. Reduce ingestion without losing operational coverage.
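
As a rough illustration of the cardinality part, the sketch below counts distinct values per label and flags the labels that exceed a threshold. The function names, the threshold, and the sample series are all hypothetical and not tied to any particular metrics backend; it assumes label sets can be exported as plain dictionaries.

```python
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values for each (metric, label) pair.

    `series` is an iterable of (metric_name, labels_dict) tuples.
    This input shape is an assumption for illustration.
    """
    values = defaultdict(set)
    for metric, labels in series:
        for key, value in labels.items():
            values[(metric, key)].add(value)
    return {pair: len(vals) for pair, vals in values.items()}


def high_cardinality(series, threshold):
    """Return (pair, count) entries above `threshold`, largest first."""
    counts = label_cardinality(series)
    flagged = [(pair, n) for pair, n in counts.items() if n > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample: `user_id` grows with traffic, `route` does not.
sample = [
    ("http_requests_total", {"route": "/cart", "user_id": "u1"}),
    ("http_requests_total", {"route": "/cart", "user_id": "u2"}),
    ("http_requests_total", {"route": "/pay", "user_id": "u3"}),
]
flagged = high_cardinality(sample, threshold=2)
```

Labels flagged this way are candidates for dropping, bucketing, or moving to traces or logs; whether that is safe depends on which queries and alerts actually use them.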

AI infrastructure overhead

Context management, retrieval cost, trace volume, embedding storage. Most of it isn't the model.

Data platform efficiency

Kafka, Flink, Spark, ClickHouse — ingestion design, backpressure, storage cost, event value.

Infrastructure cost attribution

Map cloud, Kubernetes, data, and AI spend to workloads, teams, and features.
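
A minimal sketch of what attribution can look like, assuming billing line items can be exported with ownership labels attached. The field names (`cost`, `labels`, `team`), the `unattributed` bucket, and the sample amounts are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def attribute_costs(line_items, label="team"):
    """Sum cost per owner; items missing the label go to 'unattributed'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("labels", {}).get(label, "unattributed")
        totals[owner] += item["cost"]
    return dict(totals)


def attributed_share(totals):
    """Fraction of total spend that maps to a known owner."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return (total - totals.get("unattributed", 0.0)) / total


# Hypothetical line items; resource names and amounts are made up.
items = [
    {"resource": "k8s/prod", "cost": 120.0, "labels": {"team": "platform"}},
    {"resource": "kafka", "cost": 80.0, "labels": {"team": "data"}},
    {"resource": "vm-1", "cost": 300.0, "labels": {}},
]
totals = attribute_costs(items)
share = attributed_share(totals)
```

The same grouping works for any label (workload, feature, environment). The size of the unattributed bucket is usually the first number worth tracking.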

Security telemetry correlation

Normalize and correlate findings so teams prioritize real risk rather than ingesting alert volume.

Cost & architecture

Same pattern, different system.

Most of these problems are predictable once you've seen them a few times.

01

Most teams collect far more telemetry than they operationally use.

02

The expensive part usually isn't the model. It's the context, traces, and retry loops around it.

03

Retention decisions get deferred until storage is already a cost problem.

04

Observability is a data architecture problem, not a tooling choice.

05

Many teams can scale infrastructure faster than they can explain the bill.

06

If you can't trace cost to a workload, you're guessing at what to fix.

Stack

Tools, not the strategy.

The value is owning the telemetry model and cost structure — not the tools.

OpenTelemetry · ClickHouse · Grafana · Loki · Tempo · VictoriaMetrics · Kafka · Kubernetes · Postgres · Redis · Go · Python

Representative outcomes

What this work looks like.

No invented clients, no inflated numbers.

Identified high-cardinality telemetry causing disproportionate ingestion cost — reduced volume without operational coverage gaps.

Redesigned pipeline ingestion and backpressure — improved throughput while reducing compute and storage overhead.

Built workload-level cost attribution across cloud and Kubernetes infrastructure.

Mapped AI runtime cost across context, retrieval, traces, and storage.

Improved signal correlation across operational, security, and infrastructure data.

Reduced agent orchestration overhead by redesigning context management and tool-call patterns.

About

Daniel Brener

AI & Data Infrastructure Architect focused on distributed systems, cloud platforms, security analytics, and scalable ML infrastructure. Most of my work involves making infrastructure economics readable: finding where systems become expensive, noisy, or hard to control, and designing toward something clearer.

Connect

LinkedIn

linkedin.com/in/daniel-b-7b297a62

Primary domains

Distributed systems & cloud platforms
Security analytics & telemetry
ML infrastructure & AI runtime
Observability economics
Data platform architecture

Infrastructure tends to get expensive in specific, predictable ways.

I'm interested in practical conversations around cloud cost visibility, observability economics, AI infrastructure overhead, and data-platform efficiency. No pitch — just an exchange about the specific problem you're working on.

Connect on LinkedIn: linkedin.com/in/daniel-b-7b297a62