How Telemetry Systems Evolve with Infrastructure: Example Architectures from Startup to Enterprise
In our previous post, we introduced the fundamentals of telemetry—covering logs, metrics, traces, and security monitoring. In this follow-up, we’re shifting from theory to practice: what do telemetry stacks actually look like in real-world environments?
The answer depends heavily on infrastructure. In reality, infrastructure decisions come first, and the telemetry stack adapts to support what’s already in place—not the other way around. Observability evolves as a response to growing scale, complexity, and operational maturity.
In this guide, we walk through example telemetry architectures used by companies at different stages: from lean startups to global enterprises. These are not one-size-fits-all blueprints, but rather representative patterns that teams can adapt to fit their own technical and organizational context.
Small Startups: Simplicity First
Early-stage startups are focused on speed and iteration. Engineering teams are small, budgets are tight, and the priority is finding product-market fit—not managing infrastructure. To reduce operational overhead, most teams opt for fully managed platforms such as AWS Lambda or AWS ECS. These services abstract away most of the infrastructure concerns, allowing teams to ship quickly.
Because the stack relies on AWS-managed services, observability is handled using AWS-native tools. Logs from Lambda and ECS are automatically forwarded to CloudWatch. This provides a minimal but functional centralized logging solution with zero setup overhead.
For metrics, it’s common to integrate a SaaS provider like Datadog. Applications push metrics directly to Datadog, which handles dashboarding, alerts, and visualization out of the box. This avoids the complexity of running a monitoring system in-house.
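As a rough illustration, here is a minimal sketch of pushing a custom metric to Datadog from Python using the official datadog client's HTTP API. The metric name, tags, and environment variable names are placeholders, not anything prescribed by Datadog or by the stack described here.

```python
# Sketch: pushing a custom metric to Datadog over its HTTP API.
# Assumes the `datadog` package is installed and that API/app keys are
# supplied via environment variables; metric name and tags are placeholders.
import os
import time

from datadog import initialize, api

initialize(
    api_key=os.environ["DD_API_KEY"],   # placeholder env var names
    app_key=os.environ["DD_APP_KEY"],
)

# Report a single gauge point with the current timestamp.
api.Metric.send(
    metric="checkout.orders.created",        # placeholder metric name
    points=[(int(time.time()), 1)],
    tags=["env:prod", "service:checkout"],   # placeholder tags
)
```

In practice, many teams route metrics through a locally running Datadog agent (DogStatsD) instead, which batches and buffers on the application's behalf, but the direct-push approach keeps the early-stage setup as simple as possible.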
At this stage, distributed tracing is usually unnecessary. The architecture is simple—often monolithic—and most issues can be diagnosed with logs and metrics alone. Similarly, a full SIEM (Security Information and Event Management) system is rarely present, as security teams have not yet been formed.
This is a lean, low-friction setup that prioritizes speed and simplicity over depth or extensibility.
Medium-Sized Startups: Growing into Kubernetes
As startups grow and stabilize, their infrastructure tends to evolve alongside increasing product scope and team size. Many teams begin adopting Kubernetes—typically via Amazon EKS—not because they need it for telemetry, but to manage a growing number of services and environments in a more standardized way.
The migration to Kubernetes is often motivated by the need to better organize development, staging, and production environments, improve deployment workflows, and gain more precise control over resource usage.
Once applications are running in Kubernetes, telemetry services are often co-located in the cluster. A common example is using Fluent Bit inside the EKS cluster to collect logs. Applications emit logs to SQS, and Fluent Bit pulls from SQS, processes the log entries, and forwards them to either Loki or Elasticsearch. For visualization, Grafana is used with Loki, and Kibana is used with Elasticsearch.
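To make the first hop of that pipeline concrete, here is a minimal sketch of an application pushing a JSON log record onto the SQS queue that Fluent Bit later drains. The queue URL, region, and field names are placeholder assumptions, and standard AWS credentials are assumed to be available.

```python
# Sketch: emitting a structured log record to SQS for a collector
# (e.g. Fluent Bit) to pick up. Assumes boto3 and standard AWS credentials;
# the queue URL, region, and field names are placeholders.
import json
import time

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/app-logs"  # placeholder
sqs = boto3.client("sqs", region_name="us-east-1")

def emit_log(level, message, **fields):
    """Serialize one log record as JSON and push it onto the log queue."""
    record = {"timestamp": time.time(), "level": level, "message": message, **fields}
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(record))

emit_log("info", "order created", service="checkout", order_id="1234")
```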
Metrics are still frequently handled via Datadog, especially if it was already in use. Teams may continue pushing metrics to Datadog due to its ease of use and mature dashboards, though concerns about cost begin to emerge as usage grows.
At this stage, distributed tracing is still generally considered optional. Most systems remain understandable through logs and metrics alone, and tracing is usually reserved for high-priority or high-latency services.
A dedicated SIEM system is still uncommon, although some teams begin logging security-relevant events in preparation for audits or future compliance needs.
This is a transitional phase—Kubernetes has become the backbone of the infrastructure, and the telemetry stack evolves to fit naturally into that platform.
Large Startups: Scaling with Structure
In larger startups—often those preparing to support enterprise clients or pursue compliance with standards like SOC 2 or ISO 27001—the infrastructure becomes more distributed and regulated. It’s common to see multiple EKS clusters, EC2 instances, SQS for message handling, and early investments in security tooling.
Dedicated platform and security teams begin to emerge, and infrastructure becomes more formalized and segmented by environment or region.
By now, telemetry is no longer an afterthought—it’s an essential operational component. Logs are still emitted to SQS, then processed by Fluent Bit running in EKS. These are shipped to either Loki or Elasticsearch, depending on the team’s preference. Grafana and Kibana serve as the primary interfaces for developers and SREs.
Metrics collection has often shifted away from SaaS platforms. Applications expose Prometheus-compatible endpoints, and a Prometheus server scrapes these metrics on a defined schedule. Grafana continues to handle the presentation layer, and teams gain more flexibility over alerting and cost control.
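For reference, exposing a Prometheus-compatible endpoint from a Python service takes only a few lines with the prometheus_client library. The metric names, labels, and port below are illustrative, not prescriptive.

```python
# Sketch: exposing a Prometheus-compatible /metrics endpoint with the
# prometheus_client library. Metric names, labels, and the port are placeholders.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Requests handled", ["endpoint"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://<pod-ip>:8000/metrics
    while True:
        with LATENCY.time():          # records how long the simulated work takes
            time.sleep(random.uniform(0.01, 0.1))
        REQUESTS.labels(endpoint="/checkout").inc()
```

In Kubernetes, the scrape target is typically discovered through pod annotations or a ServiceMonitor rather than configured by hand, so the application only needs to expose the endpoint.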
Distributed tracing often becomes a critical tool for understanding latency and service dependencies. Many teams begin with a SaaS approach using Datadog’s tracing SDK. Others may opt for open-source solutions like Tempo or Jaeger. Tempo integrates tightly with Grafana and offers native support for trace correlation with logs and metrics, while Jaeger provides mature support for trace collection, storage, and visualization through its dedicated UI.
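As a sketch of the SaaS route, manual instrumentation with Datadog's ddtrace library looks roughly like the following. The service, resource, and span names are placeholders, and a reachable Datadog agent is assumed.

```python
# Sketch: manual spans with Datadog's ddtrace library. Assumes the ddtrace
# package is installed and a Datadog agent is reachable; service, resource,
# and span names are placeholders.
from ddtrace import tracer

def fetch_invoice(invoice_id):
    # Each tracer.trace block becomes a span in the resulting trace.
    with tracer.trace("invoice.fetch", service="billing", resource="GET /invoice"):
        with tracer.trace("invoice.db_query"):
            # ... database lookup would go here ...
            return {"id": invoice_id, "status": "paid"}

fetch_invoice("inv-42")
```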
SIEM capabilities start to solidify. Logs from EKS clusters are forwarded to platforms like Sumo Logic, allowing security teams to begin formal threat detection and incident response workflows.
This stack represents a shift toward observability as a first-class part of infrastructure—optimized not only for uptime, but also for debugging, auditing, and security.
Enterprise Environments: Observability at Scale
Enterprise organizations operate at a different scale entirely. Infrastructure spans multiple AWS accounts, regions, and environments, and often includes hybrid deployments across cloud and on-prem. Common components include Kafka clusters for streaming, dozens of EKS clusters, EC2 fleets, and region-specific services for data locality and compliance.
The infrastructure is built for resilience, scale, and auditability, and typically includes dedicated teams for DevOps, security, networking, compliance, and SRE.
Kafka becomes a central piece of the observability pipeline. Applications emit logs to Kafka topics. Grafana Alloy agents, deployed within EKS clusters, consume those logs and ship them into Loki for storage. Grafana is used for querying, dashboards, and alerting.
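A minimal sketch of the producing side, assuming the kafka-python client: the application serializes each log record as JSON and publishes it to a shared topic. The broker addresses, topic name, and fields below are placeholders.

```python
# Sketch: an application publishing structured log events to a Kafka topic,
# where agents such as Grafana Alloy can consume them and forward to Loki.
# Assumes the kafka-python package; brokers, topic, and fields are placeholders.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],  # placeholder brokers
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

def log_event(level, message, **fields):
    """Publish one JSON log record to the shared logs topic."""
    producer.send("app-logs", {
        "timestamp": time.time(),
        "level": level,
        "message": message,
        **fields,
    })

log_event("error", "payment declined", service="payments", order_id="1234")
producer.flush()  # ensure buffered records are delivered before exit
```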
Metrics follow a similar pipeline. Instead of pushing metrics directly to Prometheus, applications write them to Kafka topics. Kafka Connect forwards these metrics to an HTTP endpoint where Prometheus can scrape them. Grafana remains the frontend, offering a unified view across teams and services.
For distributed tracing, most enterprise teams lean into fully managed or self-hosted open-source solutions. Services are instrumented using the OpenTelemetry SDK. Traces are collected by a Jaeger or Tempo backend—depending on whether the team prefers Jaeger’s UI or Grafana-native workflows. Both solutions offer support for large-scale trace ingestion, and either can be integrated with existing logging and metrics pipelines.
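A minimal OpenTelemetry setup in Python might look like the following sketch. The OTLP endpoint is a placeholder; it would point at whatever collector, Tempo, or Jaeger ingestion address the platform team exposes.

```python
# Sketch: instrumenting a service with the OpenTelemetry SDK and exporting
# spans over OTLP. Requires the opentelemetry-sdk and
# opentelemetry-exporter-otlp packages; endpoint and names are placeholders.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)  # placeholder
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "1234")  # placeholder attribute
    # ... business logic would go here ...
```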
SIEM systems are fully integrated. Logs may be processed locally for sensitive workloads and mirrored to cloud-based SIEM platforms for centralized threat detection, correlation, and compliance monitoring.
In enterprise environments, the telemetry stack becomes part of the platform fabric. It’s no longer just about visibility—it’s about control, governance, and enabling distributed teams to operate with confidence.
Common Techniques Across All Stages
While infrastructure varies dramatically between company sizes, a few telemetry practices apply broadly.
Structured logging is one of the most important. By logging in a structured format like JSON, you eliminate the need for complex regular expressions and enable downstream systems to parse, index, and query logs more efficiently. For implementation examples, check out our structured logging in Python guide.
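As a quick taste of the idea (the linked guide goes deeper), a minimal JSON formatter using only the standard library might look like this. The field names follow one common convention rather than any required schema.

```python
# Sketch: structured JSON logging with only the standard library. Libraries
# such as python-json-logger provide the same idea with less boilerplate.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created")  # -> {"timestamp": "...", "level": "INFO", ...}
```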
Sidecar containers are another common pattern, especially in Kubernetes. They buffer telemetry data, offload processing from the main application, and communicate with it over shared memory or the loopback interface, keeping telemetry work out of the request path.
Regular expressions still have a place, particularly when you don’t control the source code of the system generating logs. For example, when ingesting logs from legacy systems or third-party services, you often have no choice but to parse unstructured output. In these cases, regex becomes a tool of necessity. However, it should be used with precision: always anchor patterns with ^ and $ to reduce false positives and improve matching performance. And where possible, migrate toward structured log formats for greater long-term flexibility.
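As an illustration, here is a small sketch of parsing a hypothetical legacy log line with an anchored pattern. The log format and field names are assumptions for the example only.

```python
# Sketch: parsing a hypothetical legacy log line with an anchored pattern.
import re

# ^ and $ anchor the pattern so stray partial matches are rejected outright.
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] "
    r"(?P<message>.*)$"
)

def parse_line(line):
    """Return the extracted fields, or None if the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    return match.groupdict() if match else None

print(parse_line("2024-05-01 12:34:56 [ERROR] connection refused"))
# {'timestamp': '2024-05-01 12:34:56', 'level': 'ERROR', 'message': 'connection refused'}
```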
Conclusion
The architectures in this post represent a typical progression, but not a universal one. Your company’s path to observability will depend on many factors—team size, regulatory requirements, budget constraints, engineering culture, and the complexity of your systems.
In some cases, you might adopt Kubernetes early and deploy telemetry alongside it. In others, you might stick with fully managed services for years and only introduce Prometheus or Jaeger when absolutely necessary. There’s no “correct” order—just trade-offs that align (or don’t) with your context.
What we’ve shared here are practical examples of how telemetry systems tend to evolve in response to infrastructure growth, not as a driving force behind it. Logs, metrics, and traces are layered in gradually, based on what the system and the teams operating it actually need.
Use these examples as a reference, not a rulebook. Start small, adapt as you grow, and build the telemetry stack that makes sense for your systems.
Happy engineering!