Building a Modern Data Stack with Open-Source Technologies

About This Talk

At DoubleCloud 2022, I shared the experience of building a modern analytical infrastructure at Toloka AI using entirely open-source components. The talk covered the architecture decisions, operational lessons, and practical trade-offs of choosing open-source over managed services.

  • Author: Evgeny Ermakov
  • Category: Conference Talk
  • Read Time: 2 min read
  • Last updated: March 15, 2022

Key Ideas

The Open-Source Data Stack: The architecture centered on Kafka for data ingestion and event streaming, ClickHouse for analytical storage and fast queries, and custom orchestration connecting the components. We chose each tool for its specific strengths rather than adopting a single vendor’s ecosystem.

ClickHouse in Production: Running ClickHouse at scale requires attention to cluster topology, replication strategy, sharding scheme, and materialized view design. The talk covered practical lessons: how we designed our table schemas, managed data lifecycle, and used materialized views for real-time aggregation without impacting query performance.
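
As an illustration of the schema, lifecycle, and materialized view points above, here is a minimal sketch that builds three ClickHouse DDL statements from Python. The table names (events, events_per_minute), the columns, and the 90-day retention are hypothetical and are not taken from the talk. The pattern itself (a MergeTree table with a TTL, plus a materialized view writing into a SummingMergeTree rollup) is standard ClickHouse, but it is only one possible design.

```python
def build_clickhouse_ddl(retention_days: int = 90) -> list:
    """Return DDL for a raw table, a rollup table, and the view linking them.

    All table and column names here are hypothetical examples.
    """
    # Raw events: partitioned by month, expired by TTL (data lifecycle).
    raw_table = f"""
CREATE TABLE events (
    event_time DateTime,
    user_id UInt64,
    event_type LowCardinality(String)
) ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, event_time)
TTL event_time + INTERVAL {retention_days} DAY
""".strip()

    # Rollup table: SummingMergeTree sums `events` for rows sharing a key.
    rollup_table = """
CREATE TABLE events_per_minute (
    minute DateTime,
    event_type LowCardinality(String),
    events UInt64
) ENGINE = SummingMergeTree
ORDER BY (event_type, minute)
""".strip()

    # Materialized view: aggregates each inserted block into the rollup.
    view = """
CREATE MATERIALIZED VIEW events_per_minute_mv TO events_per_minute AS
SELECT
    toStartOfMinute(event_time) AS minute,
    event_type,
    count() AS events
FROM events
GROUP BY minute, event_type
""".strip()

    return [raw_table, rollup_table, view]


if __name__ == "__main__":
    for statement in build_clickhouse_ddl():
        print(statement + ";\n")
```

Because the aggregation runs at insert time, dashboards can read the small rollup table instead of scanning raw events. Queries against the rollup should still use sum(events) with GROUP BY, since SummingMergeTree only collapses rows during background merges.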

Kafka as the Data Backbone: Kafka served as the central nervous system of the data platform. We covered topic design principles, partitioning strategies for parallelism, consumer group management, and the Kafka-ClickHouse integration via the Kafka table engine, which enables near-real-time data availability.
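
The Kafka table engine integration mentioned above can be sketched as follows. This is a minimal example under stated assumptions, not the configuration used at Toloka: the topic, broker address, consumer group, columns, and target table are all hypothetical, and the target table is assumed to exist already. The settings shown (kafka_broker_list, kafka_topic_list, kafka_group_name, kafka_format, kafka_num_consumers) are real ClickHouse Kafka engine settings.

```python
def kafka_ingest_ddl(topic: str, brokers: str, group: str,
                     target_table: str, num_consumers: int = 1) -> list:
    """Return DDL for a Kafka engine table and the view that drains it.

    Names and columns are hypothetical; `target_table` must already exist
    with matching columns.
    """
    # The Kafka engine table is a consumer, not storage: reading from it
    # advances the consumer group's offsets.
    queue = f"""
CREATE TABLE {topic}_queue (
    event_time DateTime,
    user_id UInt64,
    event_type String
) ENGINE = Kafka
SETTINGS kafka_broker_list = '{brokers}',
         kafka_topic_list = '{topic}',
         kafka_group_name = '{group}',
         kafka_format = 'JSONEachRow',
         kafka_num_consumers = {num_consumers}
""".strip()

    # The materialized view continuously moves consumed rows into the
    # target table, where they become queryable.
    view = f"""
CREATE MATERIALIZED VIEW {topic}_consumer TO {target_table} AS
SELECT event_time, user_id, event_type
FROM {topic}_queue
""".strip()

    return [queue, view]


if __name__ == "__main__":
    for statement in kafka_ingest_ddl(
        "events", "kafka:9092", "clickhouse_events", "events", num_consumers=4
    ):
        print(statement + ";\n")
```

Setting kafka_num_consumers above the topic's partition count brings no benefit, because each partition is read by at most one consumer in a group. This is where partitioning choices and ingestion parallelism meet.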

Operational Maturity: Building the stack was the easy part. The talk dedicated significant time to operational concerns: monitoring strategies, alerting configurations, capacity planning, schema evolution processes, and the organizational practices that keep an open-source stack running reliably.
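
One monitoring signal that fits a Kafka-based pipeline is consumer lag. The sketch below is illustrative and is not the alerting setup described in the talk. It assumes offsets have already been collected into plain dictionaries keyed by partition number, and it treats a partition with no committed offset as fully unconsumed. The function names and the threshold are hypothetical.

```python
def partition_lag(end_offsets: dict, committed: dict) -> dict:
    """Lag per partition: messages written but not yet consumed.

    Assumption: a partition missing from `committed` counts from offset 0.
    """
    return {
        partition: max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def lagging_partitions(end_offsets: dict, committed: dict, threshold: int) -> list:
    """Return the partitions whose lag exceeds `threshold`, sorted by number."""
    lag = partition_lag(end_offsets, committed)
    return sorted(p for p, behind in lag.items() if behind > threshold)


if __name__ == "__main__":
    end = {0: 1000, 1: 1500, 2: 900}
    done = {0: 990, 1: 700, 2: 900}
    print(partition_lag(end, done))
    print(lagging_partitions(end, done, threshold=100))
```

A check like this would typically run on a schedule and feed an alerting system. Sustained lag on a single partition often points to a skewed partition key, while lag on every partition suggests the consumer side is under-provisioned.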

Why It Matters

The modern data stack doesn’t have to mean expensive SaaS tools. With the right engineering investment, open-source technologies like ClickHouse and Kafka can deliver enterprise-grade analytics capabilities at a fraction of the cost, while providing full control over your data infrastructure.

Watch

Watch the full talk on YouTube →
