How our stack evolved - Datadog
Jan 21, 2020
By Geoffrey Keating
In Elad Gil’s excellent book, High Growth Handbook, he observes that if you’re going through hypergrowth, you will have a different company every 6-12 months.
The same is true of your tech stack.
The tools that worked when you first launched your company inevitably stretch (and sometimes break) as your needs change at scale. When headcount and your userbase are growing rapidly, it’s only natural to explore more specialized additions to your stack. Maybe after writing all your initial code in Python, you find that a polyglot approach is required so you incorporate Go for massive data-processing tasks or Scala for streaming.
It’s not that one is superior to the other. You simply have to pick the right tool for that particular point in time.
This is exactly how the situation played out at Datadog, the cloud monitoring service. Since its founding in 2010, the company has grown to an $11.48 billion market cap and more than 1,500 employees. Datadog processes trillions of data points a day, so it’s essential they have the latest and greatest tech stack to show real-time performance for their customers.
We caught up with Ilan Rabinovitch, VP of Product & Community, for a chat about why selecting the right tools isn’t about finding a one-size-fits-all solution as much as it’s about being able to nimbly adjust to the changing needs of your business — and the evolving nature of the SaaS landscape at large.
Ilan himself has had an interesting career; now entering his fifth year at Datadog, he’s worn multiple hats at the company (including Director of Product Management and Director of Technical Community & Evangelism). He’s also extremely involved in the Linux community, as the co-founder of the Texas Linux Fest and the Conference Chair of the Southern California Linux Expo.
This is the second in a series of real-life stories from leaders who have seen their stacks evolve in extraordinary ways. (In December, we heard from Frame.io's VP of Growth & Analytics, Kyle Gesuelli, about how they evaluate key technology choices.)
This week, Ilan walks us through:
What to consider when choosing whether to build or buy
How to know whether today’s tool will work tomorrow
Datadog’s top criteria for deciding whether to adopt a new solution
How to handle cloud optimization
Check out the interview below.
Geoffrey: You’ve been at Datadog for four or five years now; when you arrived, it was quite a different company to what it is now. What did your tech stack or tooling look like when you arrived?
Ilan: I imagine Datadog's story is fairly similar to Segment's. We built solutions that worked for us at a particular point in time, and then as we've grown, we've ingested massive amounts of data, which has grown exponentially. We're constantly rethinking both implementation and architectural decisions for the scale of today plus months or years. The scale of today might be 10x tomorrow or 100x. So the decisions we make today need to be flexible.
The data stores we picked 10 years ago are not the data stores we're using today. In many cases, we've gotten to the point where we're developing our own proprietary data stores for our time series data or for other contextual data that store customer-submitted telemetry, so we can hone in on the specific characteristics we need.
But the path that almost everything at Datadog takes is that we start with something off the shelf and then as it scales and as the need arises, we customize it, we switch it around, and we turn it into something more specific to our needs.
A conundrum: Should you build your own solution or buy one?
Geoffrey: It's interesting you mentioned that, because what I'm hearing from a lot of people I'm interviewing is this build-versus-buy tension. Some companies philosophically and culturally are very much on the build side, while others who are earlier in their life span – series B or series C – seem to be very much in the buy camp. Does Datadog have a particular stance or opinion on that matter?
Ilan: I don't know if there's an official stance or position. We generally want to pick the best tool for the job. You look at that both in terms of actually solving the customer need or the technical need, but also in terms of: How quickly can we get something to market? How operationally resilient is it?
Often we're going to start with off-the-shelf components. That might be something open-source, that might be something we buy. And then over time we re-evaluate and improve upon it. This is where the open-source world is key for us, for our customers, and for others in the space. If we pick a technology and it no longer meets our needs, it may be possible for us to just make improvements there.
We've done that with a number of projects, where we ended up making contributions upstream to address our needs. Some examples include Kubernetes, Spark, Vault, and Kafka Connect among others.
In terms of build vs buy, I think we take a similar path to our customers. We build when it's core to our business, and buy if it is outside of our core competencies. This allows us to work with experts in those domains, while we focus on building products and solutions that are key to our customers.
Geoffrey: From chatting to others so far in this interview series, there seem to be a couple of key inflection points at which they adopt new tools. For Datadog, are there any key inflection points that have driven the adoption of new technology or tools?
Ilan: There’s less exciting tech, like how we manage our workflows and how we interact as a team. When Datadog first started, we were a small engineering team interacting with each other one-on-one. Over time, we've grown from a dozen engineers to over 1,500 people across the company. Many of the workflows and tools we had come up with for how to track what we were working on together had to change as we’ve grown over the years, like moving from stickies on a literal whiteboard to work tracking systems like Trello or JIRA.
It's not as exciting as talking about how we changed out a messaging system or changed the way we store data in the cloud, but tools or workflows that help you improve productivity and better collaborate across teams have a major impact on what you can deliver as a business. Maybe that's just a reflection of where I focus my time on a day-to-day basis, being in product.
On the tech side, things have changed significantly. One of the biggest changes for us came around the time we stopped using off-the-shelf data stores and moved to custom-built ones. That’s a good example of something that had a significant impact on our ability to scale the data. That wasn't a decision we made lightly. Nobody wants to be in the business of building a database from scratch, but when you hit scale, you need to do it.
The essential question: Will today’s tools work tomorrow?
Geoffrey: That brings us nicely into the next question, which is, as a company, how do you predict how a tool you choose today will scale with your company growth over the next 12 months or beyond?
Ilan: We spend a good deal of time looking at our systems, at the technologies our customers are using, and at the data that's coming our way, and then attempt to forecast out what scale will look like over the next few quarters and years. That's hard to predict early on. If you have achieved that hockey-stick growth, you can't just do a linear forecast. But you can stress-test your limits and know where your ceilings are so that you can avoid being caught off guard.
That growth may be more than the organic growth of your customer base; it can also be changes in how your users interact with your product. For example, when our customers started to move from VMs to containers and serverless, that changed the shape of the data they sent us. It meant rethinking both how we interact with our customers’ infrastructure in terms of discovering resources as they changed more dynamically, but also the scale of the data that we were looking at processing and storing and then querying back.
We had to take a step back as a team and think: “Okay, Kubernetes just launched. Docker just launched. If these things grow like we think they're going to grow, what do we need to do as a company to be ready for that?" And then we just start working backward from that.
In the case of containers and orchestrators (like Kubernetes and ECS) though, we saw adoption move a lot more quickly than anyone really expected. You can see that in the container studies that we've published, where we saw usage increasing by 5x year over year. But by paying attention to where our customers were trending through this type of data analysis, we were able to plan ahead and be ready for our customers as they made their own migrations.
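To make the forecasting point concrete, here is a minimal sketch of why a linear forecast underestimates compounding growth, and how to estimate when a known ceiling would be reached. It is an illustration only, not Datadog's method: the function names, the 15% monthly growth rate, and the load figures are all assumptions chosen for the example.

```python
import math


def compound_projection(current, monthly_rate, months):
    """Project load assuming growth compounds every month."""
    return current * (1 + monthly_rate) ** months


def linear_projection(current, monthly_rate, months):
    """Project load by extrapolating this month's absolute increase."""
    return current * (1 + monthly_rate * months)


def months_to_ceiling(current, ceiling, monthly_rate):
    """Smallest whole number of months until compounding load reaches the ceiling."""
    if monthly_rate <= 0:
        raise ValueError("monthly_rate must be positive")
    if current >= ceiling:
        return 0
    return math.ceil(math.log(ceiling / current) / math.log(1 + monthly_rate))


# Hypothetical numbers: 100 units of load today, a stress-tested ceiling of
# 1,000 units, and 15% month-over-month growth.
months = months_to_ceiling(100, 1000, 0.15)
linear_at_ceiling = linear_projection(100, 0.15, months)
compound_at_ceiling = compound_projection(100, 0.15, months)
```

With these assumed inputs, the compounding model reaches the ceiling while the linear forecast still shows plenty of headroom, which is the gap a stress-tested ceiling is meant to expose.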
Geoffrey: Do you optimize for "boring technology" (to borrow the Dan McKinley phrase) or do you like to experiment with the latest and greatest tools?
Ilan: It really depends on the situation. If there is a tried and true path for solving a problem, then that’s an ideal path. But as we start to stretch limits, we need to look at the cutting edge. A secondary benefit of using the latest and greatest, though, is how it helps us understand technologies our customers might be adopting.
For example, we were a relatively early adopter of Kubernetes for managing our compute resources. We adopted it primarily for the technical benefits it offered. Simplifying our application deployments lets us abstract the way we interact with compute resources across providers.
We got just as much value from being the first (if not the only) monitoring company that operates our infrastructure on Kubernetes. It gave us a unique opportunity to understand the challenges that our customers would have as they made similar data migrations. We're able to, forgive the pun, dog-food our own solutions around that and quickly iterate and improve based on that internal feedback.
Geoffrey: Are there any specific technologies that were really popular in the early days of Datadog that just didn't work as you scaled?
Ilan: We've seen a change in the languages and the technology used as well. In the early days of Datadog, things were primarily in Python, and Python's great. There are lots of libraries, lots of frameworks. It’s very easy to learn and very easy to adapt as we needed to build new features. But it's not necessarily the best technology for some of our workloads.
So, over time, we've adopted other languages. Many of our backend systems are developed in Go. When we're working with some of our big data and streaming workloads, that's going to be in a language like Scala. That's not to start a debate over which language is best. Instead, we’re just trying to pick the right tool for the job. So a lot of our user-facing web front-ends continue to be in Python. Backend services, APIs, and things that process large amounts of data tend to be in Go. And then anything streaming tends to be in Scala.
Over time, we've become a polyglot company. Looking at our customer base, it's a transition they all make over time, as they build more products and teams.
The way developers test and develop their applications at Datadog has also changed quite a bit. We've moved from a monolith to a much more distributed set of systems over the years. As the number of components grows, it becomes hard to run that all on your laptop. So we moved from the developer virtual machine, where you have all of the different pieces that make up a Datadog environment deployed locally, to a more targeted setup – one that lets you test the components you’re working on alongside other pieces that might be running in a more stable development environment. You just can't fit all of this on your own machine anymore.
Geoffrey: Opposite question: As you've seen so much change at Datadog, are there any tools or frameworks that have remained consistent throughout?
Ilan: We continue to operate in the cloud, although we have expanded our deployments to multiple cloud providers. Streaming technologies like Kafka are still the circulatory system of Datadog. Almost every piece of telemetry we ingest at some point passes through Kafka, as it makes its way through our platform for processing or storage. It's just been a very scalable solution for us, and it's a pattern that works quite well, so we've not had anything to replace it yet. We just invest more in it. We’ve shared some of our experiences in scaling it out on our blog and in presentations, as well as released our tools and frameworks for managing it.
Datadog’s number-one criterion for introducing a new technology
Geoffrey: Once you're evaluating a new technology, there are various criteria that you can weigh against one another. What criteria are most important to Datadog?
Ilan: The number one thing is – does it solve a new and unique problem? Are there existing tools and patterns we’ve already adopted elsewhere internally that we can benefit from?
Another thing we have to look at is, what will it take to operate it at scale? We're looking at many trillions of data points per day that come in from our customers, so we need to know that these technologies scale as needed and that they're not going to create a significant amount of operational toil for our teams. And if they would, is there tooling we could build to prevent that? Those are the two key areas we’d look at.
Of course, the next question is cost. When I say scale, I don’t just mean can it handle requests. It’s about whether it can handle them efficiently in a way where we'll stay within the budgets and parameters we've set for that particular server or service.
Geoffrey: Our co-founder, Calvin, recently wrote about how we trimmed $10 million from our AWS budget. How are you thinking about keeping costs down from particular services?
Ilan: Our primary focus is availability and performance for our customers, given how key we are to how our customers operate their own environments. But finding ways to meet those service level targets in a more efficient manner is, of course, preferred. In terms of infrastructure costs and management, we have a team that focuses on that specifically. They help teams understand their spending and where we should focus on optimization, whether that be in code or in relationships with our providers.
When Calvin’s article came out, we compared it to our own workflows and tooling and saw quite a few similarities. We’ve implemented them with slightly different technologies and workflows, but the same core ideas. For example, I think cross-AZ traffic and the costs associated with that were a big focus for Segment’s teams. We've taken similar approaches within our stack by improving AZ affinity and data locality, and by using Network Performance Monitoring to identify inefficient configurations.
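As a rough illustration of the cross-AZ idea, the sketch below computes what fraction of each service's traffic leaves its availability zone, given a list of flow records. The record shape, service names, and byte counts are hypothetical, and this is not how Datadog's Network Performance Monitoring product works internally; it only shows the kind of question such data can answer.

```python
from collections import defaultdict


def cross_az_share(flows):
    """Return {service: fraction of bytes whose source and destination AZ differ}.

    `flows` is an iterable of (service, src_az, dst_az, nbytes) tuples.
    This record format is an assumption made for the example.
    """
    total = defaultdict(int)
    cross = defaultdict(int)
    for service, src_az, dst_az, nbytes in flows:
        total[service] += nbytes
        if src_az != dst_az:
            cross[service] += nbytes
    # Skip services with no traffic to avoid dividing by zero.
    return {s: cross[s] / total[s] for s in total if total[s] > 0}


# Hypothetical flow records.
flows = [
    ("ingest", "us-east-1a", "us-east-1a", 600),
    ("ingest", "us-east-1a", "us-east-1b", 400),
    ("query", "us-east-1b", "us-east-1b", 1000),
]
shares = cross_az_share(flows)
```

A service with a high share is a candidate for better AZ affinity or data locality, since cross-AZ transfer is typically billed while same-AZ traffic often is not.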
Optimizing the cloud – without being wasteful
Geoffrey: I'd love to hear more about this cloud optimization team that you have. Is that something unique to Datadog, or is that something that you see in other companies as well?
Ilan: I don't think it's unique to Datadog. I’m increasingly seeing companies build teams that focus on their cloud or infrastructure economics. And while there's a whole industry of software providers that offer tools to make cost management easier, there has still been a noticeable trend to build teams that focus on optimization. The titles vary from company to company, but the focus tends to be the same.
We all pride ourselves on being efficient and performant. Nobody wants to create waste intentionally, but it’s very easy for any individual team to look at their usage and just say, “It makes sense for me to spin up one more server or a hundred more servers to get my job done right now”. That’s the value of the cloud after all. But it is just as easy to forget about that workload and leave it running when you’re done. Or given the pace of innovation at cloud providers, to miss a new development that could reduce costs and improve performance.
So having somebody whose job, day in and day out, is to understand the economics of your infrastructure and work with service teams on where to optimize and share lessons between those teams turns out to be quite valuable.
Geoffrey: Beyond the optimization of existing tools, is there a team responsible for evolving that stack and bringing in new technology?
Ilan: There's not a specific team that's responsible for that. Each team is going to own their part of our technical stack, and they're usually organized around a particular set of features or user experiences that they're working on. There might be an alerting team or an authentication team or a storage team. Each of those needs to be thinking about and evolving their own tools and use cases, and they're going to partner with somebody in our product team or in our leadership team who’s going to drive them to move towards problems to solve rather than just always solving the immediate quick thing in front of them.
I don't think it could really be one person's job; it has to be something that we share across the organization.
Geoffrey: To wrap up, what new technologies are you most excited about introducing at Datadog?
Ilan: Over the last year, we’ve begun to use eBPF as a way to better instrument and monitor lower-level parts of the operating system in a low-impact manner, and we’re quite excited about the new capabilities it’s offering us. For example, the network performance monitoring product we mentioned earlier relies on eBPF to collect network statistics. Network data is just the start, and I think you will start to see this expand to cover other low-level system stats.
We’ve also had teams experimenting with using it internally for more efficient approaches to network routing, and for security use cases. Watch this space.