Segment is a hub for a tremendous amount of data. It processes peaks of 230,000 events per second inbound, and 280,000 events per second outbound between more than 200 integration partners. You may think of Segment as black box for delivering all this data. You send data once to its tracking API, and it coordinates translating data and delivering it to many destinations.
When everything works perfectly, you don’t need to open the black box. Unfortunately, the world of data delivery at scale is far from perfect. Think of all the software, networks, databases and engineers behind Segment and our partners. You can imagine at any given time a database is failing, a network is unreachable, etc.
Segment engineering has taken lengths to operate reliably in this environment. Our latest efforts have been around visibility into the HTTP response codes from destinations. We spent the last few months adding hooks to measure everything from the volume of events, how quickly they are sent to destinations, and what HTTP status code and error response body, if any, occurred for every request.
This instrumentation is ultimately for Segment users to see into the black box to answer one question: how do delivery challenges affect my data? To this end, we built an event delivery dashboard around the data.
It turns out the data in aggregate is also tremendously useful to cloud service engineers at Segment and our destination partners alike. Looking at HTTP status codes alone has unveiled lots of insights on how data flows between services and how we can maximize delivery rates.
I’d like to share some of the things we have found in a day of HTTP responses that we see at Segment.
First the good news. 92.6% of events — 24.4 billion on the sample day — are delivered on the first attempt. In this happy path, Segment makes an HTTP request to a destination and receives a HTTP 2XX success status code response.
Next, the bad news. 5.5% of events — 1.4 billion on the sample day — never make it to their destination. In this path, Segment makes an HTTP request to a destination and generally receives an HTTP 4XX client error status code response. These codes indicate the client — either Segment or the user it represents — made an error that the server can’t reconcile.
What’s the password?
The most common client errors Segment sees are
HTTP 401 Unauthorized and
HTTP 403 Forbidden on 3.8% of requests. In this case, the server doesn’t recognize the given username, password or token, and can’t accept any data. Neither Segment nor the destination server can resolve this automatically for a given request.
This is either due to wrong credentials configured in Segment in the first place or credentials that expired on a destination. Segment always attempts to send the latest events just in case the problem was resolved on either side.
The next most common client error is
HTTP 400 Bad Request on 0.51% of requests. In this case, the received the request payload but couldn’t make sense of it. These are generally validation errors. Again, Segment and the destination can’t do anything about it automatically, except show instructive error messages to the user.
These errors are considered fatal, but the qualitative data can inform ways to improve delivery over time. The first big step here was building the event delivery dashboard to surface these issue to users.
For authentication errors, a logical next step would be to send notifications when delivery begins to encounter 401 errors. We can also imagine a mechanism to disable event delivery after a threshold to spare partners the request overhead.
For validation errors, visibility into requests per-customer and per-destination can inform improvements to the Segment integration code. Segment can review partner API requirements and not attempt to deliver data it can determine is bad ahead of time, or automatically massage data to conform to the destination API.
Now the interesting challenge… a large class of HTTP problems on the internet are not fatal. In fact, most of the HTTP 5XX server error status codes reflect an unexpected error and imply that the system may accept data at a later time, as does one critical 4XX status code.
The largest class of temporary problems seen by Segment are of the
HTTP 429 -- Too Many Requests class. It’s not hard to imagine why…
Segment itself has very high rate limits with the aim of accepting all of the data a customer throws at it. Not every downstream destination has the same capabilities, particularly those that are systems of record. Intercom, Zendesk, and Mailchimp, for example, all have well-designed and lower API rate limits.
Segment has to mediate between the customer data volume and the destination rate limits. A combination of internal metering, request batching, and retry with backoff get most of the data through.
But about 7.3% of requests — 2.1 billion a day — encounter a 429 response along the way. Retries help a lot, but if a customer is simply over their limits consistently over a long enough time frame, Segment has no choice but to drop some messages. At least we can quantify how much this is happening with the delivery data and report this to a customer.
Out of service
The next largest class of error — 1.3% of requests — is from destination servers. Segment often sees servers respond with an error like:
HTTP 502 Bad Gateway
HTTP 504 Gateway Timeout
HTTP 500 Internal Server Error
HTTP 503 Service Unavailable
Perhaps it’s a temporary glitch for a single request, or perhaps the destination service is experiencing an outage. But every day Segment encounters 371 million of these error responses.
Finally, 1.1% of requests error out because of the network layer. At scale, Segment sees a noticeable number of network errors, such as:
ENOTFOUND — hostname not found
ECONNREFUSED — connection refused
ECONNRESET — connection reset
ECONNABORTED — connection aborted
EHOSTUNREACH — host unreachable
EAI_AGAIN — DNS lookup timeout
Maybe it’s due to bad host, flaky network, or DNS error.
If at first, you don’t succeed…
As seen above, a significant number of HTTP or network status codes indicate transient problems. When Segment encounters these, it retries delivery over a 4-hour window with exponential backoff. We can see that this retry strategy is successful. We go from 92.6% success on the first attempt to 93.9% success after ten attempts, an extra 163 million events delivered, all thanks to the destination server sending proper HTTP status codes.
Finally, we see some bizarre errors. A very popular destination is webhooks — arbitrary HTTP addresses to POST events. The error codes we see from these destinations imply webhooks might not always follow best practices.
We see every number from
101 as HTTP status code, which is far outside the HTTP status code specification. Perhaps this is someone testing Segment delivery rates themselves?
HTTP 418 I'm a teapot which is in the HTTP spec as an April fools joke.
We see normal HTTP SSL errors like
CERT_HAS_EXPIRED and more esoteric like
Unfortunately, all of these strange responses are considered terminal errors by Segment. Sorry, webhooks!
It’s literally impossible to achieve 100% delivery on the first attempt over the internet. Transient network errors, unexpected server errors, and rate limiting all present challenges that add up to significant problems at scale. On top of that, encryption, authentication and data validation add another layer of challenges for perfect machine-to-machine delivery.
Retries are the primary strategy to improve delivery, and a retry strategy can only be as good as the destination service response codes.
As a service provider, returning status codes like 400, 403 or 501 is a powerful signal that Segment has no choice but to drop data. Inversely, returning status codes like 500, 502, and 504 is strong hint that Segment should try again. And 429 — rate limit exceeded — is an explicit sign that Segment needs to retry later.
If you’re running cloud service APIs or writing webhooks, think carefully about HTTP status codes. User data depends on it!
For more information about cloud service APIs, visit Segment’s Destinations catalog at https://segment.com/catalog#integrations/all
How to collaborate across marketing & engineering teams when purchasing new technology
Learn how to align and collaborate across marketing and engineering teams, especially when it comes to launching new marketing software that benefits them both.
AI + Personalization: 5 Ways to Use It
Companies that use AI and a CDP can create strong, personalized campaigns that are unique to their customers. This blog explores 5 use cases, complete with examples.
Four recent GDPR changes your business needs to know about
We cover GDPR changes: AI's impact, updated cookie banners, cross-border enforcement law, EU-U.S. Data Privacy Framework; Twilio Segment's role in GDPR compliance through consent, PII protection, local data processing.