Why status codes matter in data delivery
Aug 7, 2018
By Noah Zoschke
Segment is a hub for a tremendous amount of data. It processes peaks of 230,000 events per second inbound, and 280,000 events per second outbound between more than 200 integration partners. You may think of Segment as black box for delivering all this data. You send data once to its tracking API, and it coordinates translating data and delivering it to many destinations.
When everything works perfectly, you don’t need to open the black box. Unfortunately, the world of data delivery at scale is far from perfect. Think of all the software, networks, databases and engineers behind Segment and our partners. You can imagine at any given time a database is failing, a network is unreachable, etc.
Segment engineering has taken lengths to operate reliably in this environment. Our latest efforts have been around visibility into the HTTP response codes from destinations. We spent the last few months adding hooks to measure everything from the volume of events, how quickly they are sent to destinations, and what HTTP status code and error response body, if any, occurred for every request.
This instrumentation is ultimately for Segment users to see into the black box to answer one question: how do delivery challenges affect my data? To this end, we built an event delivery dashboard around the data.
It turns out the data in aggregate is also tremendously useful to cloud service engineers at Segment and our destination partners alike. Looking at HTTP status codes alone has unveiled lots of insights on how data flows between services and how we can maximize delivery rates.
I’d like to share some of the things we have found in a day of HTTP responses that we see at Segment.
Success!
First the good news. 92.6% of events — 24.4 billion on the sample day — are delivered on the first attempt. In this happy path, Segment makes an HTTP request to a destination and receives a HTTP 2XX success status code response.
Terminal problems
Next, the bad news. 5.5% of events — 1.4 billion on the sample day — never make it to their destination. In this path, Segment makes an HTTP request to a destination and generally receives an HTTP 4XX client error status code response. These codes indicate the client — either Segment or the user it represents — made an error that the server can’t reconcile.
What’s the password?
The most common client errors Segment sees are HTTP 401 Unauthorized
and HTTP 403 Forbidden
on 3.8% of requests. In this case, the server doesn’t recognize the given username, password or token, and can’t accept any data. Neither Segment nor the destination server can resolve this automatically for a given request.
This is either due to wrong credentials configured in Segment in the first place or credentials that expired on a destination. Segment always attempts to send the latest events just in case the problem was resolved on either side.
No comprende
The next most common client error is HTTP 400 Bad Request
on 0.51% of requests. In this case, the received the request payload but couldn’t make sense of it. These are generally validation errors. Again, Segment and the destination can’t do anything about it automatically, except show instructive error messages to the user.
Next steps…
These errors are considered fatal, but the qualitative data can inform ways to improve delivery over time. The first big step here was building the event delivery dashboard to surface these issue to users.
For authentication errors, a logical next step would be to send notifications when delivery begins to encounter 401 errors. We can also imagine a mechanism to disable event delivery after a threshold to spare partners the request overhead.
For validation errors, visibility into requests per-customer and per-destination can inform improvements to the Segment integration code. Segment can review partner API requirements and not attempt to deliver data it can determine is bad ahead of time, or automatically massage data to conform to the destination API.
Ephemeral problems
Now the interesting challenge… a large class of HTTP problems on the internet are not fatal. In fact, most of the HTTP 5XX server error status codes reflect an unexpected error and imply that the system may accept data at a later time, as does one critical 4XX status code.
Volume
The largest class of temporary problems seen by Segment are of the HTTP 429 -- Too Many Requests
class. It’s not hard to imagine why…
Segment itself has very high rate limits with the aim of accepting all of the data a customer throws at it. Not every downstream destination has the same capabilities, particularly those that are systems of record. Intercom, Zendesk, and Mailchimp, for example, all have well-designed and lower API rate limits.
Segment has to mediate between the customer data volume and the destination rate limits. A combination of internal metering, request batching, and retry with backoff get most of the data through.
But about 7.3% of requests — 2.1 billion a day — encounter a 429 response along the way. Retries help a lot, but if a customer is simply over their limits consistently over a long enough time frame, Segment has no choice but to drop some messages. At least we can quantify how much this is happening with the delivery data and report this to a customer.
Out of service
The next largest class of error — 1.3% of requests — is from destination servers. Segment often sees servers respond with an error like:
HTTP 502 Bad Gateway
HTTP 504 Gateway Timeout
HTTP 500 Internal Server Error
HTTP 503 Service Unavailable
Perhaps it’s a temporary glitch for a single request, or perhaps the destination service is experiencing an outage. But every day Segment encounters 371 million of these error responses.
Unreliable channel
Finally, 1.1% of requests error out because of the network layer. At scale, Segment sees a noticeable number of network errors, such as:
ENOTFOUND — hostname not found
ECONNREFUSED — connection refused
ECONNRESET — connection reset
ECONNABORTED — connection aborted
EHOSTUNREACH — host unreachable
EAI_AGAIN — DNS lookup timeout
Maybe it’s due to bad host, flaky network, or DNS error.
If at first, you don’t succeed…
As seen above, a significant number of HTTP or network status codes indicate transient problems. When Segment encounters these, it retries delivery over a 4-hour window with exponential backoff. We can see that this retry strategy is successful. We go from 92.6% success on the first attempt to 93.9% success after ten attempts, an extra 163 million events delivered, all thanks to the destination server sending proper HTTP status codes.
WTF webhooks
Finally, we see some bizarre errors. A very popular destination is webhooks — arbitrary HTTP addresses to POST events. The error codes we see from these destinations imply webhooks might not always follow best practices.
We see every number from 1
to 101
as HTTP status code, which is far outside the HTTP status code specification. Perhaps this is someone testing Segment delivery rates themselves?
We see HTTP 418 I'm a teapot
which is in the HTTP spec as an April fools joke.
We see normal HTTP SSL errors like CERT_HAS_EXPIRED
and more esoteric like UNABLE_TO_VERIFY_LEAF_SIGNATURE
and DEPTH_ZERO_SELF_SIGNED_CERT
.
Unfortunately, all of these strange responses are considered terminal errors by Segment. Sorry, webhooks!
Conclusion
It’s literally impossible to achieve 100% delivery on the first attempt over the internet. Transient network errors, unexpected server errors, and rate limiting all present challenges that add up to significant problems at scale. On top of that, encryption, authentication and data validation add another layer of challenges for perfect machine-to-machine delivery.
Retries are the primary strategy to improve delivery, and a retry strategy can only be as good as the destination service response codes.
As a service provider, returning status codes like 400, 403 or 501 is a powerful signal that Segment has no choice but to drop data. Inversely, returning status codes like 500, 502, and 504 is strong hint that Segment should try again. And 429 — rate limit exceeded — is an explicit sign that Segment needs to retry later.
If you’re running cloud service APIs or writing webhooks, think carefully about HTTP status codes. User data depends on it!
For more information about cloud service APIs, visit Segment’s Destinations catalog at https://segment.com/catalog#integrations/all
The State of Personalization 2023
Our annual look at how attitudes, preferences, and experiences with personalization have evolved over the past year.
Get the reportThe State of Personalization 2023
Our annual look at how attitudes, preferences, and experiences with personalization have evolved over the past year.
Get the reportShare article
Recommended articles
How to accelerate time-to-value with a personalized customer onboarding campaign
To help businesses reach time-to-value faster, this blog explores how tools like Twilio Segment can be used to customize onboarding to activate users immediately, optimize engagement with real-time audiences, and utilize NPS for deeper customer insights.
Introducing Segment Community: A central hub to connect, learn, share and innovate
Dive into Segment's vibrant customer community, where you can connect with peers, gain exclusive insights, and elevate your success with expert guidance and resources!
Using ClickHouse to count unique users at scale
By implementing semantic sharding and optimizing filtering and grouping with ClickHouse, we transformed query times from minutes to seconds, ensuring efficient handling of high-volume journeys in production while paving the way for future enhancements.