Latency percentiles are not additive

Chao Zhang
Oct 7, 2023

[Header image generated from https://mage.space/]

When estimating the end-to-end latency of a flow that spans multiple backend requests, we tend to reach for napkin math: add up the 90th percentile (p90) of each request and use the sum as the estimate. That sum, however, is often far too pessimistic. Let's walk through why.

In this simplified example, we have a UI feature displayed on the client after Request A and Request B are completed. Request B depends on Request A, so they are sequential.

[Illustration: the perceived latency spanning Request A and Request B]

We have backend latency p90(Request A) = 910ms and p90(Request B) = 910ms, so we conclude that the feature latency is p90(Feature) = p90(Request A) + p90(Request B) = 1820ms. This kind of math is common in the latency analysis sections of our engineering requirement documents. What did we miss here? A few points:

  1. Client logic: Client code has non-trivial complexity of its own. On both iOS and Android, any caching layer, data transformation layer, or view layer can add latency on the order of 100ms or more.
  2. Network latency: The device connects over home WiFi or a 5G network, and requests then hop through the ISP, CDN, and data center before finally reaching our application server. This latency is perceived by the end user but is very difficult to measure.
  3. Latency percentiles are not additive: p90(Request A and B) ≠ p90(Request A) + p90(Request B). See the simulation sketch below.
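To make point 3 concrete, here is a minimal simulation sketch in Python (assuming NumPy; the lognormal shape and its parameters are invented purely for illustration). With two independent requests, the p90 of the end-to-end latency comes out noticeably below the sum of the per-request p90s:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical backend latencies in ms. Lognormal is a common shape
# for latency distributions, but these parameters are made up.
a = rng.lognormal(mean=6.0, sigma=0.5, size=100_000)  # Request A
b = rng.lognormal(mean=6.0, sigma=0.5, size=100_000)  # Request B

p90_a = np.percentile(a, 90)
p90_b = np.percentile(b, 90)
p90_end_to_end = np.percentile(a + b, 90)  # p90 of the summed latency

print(f"p90(A) + p90(B) = {p90_a + p90_b:.0f} ms")
print(f"p90(A and B)    = {p90_end_to_end:.0f} ms")  # noticeably smaller
```

The gap exists because the slowest 10% of Request A and the slowest 10% of Request B are generally not the same users; only when the two latencies are perfectly correlated do the percentiles add up.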

Let’s assume our feature has 10 end users, from Alice to Jessica. Let’s ignore client logic and estimate the network latency at 128ms (the average of 2 RTTs from US West to US East). In the most pessimistic case (the napkin math pitfall!), Alice has the lowest backend latency for both requests and Jessica has the highest latency for both. Then p90(Request A and B) = p90(Request A) + p90(Request B) + avg(network latency) = 910ms + 910ms + 128ms = 1948ms.

In the most optimistic case, though, Alice has the lowest backend latency for Request A but the highest for Request B, while Jessica has the highest backend latency for Request A but the lowest for Request B. Everyone would end up with the same perceived latency of 1228ms. Thus p90(Request A and B) < p90(Request A) + p90(Request B). Voila!
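Here is a small sketch reproducing both cases. The per-user latencies are invented so that the p90s match the numbers above, and `method="lower"` makes NumPy pick the 9th-fastest of the 10 users as the p90, matching the napkin math:

```python
import numpy as np

# Ten hypothetical users, Alice (fastest on A) through Jessica (slowest on A).
# Values are invented so that p90(Request A) = p90(Request B) = 910ms.
request_a = np.array([100, 190, 300, 400, 500, 600, 700, 800, 910, 1000])
network = 128  # assumed average network latency in ms

# Pessimistic case: perfectly correlated, the same user is slowest on both.
request_b = request_a.copy()
totals = request_a + request_b + network
print(np.percentile(totals, 90, method="lower"))  # 1948 ms

# Optimistic case: perfectly anti-correlated, so a_i + b_i = 1100ms for
# every user and everyone perceives the same total latency.
request_b = 1100 - request_a
totals = request_a + request_b + network
print(np.percentile(totals, 90, method="lower"))  # 1228 ms
```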

The real world has far more nuance. Request B may reuse the cached result of Request A, or Request B may be skipped entirely if Request A is invalid. The latency distributions of Request A and Request B can also have different shapes, which further emphasizes the need to accurately measure perceived latency from the client side.

Takeaway

Latency percentiles are simply not additive. Adding latency percentiles across multiple requests is indicative but not conclusive, and the sum is often so pessimistic that it may trigger unnecessary overreaction.

Rather than calculating latency on paper, we need to pragmatically shift our focus to the observability of the system. Once the service is built and rolled out to production, measuring and tracing become the key.
