The Most Misleading Measure of Response Time: Average

JackFr · on Dec 11, 2013

Liked it better when it was called "Programmers Need To Learn Statistics Or I Will Kill Them All" http://zedshaw.com/essays/programmer_stats.html, but still kind of wrong.

If you need to reduce a distribution to a single number, the most informative number is going to be the mean.

I understand their point about the 99th percentile, but consider that it's possible to improve the 99th percentile measure, while increasing the mean and degrading the performance of all but 1% of the users.

The real issue is reducing a distribution to one number.
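A toy sketch of that trade-off, with made-up numbers (the request counts and timings are assumptions for illustration, not anything from the article): the 99th percentile drops while the mean rises and 98% of requests get slower.

```python
import statistics

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceil(p * n / 100)
    return ordered[rank - 1]

# Hypothetical response times in ms for 100 requests.
before = [100] * 98 + [5000] * 2
after = [300] * 98 + [2000] * 2   # the tail got faster, everyone else got slower

for name, times in (("before", before), ("after", after)):
    print(name, "p99 =", percentile(times, 99), "mean =", statistics.mean(times))
```

In this made-up data the p99 goes from 5000 ms to 2000 ms while the mean goes from 198 ms to 334 ms.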

tomp · on Dec 11, 2013

No, the most appropriate "average" number is the median - the number that is at exactly 50% of the population. At least, that's what most people think about when they hear "average".

For example, most people intuitively perceive the fact that "the majority of drivers consider themselves above average" as human stupidity, however mathematically it makes perfect sense if average == arithmetic mean.

If you are not convinced, consider Bill Gates walking into a room full of college students - suddenly, almost everybody becomes below-average wealthy (if average == mean).
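A quick sketch of the room example, with invented net worths (30 students and one hypothetical billionaire): the median barely moves, while the mean jumps so far that everyone except the newcomer lands below it.

```python
import statistics

# Hypothetical net worths in dollars: 30 students, then one billionaire walks in.
students = [1_000 * k for k in range(1, 31)]   # $1k .. $30k
room = students + [100_000_000_000]

median_before = statistics.median(students)
median_after = statistics.median(room)
mean_after = statistics.mean(room)

# How many people in the room are now "below average" if average == mean?
below_mean = sum(1 for worth in room if worth < mean_after)
print(median_before, median_after, round(mean_after), below_mean)
```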

rprospero · on Dec 11, 2013

The median can be just as misleading as the mean, if the outliers matter. Properly played, roulette can have a positive median profit for the player, but the mean will always converge on a loss.
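One way to read "properly played" is a doubling (martingale) strategy; that reading, the European wheel odds of 18/37, and the six-round cap are all my assumptions. Under them the outcome distribution can be computed exactly rather than simulated:

```python
from fractions import Fraction

def martingale_outcomes(max_rounds, p_win=Fraction(18, 37)):
    """Exact outcome distribution for a doubling strategy on an even-money bet.

    Bet 1 unit, double the stake after every loss, and stop at the first win
    or after max_rounds consecutive losses. Returns {profit: probability}.
    """
    p_bust = (1 - p_win) ** max_rounds
    # Any win recovers all earlier losses plus 1 unit; a bust loses 1+2+...+2^(n-1).
    return {1: 1 - p_bust, -(2 ** max_rounds - 1): p_bust}

outcomes = martingale_outcomes(max_rounds=6)
mean_profit = sum(profit * prob for profit, prob in outcomes.items())
# With only two outcomes, the median is whichever one has more than half the mass.
median_profit = max(outcomes, key=outcomes.get)

print("P(win 1 unit) =", float(outcomes[1]))
print("median profit =", median_profit, " mean profit =", float(mean_profit))
```

The median session ends one unit up, while the mean is negative because the rare bust costs 63 units.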

rmc · on Dec 11, 2013

The vast majority of people have an above average number of legs.

tedsanders · on Dec 11, 2013

The median has its problems too. If you only look at the median page load time, you won't even be aware of huge problems like 40% of your visitors with slow internet connections not being able to load your page at all. If huge effects don't move users across the median, the median won't reflect those effects at all.

To me, it's not clear that there is a "most appropriate" number. All mappings of a distribution into a number will throw information away - the trick is to pick a number that keeps the information you want. And that seems to depend on the context.

gngeal · on Dec 11, 2013

> most people intuitively perceive the fact that "the majority of drivers consider themselves above average" as human stupidity

I intuitively perceive it as good drivers being clustered together near the top, with a few outstanding lousy drivers near the bottom.

> If you are not convinced, consider Bill Gates walking into a room full of college students - suddenly, almost everybody becomes below-average wealthy (if average == mean).

That's true in my country even if Bill Gates doesn't walk in. Hell, perhaps it's true even in the US.

tomp · on Dec 12, 2013

I meant "below room-average wealthy", i.e. taking only the people in the room into account.

mikeklaas · on Dec 11, 2013

I like 90th percentile as the "one number" to choose. Mean is misleading, and median (50th percentile) is too low to capture what most of your users are experiencing.

gmisra · on Dec 11, 2013

Invert it. What we really care about is the percent of users who are experiencing a higher-than-tolerable response time, right? Why not measure that instead of using descriptive statistics whose goal is primarily to describe the entire distribution? Do we even care about the variance in response time amongst everybody below the threshold?
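As a rough sketch, that metric is a one-liner. The sample timings and the 250 ms threshold below are made up; the right threshold is whatever your users actually tolerate.

```python
def fraction_over_threshold(response_times_ms, threshold_ms):
    """Share of requests slower than the tolerable threshold."""
    if not response_times_ms:
        raise ValueError("need at least one measurement")
    slow = sum(1 for t in response_times_ms if t > threshold_ms)
    return slow / len(response_times_ms)

# Hypothetical response times in ms.
samples = [80, 95, 100, 120, 300, 1000, 1500, 90, 85, 110]
share_slow = fraction_over_threshold(samples, threshold_ms=250)
print(f"{share_slow:.0%} of requests exceeded the threshold")
```

Note that this number ignores all variance below the threshold, which is the point.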

dsiroker · on Dec 11, 2013

You are exactly right. The goal here isn't to optimally describe the entire distribution but to improve the end-user experience for higher-than-tolerable response times. The difference between 80ms and 100ms isn't noticeable. Between 100ms and 1000ms it certainly is.

bradleyjg · on Dec 11, 2013

> If you need to reduce a distribution to a single number, the most informative number is going to be the mean.

No such rule of thumb is ever going to work, you always need to consider context.

In the case where disutility is non-linear with response time, especially if there is a cliff below which differences are irrelevant, an improvement in the response time to the worst decile may well be worth degraded performance for a majority of users. The most useful single statistic in that case could be, not mean, median, or mode, but the percentage of users who fall above the "unacceptable" delay threshold.

darkxanthos · on Dec 11, 2013

Mean, Median, and Mode are all measures of center. Which is appropriate is completely contextual.

simonster · on Dec 11, 2013

> If you need to reduce a distribution to a single number, the most informative number is going to be the mean.

Not necessarily. If you want a general-purpose summary statistic that's easy to interpret, for many distributions, the median is a better statistic than the mean. (The pathological case is the Cauchy distribution, which looks like a normal distribution with heavier tails. It has a median but no mean, and the sample mean approximates the median of the true distribution no better than any single data point does.)

But yes, unless you have an a priori reason to think that the response is normal, it is worthwhile to look at quantiles, a histogram, or a kernel smoothed density estimate of the data as opposed to a single statistic.
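For anyone who wants to see the Cauchy behaviour, here is a minimal simulation sketch. The seed and sample sizes are arbitrary choices of mine; the sampler uses the standard inverse-CDF form of the Cauchy distribution.

```python
import math
import random
import statistics

def cauchy_sample(rng):
    """Standard Cauchy draw via the inverse CDF: tan(pi * (u - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(0)  # fixed seed so the run is repeatable
results = {}
for n in (100, 10_000, 100_000):
    xs = [cauchy_sample(rng) for _ in range(n)]
    results[n] = (statistics.fmean(xs), statistics.median(xs))
    print(f"n={n:>7}  mean={results[n][0]:10.3f}  median={results[n][1]:8.4f}")
```

The sample median should settle near the true median of 0 as n grows, while the sample mean has no such guarantee and can jump around between runs and sample sizes.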

eterm · on Dec 11, 2013

I'd argue that median would be more useful than mean in this case if you had to pick a single number.

But the key takeaway is to consider distributions and to consider confounding factors.

dllthomas · on Dec 11, 2013

With the proper encoding, I can reduce any distribution you measure to one (sufficiently large) number!

ender7 · on Dec 11, 2013

Using FPS as a measure of your UI's performance is equally problematic. FPS is a great measurement for games since performance dips usually occur over a span of many frames, but for UIs a lot of work tends to get concentrated into a single frame. A single frame that takes 110ms (or, heaven forbid, 500ms) to render won't move the needle on your FPS meter, but it will be instantly recognizable by the user.

I've complained about this before; use maximum frame delay [0] instead of FPS when measuring UI responsiveness.

[0] The maximum time elapsed between any two sequential frames during your test.
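A minimal sketch of the two metrics side by side, assuming you can record a timestamp per rendered frame (the 16 ms frame time and the single 110 ms hitch are invented test data):

```python
from itertools import accumulate

def average_fps(timestamps_ms):
    """Frames rendered per second over the whole run."""
    frames = len(timestamps_ms) - 1
    elapsed_ms = timestamps_ms[-1] - timestamps_ms[0]
    return 1000 * frames / elapsed_ms

def max_frame_delay(timestamps_ms):
    """Longest gap between any two sequential frames, in ms."""
    return max(b - a for a, b in zip(timestamps_ms, timestamps_ms[1:]))

def to_timestamps(frame_durations_ms):
    return [0] + list(accumulate(frame_durations_ms))

smooth = to_timestamps([16] * 600)           # steady 16 ms frames
hitch = to_timestamps([16] * 599 + [110])    # same run with one 110 ms frame

for name, ts in (("smooth", smooth), ("hitch", hitch)):
    print(name, f"fps={average_fps(ts):.1f}", f"max delay={max_frame_delay(ts)} ms")
```

Average FPS only drops from 62.5 to about 61.9 here, while the max frame delay goes from 16 ms to 110 ms.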

stonemetal · on Dec 11, 2013

Actually max frame delay is a good metric for games too. AMD has had a driver defect for years that caused stutter. It wasn't until their rival, Nvidia, released a max frame delay test tool and rubbed AMD's nose in it that they realised there was a problem.

programminggeek · on Dec 11, 2013

Why not break the numbers down more granularly, into something like 25%, 50%, 70%, 90%, 95%, 99%?

Understanding where your users are on the curve is probably more interesting than a single number. Worrying about that last 1% really only makes a meaningful difference if your user base is huge enough that fixing something for 1% of your users can jump revenue by a significant multiple.

Mentally, I try and think of the 80 or 90% of users with a similar experience, needs, etc. and make it better for them. In this case, speed is good for everybody, but I care very little about the needs of that last 1% if your customers are all paying the same. No sense in putting the needs of a small number of users in front of the needs of a much larger set of users.

jamesaguilar · on Dec 11, 2013

50-90-99 is my go-to set. I really don't give a rip about the ops that are faster than the median. Seeing those is just me patting myself on the back.
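A sketch of pulling that set with nothing but the standard library. `statistics.quantiles` with `n=100` returns the 99 percentile cut points; the sample data is synthetic (a fast bulk plus a slow 5% tail) and only there to have something to run on.

```python
import statistics

def p50_p90_p99(response_times_ms):
    """Return the 50th, 90th and 99th percentiles of the samples."""
    cuts = statistics.quantiles(response_times_ms, n=100)  # cuts[k-1] is the k-th percentile
    return cuts[49], cuts[89], cuts[98]

# Synthetic data: 950 requests between 100 and 299 ms, plus a 50-request slow tail.
samples = [100 + (i * 37) % 200 for i in range(950)] + [1000 + 40 * i for i in range(50)]

p50, p90, p99 = p50_p90_p99(samples)
print(f"p50={p50:.0f} ms  p90={p90:.0f} ms  p99={p99:.0f} ms")
```

Extending this to 25/50/70/90/95/99 is just a matter of picking more indices out of `cuts`.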

krmmalik · on Dec 11, 2013

Big fan of Optimizely here. I've used their product with a handful of my clients. The thing that struck me about the article the most was how well it was written. Very engaging all the way through. That kind of quality of writing, I would say is quite rare.

Anyway, I'm really glad that they've improved the load times for their snippet, because this issue is always a genuine concern that needs resolving.

res0nat0r · on Dec 11, 2013

Anyone at AWS should know the phrase TP99. That is used all of the time to measure the 1% and is something they are very concerned with.

alttab · on Dec 11, 2013

Amazon is all about bringing the best value to the most customers, so even a deviation in mean to bring down the p99 is worth it for most customers. Especially when that correlates directly to sales.

durbatuluk · on Dec 11, 2013

Sadly, it is hard to say anything without values on the axes. Is the difference between the mean and the 99th percentile 5s or 20ms? As someone said here, a threshold for "slow loading" should be chosen before picking a metric to measure it. Take the first graph and draw a line where users start to complain about slow loading, then check how many are under and above it. If the number of users below the threshold is "greater" than the number above it, you shouldn't be so worried. How much greater can be judged from the standard error of the threshold measurement across users.

noelwelsh · on Dec 11, 2013

Regarding mean vs 99% etc. In this case all you care about is: did loading the script delay page rendering to an extent that it was perceptible to the user? It's basically a step function. 99% is appropriate in this case.

Want to do it yourself? This talk by Etsy a few weeks ago has some detail on how they did a similar thing:

http://www.slideshare.net/marcusbarczak/integrating-multiple...

Some links at the end of the talk. Infrastructure wise, I think you have to be prepared to pay for some expensive DNS before this kind of thing is viable.

dsiroker · on Dec 11, 2013

You are exactly right and this presentation is spot on.

josephscott · on Dec 11, 2013

The problem with using average for many performance stats is that it hides issues. There is a great paper on the topic - http://method-r.com/downloads/doc_details/44-thinking-clearl...

It is only about 13 pages, making it a quick but very informative read. I highly recommend it for anyone trying to measure performance, throughput, response time, efficiency, skew and load.

michaelbuckbee · on Dec 11, 2013

I wish they talked more about how they had combined Akamai and Edgecast - seems like a very useful and effective technique.

dsiroker · on Dec 11, 2013

Check out the whitepaper, it has a lot of details about that. :)

Direct link: http://pages.optimizely.com/CDNBalancingWhitepaper_GeneralLP...

tedivm · on Dec 11, 2013

Unfortunately that whitepaper says nothing on how they do it.

My theory is that they're using the Dynect CDN manager. This is a tool created by Dyn to let companies use their DNS system to pick which CDNs are used where, and to create round robin rules that are percentage and region based.

I confirmed this by checking their nameservers, which are currently pointed at Dyn (ie, NS1.P24.DYNECT.NET).

The thing that amuses me about this is that Edgecast offers this same service at a far, far more reasonable price. I'm not sure why they didn't just use them.
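To be clear about what "percentage and region based" rules mean, here is a purely hypothetical sketch of the idea. The hostnames, regions and weights are invented, and this is not how Dyn, Edgecast or Optimizely actually implement it; a real setup would do this at the DNS layer.

```python
import random

# Hypothetical per-region weights: percent of answers that should point at each CDN.
RULES = {
    "eu": {"cdn-a.example.net": 70, "cdn-b.example.net": 30},
    "us": {"cdn-a.example.net": 40, "cdn-b.example.net": 60},
}
DEFAULT_RULE = {"cdn-a.example.net": 50, "cdn-b.example.net": 50}

def pick_cdn(region, rng=random):
    """Pick a CDN hostname for a client region, weighted by that region's rule."""
    weights = RULES.get(region, DEFAULT_RULE)
    hosts = list(weights)
    return rng.choices(hosts, weights=[weights[h] for h in hosts], k=1)[0]

rng = random.Random(42)  # seeded only to make the demo repeatable
picks = [pick_cdn("eu", rng) for _ in range(10_000)]
share_a = picks.count("cdn-a.example.net") / len(picks)
print(f"eu traffic sent to cdn-a: {share_a:.1%}")
```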

daurnimator · on Dec 11, 2013

Care to give a summary so I don't have to give them personal details?

rcsorensen · on Dec 11, 2013

http://pages.optimizely.com/rs/optimizely/images/CDN_Balanci...

""" At the highest level, a “balanced” CDN architecture is one that leverages two or more CDNs hosting identical content to increase (a) the overall number of physical Points of Presence for the network (PoPs) and (b) the proximity of those PoPs to the end users accessing them around the world """

baruch · on Dec 11, 2013

No real info in there. They only say that they mix CDNs based on where each one is strongest, probably using some DNS tricks of their own based on GeoIP. No real details on how they figured out how to split things or how they performed the split itself.

seiji · on Dec 11, 2013

Stop trying to put conversion engagement trickers out of a job. Give them all your personal information so they can send you drip emails over the next six months to sell you things.

_clhx · on Dec 11, 2013

That's not true.

I worked at a place that ONLY cared about the longest response time. Imagine! They ignored everything else!

evincarofautumn · on Dec 12, 2013

Well, bringing down the maximum response time is a good goal. Really, if all your responses are fast enough, then it doesn’t matter if most of them are on the slow end of that range.

amikula · on Dec 11, 2013

I think in Optimizely's case, the most important factor is making sure that there's no statistically significant correlation between higher response times and A/B testing. In other words, if the higher response times result in an imbalanced impact on the test, the test is invalid.

dsiroker · on Dec 11, 2013

You are right that if one variation takes longer to load and is noticeable it is likely to cause a lower conversion rate. However with Optimizely the response time (on average and 99th percentile) is the same between control and test variations since our implementation is a single snippet of code regardless of what bucket you are put into.

What we are optimizing here for is end-user experience and minimizing the chance they have a higher-than-tolerable response time regardless of the variation they see.

d4rti · on Dec 11, 2013

I've used ApDex[1] before for giving a better measure of response times for user experience

1:http://apdex.org/
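For anyone unfamiliar with it, the Apdex score counts requests at or under a target time T as satisfied, requests between T and 4T as tolerating (worth half), and the rest as frustrated. A minimal sketch, with an assumed 0.5 s target and made-up samples:

```python
def apdex(response_times_s, target_s):
    """Apdex score: (satisfied + tolerating / 2) / total, in the range 0..1."""
    if not response_times_s:
        raise ValueError("need at least one measurement")
    satisfied = sum(1 for t in response_times_s if t <= target_s)
    tolerating = sum(1 for t in response_times_s if target_s < t <= 4 * target_s)
    return (satisfied + tolerating / 2) / len(response_times_s)

# Hypothetical response times in seconds.
samples = [0.2, 0.3, 0.4, 0.6, 1.2, 1.9, 2.5, 0.1, 0.45, 3.0]
score = apdex(samples, target_s=0.5)
print(f"Apdex(T=0.5s) = {score:.2f}")
```

Here 5 requests are satisfied, 3 are tolerating and 2 are frustrated, giving a score of 0.65.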

zcarter · on Dec 11, 2013

Step one (always): Look at your data.

Only then should you choose the statistic(s) you 'care' about.

tairizzle · on Dec 11, 2013

This was a very insightful read.

binarymax · on Dec 11, 2013

The most misleading measure of almost everything: Average