If you need to reduce a distribution to a single number, the most informative number is going to be the mean.
I understand their point about the 99th percentile, but consider that it's possible to improve the 99th percentile measure, while increasing the mean and degrading the performance of all but 1% of the users.
The real issue is reducing a distribution to one number.
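To make that concrete, here is a minimal sketch with made-up numbers (the latencies and the nearest-rank `p99` helper are my own assumptions, not anything from the article). The 99th percentile drops sharply while the mean gets worse and nearly everyone ends up slower.

```python
import statistics

def p99(samples):
    """99th percentile using the nearest-rank method (integer math only)."""
    ordered = sorted(samples)
    rank = max(1, (99 * len(ordered) + 99) // 100)  # ceil(0.99 * n)
    return ordered[rank - 1]

# Hypothetical response times in milliseconds for 100 users.
before = [100] * 98 + [5000] * 2   # almost everyone fast, two very slow
after = [300] * 99 + [5000] * 1    # one slow user fixed, everyone else slower

print("p99 before/after:", p99(before), p99(after))
print("mean before/after:", statistics.mean(before), statistics.mean(after))
```

With these numbers the p99 looks like a big win even though 98 of the 100 users went from 100ms to 300ms.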
No, the most appropriate "average" number is the median - the number that is at exactly 50% of the population. At least, that's what most people think about when they hear "average".
For example, most people intuitively perceive the fact that "the majority of drivers consider themselves above average" as human stupidity, however mathematically it makes perfect sense if average == arithmetic mean.
If you are not convinced, consider Bill Gates walking into a room full of college students - suddenly, almost everybody becomes below-average wealthy (if average == mean).
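A rough sketch of that room, with invented wealth figures (the student values and the 100 billion figure are placeholders, not real data):

```python
import statistics

# Hypothetical net worth of 30 students: 1,000 up to 30,000.
students = [1_000 * (i + 1) for i in range(30)]
# The same room after one (hypothetical) billionaire walks in.
room = students + [100_000_000_000]

def share_below_mean(values):
    """Fraction of people whose value is strictly below the arithmetic mean."""
    mean = statistics.mean(values)
    return sum(v < mean for v in values) / len(values)

print("below mean, students only:", share_below_mean(students))
print("below mean, with billionaire:", share_below_mean(room))
print("median before/after:", statistics.median(students), statistics.median(room))
```

The share of people below the mean jumps from half the room to 30 out of 31, while the median barely moves (15,500 to 16,000).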
The median can be just as misleading as the mean, if the outliers matter. Properly played, roulette can have a positive median profit for the player, but the mean will always converge on a loss.
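One way to see this, assuming "properly played" means something like a doubling (martingale) strategy on red with a capped number of bets on a single-zero wheel. That strategy is my assumption for illustration. The numbers are exact rather than simulated:

```python
from fractions import Fraction

LOSE = Fraction(19, 37)  # chance a bet on red loses on a single-zero wheel

def outcomes(max_bets):
    """Start at 1 unit and double after each loss, giving up after max_bets.

    Returns (probability of ending +1 unit, probability of busting, bust loss).
    """
    p_bust = LOSE ** max_bets
    loss = 2 ** max_bets - 1  # 1 + 2 + 4 + ... units lost in a row
    return 1 - p_bust, p_bust, loss

def mean_profit(max_bets):
    p_win, p_bust, loss = outcomes(max_bets)
    return p_win * 1 - p_bust * loss

def median_profit(max_bets):
    p_win, _, loss = outcomes(max_bets)
    return 1 if p_win > Fraction(1, 2) else -loss

print("median profit:", median_profit(6))
print("mean profit:", float(mean_profit(6)))
```

With a cap of six bets the player walks away one unit up the vast majority of the time, so the median is positive, yet the mean works out to 1 - (38/37)^6, which is negative.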
The median has its problems too. If you only look at the median page load time, you won't even be aware of huge problems like 40% of your visitors with slow internet connections not being able to load your page at all. If huge effects don't move users across the median, the median won't reflect those effects at all.
To me, it's not clear that there is a "most appropriate" number. All mappings of a distribution into a number will throw information away - the trick is to pick a number that keeps the information you want. And that seems to depend on the context.
> most people intuitively perceive the fact that "the majority of drivers consider themselves above average" as human stupidity
I intuitively perceive it as good drivers being clustered together near the top, with a few outstanding lousy drivers near the bottom.
> If you are not convinced, consider Bill Gates walking into a room full of college students - suddenly, almost everybody becomes below-average wealthy (if average == mean).
That's true in my country even if Bill Gates doesn't walk in. Hell, perhaps it's true even in the US.
I like 90th percentile as the "one number" to choose. Mean is misleading, and median (50th percentile) is too low to capture what most of your users are experiencing.
Invert it. What we really care about is the percent of users who are experiencing a higher-than-tolerable response time, right? Why not measure that instead of using descriptive statistics whose goal is primarily to describe the entire distribution? Do we even care about the variance in response time amongst everybody below the threshold?
You are exactly right. The goal here isn't to optimally describe the entire distribution but to improve the end-user experience for higher-than-tolerable response times. The difference between 80ms and 100ms isn't noticeable. Between 100ms and 1000ms it certainly is.
> If you need to reduce a distribution to a single number, the most informative number is going to be the mean.
No such rule of thumb is ever going to work, you always need to consider context.
In the case where disutility is non-linear with response time, especially if there is a cliff below which differences are irrelevant, an improvement in the response time to the worst decile may well be worth degraded performance for a majority of users. The most useful single statistic in that case could be, not mean, median, or mode, but the percentage of users who fall above the "unacceptable" delay threshold.
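A small sketch of that statistic, with invented latencies and a hypothetical 1000ms "unacceptable" cutoff (pick whatever threshold your users actually complain at):

```python
import statistics

def fraction_over(samples_ms, threshold_ms):
    """Share of requests slower than the tolerable threshold."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    return sum(s > threshold_ms for s in samples_ms) / len(samples_ms)

THRESHOLD_MS = 1000  # assumed cutoff, not a measured value

# Hypothetical load times: the change slows most users from 80ms to 100ms
# but rescues half of the users who were stuck at 2000ms.
before = [80] * 90 + [2000] * 10
after = [100] * 95 + [2000] * 5

print("over threshold before/after:",
      fraction_over(before, THRESHOLD_MS), fraction_over(after, THRESHOLD_MS))
print("median before/after:",
      statistics.median(before), statistics.median(after))
```

Here the median gets worse (80ms to 100ms, which nobody notices) while the share of users over the threshold is cut in half.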
> If you need to reduce a distribution to a single number, the most informative number is going to be the mean.
Not necessarily. If you want a general-purpose summary statistic that's easy to interpret, for many distributions, the median is a better statistic than the mean. (The pathological case is the Cauchy distribution, which looks like a normal distribution with heavier tails but has a median and no mean. For it, any individual data point and the sample mean approximate the median of the true distribution equally well.)
But yes, unless you have an a priori reason to think that the response is normal, it is worthwhile to look at quantiles, a histogram, or a kernel smoothed density estimate of the data as opposed to a single statistic.
Using FPS as a measure of your UI's performance is equally problematic. FPS is a great measurement for games since performance dips usually occur over a span of many frames, but for UIs a lot of work tends to get concentrated into a single frame. A single frame that takes 110ms (or, heaven forbid, 500ms) to render won't move the needle on your FPS meter, but it will be instantly recognizable by the user.
I've complained about this before; use maximum frame delay [0] instead of FPS when measuring UI responsiveness.
[0] The maximum time elapsed between any two sequential frames during your test.
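For illustration, a sketch of both measurements on made-up frame timestamps (the 16ms cadence and the single 110ms frame are assumed values, and the helper names are mine):

```python
from itertools import accumulate

def timestamps(deltas_ms):
    """Turn per-frame durations into frame timestamps starting at 0."""
    return list(accumulate([0] + list(deltas_ms)))

def average_fps(frame_times_ms):
    """Frames rendered per second over the whole test."""
    elapsed_ms = frame_times_ms[-1] - frame_times_ms[0]
    return (len(frame_times_ms) - 1) * 1000 / elapsed_ms

def max_frame_delay(frame_times_ms):
    """Longest gap between any two sequential frames."""
    return max(b - a for a, b in zip(frame_times_ms, frame_times_ms[1:]))

# Roughly ten seconds of frames at 16ms each, with and without one 110ms frame.
smooth = timestamps([16] * 600)
janky = timestamps([16] * 300 + [110] + [16] * 299)

print("fps:", average_fps(smooth), round(average_fps(janky), 2))
print("max delay:", max_frame_delay(smooth), max_frame_delay(janky))
```

The FPS figure drops by less than one frame per second, while the max frame delay goes from 16ms to 110ms.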
Actually max frame delay is a good metric for games too. AMD has had a driver defect for years that caused stutter. It wasn't until their rival, Nvidia, released a max frame delay test tool and rubbed AMD's nose in it that they realised there was a problem.
Why not break the numbers down more granularly, to something like 25%, 50%, 70%, 90%, 95%, 99%?
Understanding where your users are on the curve is probably more interesting than a single number. Worrying about that last 1% really only makes a meaningful difference if your user base is huge enough that fixing something for 1% of your users can jump revenue by a significant multiple.
Mentally, I try and think of the 80 or 90% of users with a similar experience, needs, etc. and make it better for them. In this case, speed is good for everybody, but I care very little about the needs of that last 1% if your customers are all paying the same. No sense in putting the needs of a small number of users in front of the needs of a much larger set of users.
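The 25/50/70/90/95/99 breakdown is cheap to compute. A minimal sketch using the nearest-rank method on placeholder data (the sample values are synthetic, and other percentile definitions will give slightly different numbers):

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct / 100 * n)
    return ordered[rank - 1]

def breakdown(samples, pcts=(25, 50, 70, 90, 95, 99)):
    """Map each requested percentile to its value."""
    return {p: percentile(samples, p) for p in pcts}

# Synthetic load times in ms: 10, 20, ..., 1000.
load_times_ms = [10 * i for i in range(1, 101)]

for pct, value in breakdown(load_times_ms).items():
    print(f"p{pct}: {value}ms")
```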
Big fan of Optimizely here. I've used their product with a handful of my clients.
The thing that struck me about the article the most was how well it was written. Very engaging all the way through. That kind of quality of writing, I would say is quite rare.
Anyway, I'm really glad that they've improved the load times for their snippet, because this issue is always a genuine concern that needs resolving.
Amazon is all about bringing the best value to the most customers, so even a deviation in mean to bring down the p99 is worth it for most customers. Especially when that correlates directly to sales.
Sadly, it is hard to say anything without values on the axes. Is the difference between the mean and the 99th percentile 5s or 20ms?
As someone said here, a threshold for "slow loading" should be chosen before picking a metric for measuring it.
Pick the first graph and draw a line where users start to complain about slow loading, then check how many are below and above it. If the number of users below the threshold is sufficiently greater than the number above it, you shouldn't be so worried. How much greater can be derived from the standard error of the threshold as measured from users.
Regarding mean vs 99% etc. In this case all you care about is: did loading the script delay page rendering to an extent that it was perceptible to the user? It's basically a step function. 99% is appropriate in this case.
Want to do it yourself? This talk by Etsy a few weeks ago has some detail on how they did a similar thing:
Some links at the end of the talk. Infrastructure wise, I think you have to be prepared to pay for some expensive DNS before this kind of thing is viable.
It is only about 13 pages, making it a quick but very informative read. I highly recommend it for anyone trying to measure performance, throughput, response time, efficiency, skew and load.
Unfortunately that whitepaper says nothing on how they do it.
My theory is that they're using the Dynect CDN manager. This is a tool created by Dyn to let companies use their DNS system to pick which CDNs are used where, and to create round robin rules that are percentage and region based.
I confirmed this by checking their nameservers, which are currently pointed at Dyn (ie, NS1.P24.DYNECT.NET).
The thing that amuses me about this is that Edgecast offers this same service at a far, far more reasonable price. I'm not sure why they didn't just use them.
"""
At the highest level, a “balanced” CDN architecture is one that leverages two or more CDNs hosting identical content to increase
(a) the overall number of physical Points of Presence for the network (PoPs) and
(b) the proximity of those PoPs to the end users accessing them around the world
"""
No real info in there. They only say that they mix CDNs based on where each one's strengths lie, probably using some DNS tricks of their own based on GeoIP. No real details on how they figured out how to split things or how they performed the split itself.
Stop trying to put conversion engagement trickers out of a job. Give them all your personal information so they can send you drip emails over the next six months to sell you things.
Well, bringing down the maximum response time is a good goal. Really, if all your responses are fast enough, then it doesn’t matter if most of them are on the slow end of that range.
I think in Optimizely's case, the most important factor is making sure that there's no statistically significant correlation between higher response times and A/B testing. In other words, if the higher response times result in an imbalanced impact on the test, the test is invalid.
You are right that if one variation takes longer to load and is noticeable it is likely to cause a lower conversion rate. However with Optimizely the response time (on average and 99th percentile) is the same between control and test variations since our implementation is a single snippet of code regardless of what bucket you are put into.
What we are optimizing here for is end-user experience and minimizing the chance they have a higher-than-tolerable response time regardless of the variation they see.