In web analytics, sampling is often presented as a useful, if not necessary, improvement. The general argument is that sampling dramatically speeds up data processing.
This concept leaves me somewhere between dismay and admiration. Dismay because a loss of quality is being justified, and admiration because a confirmed weakness is being portrayed as a strength. One trait common to all analysts is critical thinking, and that critical thinking will inevitably be triggered by the suspicious abundance of material on the benefits of sampling: is sampling being presented as something it really isn't, and if so, why?
I felt that this question deserved to be researched and here are the findings:
- The “Alibi” which is largely “forced” (the virtues of sampling)
- The “Operation”: the reality of sampling (the hidden facts)
- The “Motive”: who uses sampling (who benefits from it) and why?
The myth of virtuous sampling
If we were to believe the many different articles and publications available on sampling, then sampling would benefit the analysts:
|“Sampling improves the speed at which information is processed by reducing the amount of data to be processed.”|
This claim is not debatable when considered on its own.
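The mechanics behind this speed claim can be sketched in a few lines: a uniform sample touches only a fraction of the events and scales the resulting counts back up. The event log, the 1-in-10 sampling rate, and the conversion figures below are illustrative assumptions, not taken from any real tool:

```python
import random

random.seed(42)

# Hypothetical event log: 1,000,000 pageviews, roughly 3% of which convert.
events = [{"converted": random.random() < 0.03} for _ in range(1_000_000)]

# Full scan: every single event is processed.
true_conversions = sum(e["converted"] for e in events)

# 1-in-10 uniform sample: only ~10% of events are processed,
# and each sampled event counts for 10 in the final estimate.
rate = 10
sample = [e for e in events if random.random() < 1 / rate]
estimated_conversions = sum(e["converted"] for e in sample) * rate

print(true_conversions, estimated_conversions, len(sample))
```

The estimate lands close to the true count while scanning a tenth of the data, which is precisely the speed-for-precision trade the rest of this article questions.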
|“Sampling does not penalise the relevance of reports.”|
This argument has never been explained (and for good reason) but is always implied.
|“It benefits the final user.”|
If the first two claims hold, then this final claim follows.
There is, however, something more worrying: by adopting this reasoning, an unpleasant conclusion begins to emerge:
|“Service providers who use sampling perform well, whereas the other service providers have not optimised their performances.”|
If we agree with what has been said previously then we naturally arrive at this “conclusion”.
The hidden vices of sampling
Keeping problems hidden is a well-known technique: all you need to do is hide the flaws and defects.
If we take a closer look, any performance increase attributed to sampling is in fact the recovery of previously degraded performance, which has merely climbed back to the level it should never have dropped from in the first place.
It is a palliative used to correct an obvious weakness. Service providers who do not use sampling perform just as well as, if not better than, those who do.
Leading global companies are amongst those companies which use sampling.
When we consider the technical means and phenomenal processing power that such companies possess, it is tempting to think that if the largest companies have problems, then smaller companies must be in an even worse situation.
Let’s check whether this is the case by studying the example below.
Sampling: for whom, and why?
I have always said that analyses (irrespective of the domain) cannot be separated from their context. To move from this statement to the issue at hand, I suggest we review a well-known concept that all logistics coordinators and production unit managers face: returns to scale.
The principle is simple: any increase in production must contend with capacity thresholds that affect ROI.
|Let’s take the recent example of the Nissan car factory which reduced the production of one of its Oppama lines in Japan from 1.35 to 1.15 million units to benefit its factory in Thailand.|
A factory whose production capacity is 1.5 million but which produces 1 million may satisfy a 50% increase in demand at low cost without any real heavy investment in infrastructure. However, if the factory already produces 1.49 million vehicles, a simple 1% increase in demand would force the company to build a new factory, which would in turn dramatically increase the production cost per unit.
- Overcapacity: production can be increased considerably at very low cost -> high ROI = increasing returns to scale.
- Undercapacity: the slightest increase in production will be very expensive -> weak, even negative ROI = decreasing returns to scale.
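The threshold arithmetic above can be sketched numerically. The per-plant capacity echoes the factory example; the fixed and variable costs are hypothetical round numbers chosen only to make the step effect visible:

```python
import math

def unit_cost(demand, capacity, fixed_cost_per_plant, variable_cost):
    """Average cost per unit when each additional plant adds `capacity`
    units of annual output at a fixed cost of `fixed_cost_per_plant`."""
    plants = math.ceil(demand / capacity)
    return (plants * fixed_cost_per_plant + demand * variable_cost) / demand

CAPACITY = 1_500_000   # units per plant (from the example)
FIXED = 2_000_000_000  # hypothetical fixed cost per plant
VARIABLE = 10_000      # hypothetical variable cost per unit

# Overcapacity: demand grows 50% (1.0M -> 1.5M) inside one plant,
# so the fixed cost is spread over more units and unit cost falls.
print(unit_cost(1_000_000, CAPACITY, FIXED, VARIABLE))  # 12000.0
print(unit_cost(1_500_000, CAPACITY, FIXED, VARIABLE))  # ~11333.3

# At the threshold: a ~1% rise (1.49M -> ~1.505M) forces a second
# plant, and the unit cost jumps.
print(unit_cost(1_490_000, CAPACITY, FIXED, VARIABLE))
print(unit_cost(1_504_900, CAPACITY, FIXED, VARIABLE))
```

Below capacity, each extra unit lowers the average cost (increasing returns); the moment demand crosses the capacity of the existing plants, a new fixed cost lands and the average cost jumps (decreasing returns).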
Once this critical threshold is reached, there are several options: lose money, discourage additional demand, reduce production costs, or combine the last two.
Some web analytics providers face the same problem. A simple increase in organic traffic across their existing (and/or free) customer base forces them to invest (heavily) in processing capacity just to keep up.
Others, on the other hand, anticipate the growth of their production capacity so that they always remain in reasonable overcapacity.
Each company's ability to adapt its production capacity (and therefore the potential quality and speed of the service provided) is tied not to its size but to the ratio between processing power and the volume of traffic to be processed, which is simply common sense.
We have just seen that sampling is used by certain companies to perform at the same level as competitors who experience no undercapacity in production, and who therefore do not need sampling to achieve satisfactory speed and performance.
For completeness, an operator who does not face this threshold problem may still be tempted to use sampling simply to reduce costs, improve performance, and lower the price of its services. In the next part we will also see how sampling, even though it meets basic user needs, falls short for advanced users with more specific requirements.
The next step involves studying and analysing the consequences of such sampling: are samples really as neutral as claimed? Does sampling affect the reliability and relevance of the results?
Do not hesitate to share your comments!