We have just seen that sampling benefits suppliers, but what does it offer customers? Companies claim that sampling improves performance and speed, and we have just seen how. But do they tell us what impact sampling has on the relevance of results? They tell us nothing, and this is where the problem lies. Here once again, the language is vague and leads to confusion. Some companies justify their use of sampling by repeatedly pointing out that it is frequently used in statistics. This argument is false: lumping the two together is meaningless, and it is even extremely dangerous.
Sampling: how, and what are the consequences?
In statistics, everyone knows that whenever a population of behavioural data is studied, the sample must be representative. Better still, the rules of inference (weightings and re-sampling techniques to reduce the margin of error) mean that the results statisticians or researchers obtain from representative samples are corrected at the end.
The sampling technique differs according to the statistical method used. The method used to work out the probability of drawing a white or black ball (where the law of large numbers comes into play and, for example, the 10th draw is comparable with the 143,257th) cannot be treated as the same as the one used for behavioural data, which is subject to different demands in time and space.
In addition to this, what happens to the cumulative data? The sample is different each day. Are the cumulative results that are displayed applicable for the month, quarter or year? I would be grateful if someone could explain the relevance of all of this to me.
Let’s take the following example:
My site generates on average 50 million hits per month and 50,000 visits a day. Sampling limits me (for example) to 10 million hits per month and 25,000 visits a day. Two different methods are possible:
#1 – Stop data collection once the quota has been reached
A list of examples is given below to illustrate the distortion of the results:
#2 – Declare a percentage to be considered
A list of examples is given below to illustrate the distortion of the results:
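As a rough illustration of both methods, the sketch below works through the figures from the example. The helper names (`quota_coverage`, `cutoff_index`) and the hourly traffic profile are assumptions made up for the illustration, not data from any real site or any vendor's actual implementation.

```python
from typing import Optional, Sequence


def quota_coverage(actual_volume: float, quota: float) -> float:
    """Share of the period observed before the quota runs out (uniform traffic assumed)."""
    return min(1.0, quota / actual_volume)


def cutoff_index(volumes: Sequence[int], quota: int) -> Optional[int]:
    """Index of the first slot in which the running total reaches the quota, else None."""
    total = 0
    for i, volume in enumerate(volumes):
        total += volume
        if total >= quota:
            return i
    return None


# Figures from the example above.
month_share = quota_coverage(50_000_000, 10_000_000)  # hits per month
day_share = quota_coverage(50_000, 25_000)            # visits per day

# Hypothetical hourly profile totalling 50,000 visits (an assumption, not real data):
# quiet night, steady working day, busy evening.
HOURLY_VISITS = [500] * 8 + [2_500] * 10 + [3_500] * 6
stop_hour = cutoff_index(HOURLY_VISITS, 25_000)
unseen_visits = sum(HOURLY_VISITS[stop_hour + 1:])

if __name__ == "__main__":
    print(f"Method #1, monthly quota: about {month_share * 30:.0f} of 30 days observed")
    print(f"Method #1, daily quota: collection stops during hour {stop_hour}")
    print(f"Visits in later hours, never observed: {unseen_visits}")
    print(f"Method #2, equivalent rates: {month_share:.0%} of hits, {day_share:.0%} of visits")
```

Under these assumptions, method #1 leaves the end of the month and the whole evening peak unobserved, which is a bias rather than a random error. Method #2 spreads the loss across the whole period, but every figure then has to be scaled up from a fraction of the traffic.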
Sampling and the analyst’s job
“Big Data” is a topic which has been widely discussed, and we are all aware of how important processing “Big Data” is to analysts. What matters is extracting from the mass the “Small Data” which provides important and relevant information, so that the right marketing actions can be assigned to the right marketing levers.
The main tool used to detect this important information is segmentation, which goes hand in hand with fine granularity (as in photography, where finer grain means higher resolution and, as a result, increased sharpness).
It is clear that sampling, once its limits and inaccuracies are taken into account, makes segmentation impossible (or in some cases even dangerous) to use. The more specific the segment, the more inaccurate the sample will be from the outset.
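To put a rough number on this, here is a minimal sketch of how the uncertainty of a scaled-up count grows as the segment shrinks. It assumes the most favourable case, in which every visit is kept independently with the same probability. Real behavioural data rarely satisfies this, so actual errors are likely to be larger. The function name `relative_std_error` is hypothetical.

```python
import math


def relative_std_error(segment_size: int, sampling_rate: float) -> float:
    """Relative standard error of a scaled-up count (sampled count / rate).

    Assumes each visit is kept independently with probability sampling_rate,
    so the sampled count follows a binomial distribution.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    return math.sqrt((1 - sampling_rate) / (sampling_rate * segment_size))


if __name__ == "__main__":
    rate = 0.5  # 25,000 of 50,000 daily visits, as in the example
    for size in (50_000, 5_000, 500, 50):
        print(f"{size:>6} visits in segment: +/- {relative_std_error(size, rate):.1%}")
```

Even in this idealised setting, a segment of 50 visits a day carries a relative error of roughly 14% at a 50% sampling rate, whereas the site-wide total is accurate to well under 1%.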
It is worth reminding those who want to use the power of retargeting that they must rely on exhaustive data, not sampled data.
In this case sampling is either a constraint or a trap for the analyst.
The reassuring fact in all of this is that analysts are very critical, which allows them to separate the true from the false, to fill in deliberate omissions, and to put arguments back in the right order when someone wants to make people believe that something is what it actually isn’t… in other words, that sampling is something good.
If you are interested, please see also this related post: The myth of virtuous sampling in Web analytics (Part 1 of 2)