The myth of virtuous sampling in Web analytics (Part 2 of 2)

We have just seen that sampling benefits suppliers, but what does it do for customers? Vendors claim that sampling improves performance and speed, and we have just seen how. But do they tell us what sampling does to the relevance of the results? They tell us nothing, and this is where the problem lies. Here once again, the language is vague and leads to confusion. Some companies justify their use of sampling by repeating that it is frequently used in statistics. This argument is false: lumping the two together is meaningless, and it is even extremely dangerous.

Sampling: how, and what are the consequences?

In statistics, everyone knows that whenever a population of behavioural data is studied, the sample must be representative. Better still, the rules of inference (weightings and re-sampling techniques to reduce the margin of error) mean that the results statisticians or researchers obtain from representative samples are corrected at the end.

The sampling technique depends on the statistical method used. The method used to work out the probability of drawing a white or black ball (where the law of large numbers comes into play and, for example, the 10th draw is comparable with the 143,257th) cannot be treated as the same as the one used for behavioural data (where visitors are prompted by different things at different times and in different places).

In addition to this, what happens to the cumulative data? The sample is different each day. Are the cumulative results that are displayed applicable for the month, quarter or year? I would be grateful if someone could explain the relevance of all of this to me.

Let’s take the following example:

My site generates on average 50 million hits per month and 50,000 visits a day. Sampling limits me (for example) to 10 million hits per month and 25,000 visits a day.
Two different methods are possible:

#1 – Stop data collection once the quota has been reached

A list of examples is given below to illustrate the distortion of the results:

The production department releases updates on Wednesday and Friday at 17:00 (including flash offers).
On Wednesday my “quota” is reached at 18:00 and my updates are only partly taken into consideration.
On Friday my “quota” is reached at 16:00 and my updates are not considered at all (even though visitors who come to my site at 17:00 behave considerably differently from those who come at 16:00).
My super-sales newsletter is published on Tuesday morning: who can seriously tell me that Tuesday’s sample (quota reached at 11:00) can be compared with, and added to, Wednesday’s or Friday’s sample? What conclusions can then be drawn from the total of three different populations who are prompted by completely different things, who behave differently, and who represent a different share of the audience on each reference day?
A study can also be carried out on the cumulative hits for November (10 million hits retained out of 20 million) and December (10 million hits retained out of 100 million). The 20 million hits retained are not very representative of the total of 120 million. What can be said about the average?
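
The distortion described in these examples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names cutoff_hour and sample_vs_reality are invented for this sketch, and the flat hourly profile is made up. Only the November and December figures come from the example above.

```python
def cutoff_hour(hourly_visits, daily_quota):
    """Return the hour (0-23) during which the quota is reached, or None if it never is."""
    total = 0
    for hour, visits in enumerate(hourly_visits):
        total += visits
        if total >= daily_quota:
            return hour
    return None


def sample_vs_reality(retained, actual):
    """For each period, return (share of the combined sample, share of the real traffic)."""
    total_retained = sum(retained.values())
    total_actual = sum(actual.values())
    return {
        period: (retained[period] / total_retained, actual[period] / total_actual)
        for period in retained
    }


# Hypothetical flat day: 2,000 visits per hour against a daily quota of 25,000 visits.
flat_day = [2_000] * 24
print("Collection stops during hour:", cutoff_hour(flat_day, 25_000))

# Figures taken from the November/December example above.
retained = {"November": 10_000_000, "December": 10_000_000}
actual = {"November": 20_000_000, "December": 100_000_000}
for period, (in_sample, in_reality) in sample_vs_reality(retained, actual).items():
    print(f"{period}: {in_sample:.0%} of the sample, {in_reality:.0%} of the real traffic")
```

With these figures, November supplies half of the combined sample although it accounts for only one sixth of the real traffic, so any average computed on the sample is pulled towards November. Everything that happens after the cut-off hour (a 17:00 release, for instance) is simply absent from the data.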

#2 – Declare a percentage to be considered

A list of examples is given below to illustrate the distortion of the results:

My history displays 14 million hits and 360,000 visits.
I will ask for 70% of the data to be collected in order to respect the quota.
Seasonal variations: if, for example, the traffic for December is twice that of any other month, then a rate of 70% is too high. The rate will have to be reduced to 35%, meaning that no data is collected beyond that 35%.
If, on the other hand, February is a weak month (half of a normal month), then there is no point in sampling, since the real volume is below the quota.
Is the rate adaptable?
Who, in this case, decides whether sampling should be applied or not? On what basis, and at what rates, will sampling be applied?
How can this rate be defined without any previous knowledge of the traffic volume for the period given?
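
The arithmetic behind these questions can be written down as a short Python sketch. It assumes that the quota of 10 million hits per month from the earlier example still applies and that a "normal" month is the 14 million hits shown by the history. The function names sampling_rate and collected_hits are hypothetical, not part of any analytics tool.

```python
def sampling_rate(quota, expected_hits):
    """Fraction of traffic to collect so that expected_hits fits within quota (capped at 1.0)."""
    if expected_hits <= 0:
        raise ValueError("expected_hits must be positive")
    return min(1.0, quota / expected_hits)


def collected_hits(rate, actual_hits):
    """Hits actually collected when a fixed rate meets the real traffic."""
    return rate * actual_hits


QUOTA = 10_000_000          # assumed monthly quota, taken from the earlier example
NORMAL_MONTH = 14_000_000   # hits shown by the history

scenarios = [
    ("normal month", NORMAL_MONTH),
    ("December (twice normal)", 2 * NORMAL_MONTH),
    ("February (half normal)", NORMAL_MONTH // 2),
]
for label, expected in scenarios:
    print(f"{label}: rate needed = {sampling_rate(QUOTA, expected):.0%}")

# A rate fixed in advance at 70% meets a December that turns out twice as busy.
overshoot = collected_hits(0.70, 2 * NORMAL_MONTH)
print(f"Collected at 70% in December: {overshoot:,.0f} hits for a quota of {QUOTA:,}")
```

The right rate can only be computed once the traffic for the period is known. Set in advance, a rate of 70% either blows through the quota in a strong month or throws data away for nothing in a weak one.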

Sampling and the analyst’s job

“Big Data” is a widely discussed topic, and we are all aware of how important processing “Big Data” is to analysts. The point is to extract from the mass the “Small Data” that exists within it and that provides important and relevant information, so that the right marketing actions can be assigned to the right marketing levers.

The main tool used to detect this important information is segmentation, which calls for fine granularity (as in photography, where a finer grain means higher resolution and, as a result, greater sharpness).

It is clear that sampling, once its limits and inaccuracies are taken into account, makes segmentation impossible (or in some cases even dangerous) to use. The more specific the segment, the more inaccurate the sample is from the outset.
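
A rough way to see this is the standard margin of error for a proportion, 1.96 * sqrt(p * (1 - p) / n) at the 95% level. The Python sketch below applies it to ever narrower segments. It is a best case, since it assumes the sample is a simple random one, which is precisely what the quota methods above do not guarantee. The 2% conversion rate and the segment shares are invented for illustration, and 25,000 is the daily visit quota from the earlier example.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p measured on n randomly sampled visits."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1 - p) / n)


SAMPLED_VISITS = 25_000   # daily visit quota from the earlier example
CONVERSION_RATE = 0.02    # hypothetical rate, for illustration only

for segment_share in (1.0, 0.10, 0.01, 0.001):
    n = round(SAMPLED_VISITS * segment_share)
    moe = margin_of_error(CONVERSION_RATE, n)
    print(f"segment = {segment_share:.1%} of visits, n = {n}: "
          f"{CONVERSION_RATE:.1%} +/- {moe:.2%}")
```

For the narrowest segment only 25 sampled visits remain, and the margin of error is larger than the rate being measured.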

It is worth reminding those who want to use the power of retargeting that they must rely on exhaustive data and not on sampled data.

In this case sampling is either a constraint or a trap for the analyst.

The reassuring fact in all of this is that analysts are very critical, which allows them to sort the true from the false, to fill in deliberate omissions, and to put the arguments back in the right order when someone wants them to believe that something is what it is not... in other words, that sampling is a good thing.

If you are interested, please see also this related post: The myth of virtuous sampling in Web analytics (Part 1 of 2)