Lesson 2: The Scope of Statistics
The main activities of a statistician are to describe and to compare populations according to measured parameters (e.g. body weight, number of children per family, glycaemia, income level, red cell count), but also to plan studies. Most often the data is obtained from a sample, or sub-group, of the population. Extrapolating (applying) this data to a larger population may be hazardous. This step, called inference or induction, is a key aspect of statistics. These concerns must be understood and mastered before making medical or financial decisions (e.g. authorising the use of a new drug or its reimbursement by a health insurer). The level of confidence in the results of studies is especially important in the current perspective of Evidence-Based Medicine.
Description is the most familiar and the most derided aspect of statistics. Because of often-quoted figures such as 1.83 children per family, the statistician may be perceived as someone who is out of touch with the real world. In fact, knowing whether this demographic index is 1.83 rather than 1.55 is of primary importance for health policy. In a country of 23 million families, this difference corresponds to 42.1 million vs 35.7 million children, i.e. 6.4 million children for whom immunisation programmes, school attendance, etc. have to be planned. The statistical description of a population is not restricted to the estimation of its mean. It considers the whole population, in terms of main trends (e.g. mean, median) but also in terms of variability.
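The arithmetic behind these figures can be sketched in a few lines of Python. The function name `total_children` is only an illustrative choice; the numbers are those quoted in the text.

```python
# Figures quoted in the text: 23 million families and two candidate
# values of the demographic index (children per family).
FAMILIES = 23_000_000

def total_children(children_per_family: float, families: int = FAMILIES) -> int:
    """Total number of children implied by an average per family."""
    return round(children_per_family * families)

high = total_children(1.83)
low = total_children(1.55)

print(f"index 1.83: {high:,} children")
print(f"index 1.55: {low:,} children")
print(f"difference: {high - low:,} children")
```

A gap of 0.28 in the index looks small, but multiplied by 23 million families it becomes several million children.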
In statistics, a parameter is a characteristic of the distribution of a population, for example the average (mean) weight of subway users (a population) in a particular city.
Let’s say that in order to equip subway stations with lifts, the transportation company, Metroda, needs to know the weight of the subway’s users. Using electronic scales integrated in the floor of passageways, 400 measurements are recorded.
The statisticians provided the result:
Weight = 74.1 ± 18.1 kg
Mean (M or m)
The first estimate describing a population is the mean, here 74.1 kg. The mean is calculated as the sum of all values divided by the number of values (usually called n), i.e. for 400 measurements recorded, n = 400.
Standard Deviation (SD)
The second estimate, ± 18.1 kg, is the Standard Deviation (SD).
The SD does not indicate the range of the observed values (i.e. from the lightest of the 400 subway users (34 kg) to the heaviest (136 kg) in this population). It is a measure of the dispersion, or variability, of the parameter (in the subway example, weight) among the individuals tested. The more heterogeneous the population, the larger the SD.
It applies symmetrically on each side of the mean, as suggested by the sign ±, plus or minus, between the mean and the SD.
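As a minimal sketch, the mean and the SD can be computed with Python's standard `statistics` module. The eight weights below are invented for illustration (they are not the Metroda measurements), and the sketch assumes the sample version of the SD, which divides by n - 1.

```python
import statistics

# Hypothetical weights in kg, invented for illustration only.
weights = [52.0, 61.5, 68.0, 74.0, 74.5, 80.0, 88.5, 95.5]

n = len(weights)
mean = sum(weights) / n            # sum of all values divided by n
sd = statistics.stdev(weights)     # sample SD (n - 1 in the denominator)

print(f"n = {n}")
print(f"Weight = {mean:.1f} ± {sd:.1f} kg")
print(f"Observed range: {min(weights)} to {max(weights)} kg")
```

Printing the range next to the SD makes the point of the text visible: the SD is a typical distance from the mean, not the distance between the lightest and the heaviest individual.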
In addition to the figures, the statisticians provided a histogram representing the distribution of the weight of the subway users.
For each 10 kg class, it shows the corresponding number of individuals, e.g. the weight of 56 individuals was in the 50-60 kg range, that of 88 individuals in the 70-80 kg range, etc.
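Sorting measurements into 10 kg classes, as the histogram does, might be sketched as follows. The weights are again invented for illustration, and `class_start` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical weights in kg, invented for illustration only.
weights = [48.2, 55.0, 58.9, 63.4, 67.1, 71.0, 72.6, 74.1, 76.8, 79.9, 83.3, 91.5]

CLASS_WIDTH = 10  # kg, the class width used in the histogram described above

def class_start(weight: float, width: int = CLASS_WIDTH) -> int:
    """Lower bound of the class containing `weight` (e.g. 56.3 -> 50)."""
    return int(weight // width) * width

counts = Counter(class_start(w) for w in weights)

# Text histogram: one '#' per individual in each class.
for start in sorted(counts):
    bar = "#" * counts[start]
    print(f"{start:3d}-{start + CLASS_WIDTH:<3d} kg | {bar} ({counts[start]})")
```

In this sketch a value exactly on a boundary (e.g. 60.0 kg) goes to the upper class; the text does not say which convention the original histogram used.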
This overall profile corresponds to the famous Normal, or Gaussian, Distribution.
Named after the German mathematician Carl Friedrich Gauss, this mode of distribution is typical of situations resulting from numerous influences (i.e. with multifactorial determinants), frequently observed in the living world.
The Multiple Origins of the Biological Variability
A Gaussian distribution of a parameter results from multifactorial determinants.
Among all these determinants, there is a low probability that, for a given individual, all lead to a high or to a low value, which explains the limited number of individuals in the extreme classes of weight. In contrast, many combinations of determinants may result in medium values, leading to a high number of individuals in the middle of the distribution. For these reasons, biologic parameters, which have multifactorial determinants, follow this bell-shaped distribution, the Gaussian or Normal Distribution.
The more the determinants, the more Gaussian the distribution:
What is a Statistical Model?
Let’s have a closer look at the Gaussian or bell curve which, despite its symmetrical form, is the result of many random characteristics. The study of these characteristics resulted in a tool box that contains methods used to produce the various aspects of the curve. The first tool forms the distribution of the individuals around the mean value by using a Standard Deviation or SD as unit.
In a Gaussian distribution:
- approximately 68% of individuals are in the dark orange interval [mean - 1 SD to mean + 1 SD], i.e. here [56.0 kg to 92.2 kg]
- approximately 16% of individuals are over the (mean + 1 SD) limit, i.e. here >92.2 kg (yellow surfaces)
- approximately 2.5% of individuals are over the (mean + 2 SD) limit, i.e. here >110.3 kg (light yellow surface)
- approximately 2.5% of individuals are below the (mean - 2 SD) limit, i.e. here <37.9 kg (pink surface)
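These proportions can be checked with `statistics.NormalDist` from the Python standard library. This is a sketch that assumes the weights follow an exact Normal distribution with the mean and SD of the subway example.

```python
from statistics import NormalDist

# Mean and SD from the subway example in the text.
MEAN, SD = 74.1, 18.1
weight = NormalDist(mu=MEAN, sigma=SD)

# cdf(x) is the share of the population at or below x.
within_1sd = weight.cdf(MEAN + SD) - weight.cdf(MEAN - SD)
above_1sd = 1 - weight.cdf(MEAN + SD)
above_2sd = 1 - weight.cdf(MEAN + 2 * SD)
below_2sd = weight.cdf(MEAN - 2 * SD)

print(f"between {MEAN - SD:.1f} and {MEAN + SD:.1f} kg: {within_1sd:.1%}")
print(f"above {MEAN + SD:.1f} kg: {above_1sd:.1%}")
print(f"above {MEAN + 2 * SD:.1f} kg: {above_2sd:.1%}")
print(f"below {MEAN - 2 * SD:.1f} kg: {below_2sd:.1%}")
```

Under this model the share within 1 SD is about 68.3% and the share above mean + 1 SD about 15.9%. Each tail beyond 2 SD holds about 2.3%, which the list above rounds to 2.5% (the exact 2.5% limit sits at 1.96 SD).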
The formula to calculate a standard deviation is:
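One standard form, written here on the assumption that the sample version is meant (it divides by n - 1; the population version divides by n instead), with x_i the individual values, n their number and m their mean:

```latex
SD = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - m\right)^{2}},
\qquad
m = \frac{1}{n}\sum_{i=1}^{n} x_i
```

The square of the SD is the variance.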
N.B. Here, the term normal distribution describes a population; it does not discriminate between normal and pathological values.
Basic vocabulary: Words you should know
Average – Mean – Median – Mode – Standard Deviation
Technical Vocabulary: Important words to know
Biostatistics – Distribution – Normal distribution – Statistics – Variance
Advanced Vocabulary: Useful words to know
Binary variable – Continuous variable – Dichotomous variable – Geometric mean – Harmonic mean – Percentile – Poisson distribution – Quartile
Throwing the dice
The aim of this section is to become more familiar with some practical aspects of distributions, i.e. description of a set of data. Let’s use dice to generate data by throwing them and noting the result. Each value (from 1 to 6) will be collected to build a histogram by filling in corresponding rectangles (on a squared sheet or on a screen copy of the figure on the left). If the first value is a 3, we can colour in the figure on the left in yellow. We will then repeat this action 100 times, and compare the results below.
By throwing the dice 100 times, we might expect to obtain roughly the same number of 1s, 2s, 3s, 4s, 5s and 6s; in other words, one chance in six (16.67%) for each column. However, as you can see, experimental values are not so regular. In our example, the counts range from 11 (for 6) to 23 (for 2). Of course, your own results will be different, but the dispersion of values is probably similar. This experiment illustrates the role that randomness plays in results. In fact, even though it is rather tedious to collect, a set of 100 values is only a limited sample.
By throwing the dice 800 times and 2,400 times, we obtained the following histograms:
The gaps between the observed frequencies and the theoretical 16.67% narrow with larger samples of data. As you can see, the more data collected, the flatter the distribution. The large amount of data reduces the effect of randomness on the distribution. The difference in height between the six columns is smaller when the number of throws of the dice is increased from 100 to 800 and from 800 to 2,400. This type of distribution, called a uniform distribution, is rarely observed for biologic parameters.
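If you would rather not throw a dice 2,400 times by hand, the experiment can be simulated. This is a minimal sketch using Python's `random` module; the seed value is arbitrary and the helper names are illustrative.

```python
import random
from collections import Counter

def throw_dice(n_throws: int, rng: random.Random) -> Counter:
    """Throw one six-sided dice n_throws times and count each face."""
    return Counter(rng.randint(1, 6) for _ in range(n_throws))

def max_gap_from_uniform(counts: Counter, n_throws: int) -> float:
    """Largest absolute gap between an observed share and the theoretical 1/6."""
    return max(abs(counts[face] / n_throws - 1 / 6) for face in range(1, 7))

rng = random.Random(2024)  # arbitrary seed, only to make the run repeatable
results = {}
for n in (100, 800, 2400):
    counts = throw_dice(n, rng)
    results[n] = counts
    shares = "  ".join(f"{face}:{counts[face] / n:6.1%}" for face in range(1, 7))
    print(f"{n:5d} throws  {shares}  max gap {max_gap_from_uniform(counts, n):.1%}")
```

The largest gap from 16.67% tends to shrink as the number of throws grows, although any single run can deviate from this trend because the throws are random.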
Closer to a Gaussian distribution
This time, instead of throwing one dice 100 times, we will throw 6 dice 100 times to mimic the influence of six different determinants on a simulated biological parameter in a given person. The total of a six-dice throw may vary from 6 (six ones) to 36 (six sixes). In order to simplify the histogram, the values will be grouped (6, 7 or 8 in the first column, 9, 10 or 11 in the second, etc.)
Biologic parameters usually depend on numerous factors that can be genetic or environmental.
- The first dice represents the maternal grandmother heritage of the person
- The second one represents the maternal grandfather heritage
- Another, the paternal grandmother heritage
- Another, the paternal grandfather heritage
- Another, the age of the person
- And the last one, the person’s nutritional habits
So let’s throw 6 dice and add their values (in our example the total is 19) and colour one rectangle in the corresponding column (here the 18 to 20 as shown on the figure above in yellow). If we repeat 100 throws and build our own histogram, it would look like this:
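The six-dice experiment can also be simulated. The sketch below assumes columns of three totals starting at 6 (6-8, 9-11, and so on), which is the grouping that puts a total of 19 in the 18 to 20 column; the seed is arbitrary and the helper names are illustrative.

```python
import random
from collections import Counter

N_DICE = 6
GROUP_WIDTH = 3  # assumed grouping: 6-8, 9-11, ..., 33-35, then 36 alone

def throw_total(rng: random.Random, n_dice: int = N_DICE) -> int:
    """Throw n_dice six-sided dice and return the sum of their faces."""
    return sum(rng.randint(1, 6) for _ in range(n_dice))

def group_start(total: int) -> int:
    """Lowest total of the column that `total` falls in (e.g. 19 -> 18)."""
    return 6 + ((total - 6) // GROUP_WIDTH) * GROUP_WIDTH

rng = random.Random(7)  # arbitrary seed, only to make the run repeatable
totals = [throw_total(rng) for _ in range(100)]
columns = Counter(group_start(t) for t in totals)

# Text histogram: one '#' per throw whose total falls in the column.
for start in range(6, 37, GROUP_WIDTH):
    end = min(start + GROUP_WIDTH - 1, 36)
    print(f"{start:2d}-{end:2d} | {'#' * columns[start]}")
```

Most runs show long bars in the middle columns and empty or nearly empty bars at both ends.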
Compared to the uniform distribution obtained with one dice, the distribution generated with 6 dice is closer to the bell-shaped Normal distribution. You will notice that extreme values are observed less often or not at all. For example, we never threw 6 ones or 6 sixes! The probability of observing a total in the 6-7-8 range is only 28 in 46,656, i.e. about 1 in 1,700, and the same is true of the 34-35-36 range. In contrast, central values are frequently observed because many combinations may lead to them. Of course, this experimental distribution is not a perfect Gaussian distribution. Nevertheless, it shows that a very rudimentary device, far from the biological diversity, tends to generate a Gaussian distribution. This example illustrates why distributions close to Normal are so common in the complex living world.
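The probabilities of the extreme and central totals can be checked exactly by listing every possible outcome of six dice. This sketch assumes fair dice, so that all outcomes are equally likely; `probability` is an illustrative helper name.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count, for each total, how many of the 6**6 equally likely outcomes give it.
ways = Counter(sum(faces) for faces in product(range(1, 7), repeat=6))
total_outcomes = sum(ways.values())

def probability(low: int, high: int) -> Fraction:
    """Exact probability that the six-dice total lies between low and high."""
    favourable = sum(ways[t] for t in range(low, high + 1))
    return Fraction(favourable, total_outcomes)

p_low = probability(6, 8)
p_high = probability(34, 36)
p_centre = probability(18, 24)

print(f"possible outcomes: {total_outcomes:,}")
print(f"total 6 to 8:   {float(p_low):.5f} (about 1 in {round(1 / p_low):,})")
print(f"total 34 to 36: {float(p_high):.5f} (about 1 in {round(1 / p_high):,})")
print(f"total 18 to 24: {float(p_centre):.3f}")
```

Only 28 of the 46,656 possible outcomes give a total of 6, 7 or 8, and by symmetry the same number give 34, 35 or 36, whereas thousands of combinations lead to each central total.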