Tutorial Session One – Normal Statistics
The session with EcoSSe described below is intended as an example run
to familiarise the user with the package. This
documented example takes you through the following sequence of analyses:
- Summary statistics and scatterplots of the data
- Scatterplot of the data using transforms
- Fitting a Normal distribution to the data
- Fitting a mixture of two Normal distributions
There are many other facilities within the
package, which are given as alternative options on the menus. To start the
tutorial, choose EcoSSe from your Start menu. When you run EcoSSe,
a record is kept of everything you do in that run. The default name for this
file is ghost.lis and the default location
for the file is the folder where your copy of EcoSSe is kept. The first dialog
you will see is:

You may change the name of the file, or accept
the default. Note you must type in the whole name including extension, since no
default extension is offered in this case. For example, if you want to call
your ghost file “myghost.lis” you need to type the
whole name, not just “myghost”.
If you already have a file with this name,
Windows will issue a warning:

Click on
to specify a new name or
to overwrite the previous copy of this file.
Your screen should now show something like:

The output above is the opening screen. To
proceed to data analysis, use one of the menus at the top of the Window.

As you can see from the above I have elected to
read in a set of sample data by clicking on the
option and selecting
from the menu which appears. EcoSSe
will remember the last five data files accessed and include these in your
options.
I have selected BROOMSBARN.DAT for my input data file. This is a set of 27 boreholes taken from a
lease area at project (pre-feasibility) stage in the life of a typical

Even if you select a file from the list of
previously analysed data files, EcoSSe will ask you to confirm your choice. This
is actually a quick way of getting back to your working directory, since you
can change your choice at this point. Be warned, though, that if you change
which file you want to read it must be the same type of file – that is, if you
are reading a standard Geostokos data file, you cannot change your mind at this
point and read in a CSV type file.

For this example, we will stick with BROOMSBARN. As your data is read in, it is stored on a working binary file. A
progress bar will indicate how far the process has gone. When data input is
complete, your Window should look like the table above.
The layout of data files is described in detail
in the main EcoSSe
documentation. The routine which has been used shows the first 10 lines of
your data file so that you can check it is going in OK.
When the data has been read in you will see
that the previously "greyed out" or inaccessible options on the main
window toolbar will become activated. You can now select an option. Let us
decide upon a statistical analysis. To do this, click on the
option on the main toolbar.

If you choose the
option, the software will display and summarize the
data set, giving you an idea of what it looks like in
a simpler form than the full numerical listing.
The screen will switch to a dialog which will
prompt you to choose the two variables for the axes of your graph.
The active screen in the top left hand corner contains
the variables available for analysis in your data file. The bottom right box
shows the variables already chosen (which at this point is none).
The
dialog box shows you that you are expected to
select variables to be the X co-ordinate and the Y co-ordinate for your
scattergram. The upper left dialog box
lists the variable names as they appeared in
the data file, and is prompting you to choose the variable which will be the X
co-ordinate on the graph. For this example, let us choose “K” (potassium) for
the X co-ordinate. You need to check the box next to the “K” option.
Upon selecting the “K” option, a new dialog box
will appear asking you whether you wish to transform the variables to
logarithms or rank transforms. In this case we do not wish to transform so we
click on
. The
dialog disappears and you will be asked for the Y co-ordinate:

I selected the “P” (phosphorus) option by clicking
its check box. The transformation dialog again appears, from which we choose
not to transform the variable by clicking on
.
The lower dialog moves up to the top left and
displays your current working variables.

The
and
buttons have now been activated. If you change
your mind at this point, simply click on the
button and you will be returned to the
original dialogs.
Clicking
will show you your scattergram. The
scattergram is scaled to fit the whole of the display box or area.

Please note that even though you have chosen
'geographical' variables, the scale chosen is for the maximum display size. If
you want points plotted on a 'geographical' scale (same for both axes) you must
use the post-plotting routine which is available elsewhere in EcoSSe.
In the left-hand box of the graphical display,
you will see the summary statistics for both variables
plus the product moment correlation coefficient and the number of samples for
which both variables were available. We can see from this graph that both
variables tend to be “skewed” with a preponderance of lower values and a long
scatter out into higher values.
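
The figures shown beside the scattergram can be reproduced outside the package. The following is a minimal sketch, not EcoSSe's own code: the function name `paired_summary` and the sample values are hypothetical, and for simplicity every statistic is computed only from samples where both variables are present.

```python
import math

def paired_summary(xs, ys):
    """Summary statistics for two variables plus their product moment
    (Pearson) correlation, using only pairs where both values exist."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of squares and cross products about the means
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "sd_x": math.sqrt(sxx / (n - 1)),
        "sd_y": math.sqrt(syy / (n - 1)),
        "correlation": sxy / math.sqrt(sxx * syy),
    }

# Made-up values for illustration only (not taken from BROOMSBARN.DAT)
k = [12, 20, 25, 31, None, 96]
p = [2.0, 3.5, 4.1, 5.0, 6.2, 9.8]
summary = paired_summary(k, p)
print(summary)
```

The sample with a missing "K" value is dropped, so the count reported is 5 rather than 6.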
When the graph is completed, you can select a
new option from the main toolbar. You may wish to plot another graph in which case
you must click on
and select the
option again.
Scattergram 2, using transformations
To illustrate the use of the transformations
for the variables, we draw another graph using the variables on this file
“Log10 K” and “Log10 P”. It is obvious that the person constructing this data
file was aware of the “skewness” of the K and P measurements as illustrated in
the previous scattergram. By adding a column of logarithms, the analyst hopes
to make the scatter more symmetric.
Upon selecting the
option
your screen should show:

EcoSSe will remember your previous selection. To change it,
you must click on
to redefine your variables. You will again be asked
to select the X co-ordinate and the Y co-ordinate. For the first variable we
simply take logarithms. For the second we add a constant to the variable so
that the transformation actually becomes
For the X co-ordinate check the corresponding
box of the “Log10 K” option.
The transformation dialog again appears, from
which we choose not to transform the variable by clicking on
. For
the vertical axis, choose “Log10 P” and no transformation.
Verify your choices as prompted:

A scattergram of these two variables will be
produced, with a table of statistics on the left hand side.

Of course, we could produce a virtually identical plot without using the
extra columns in the data file, by using the logarithmic transform available
within the software. The major difference is that EcoSSe uses
natural logarithms – loge or ln where e=2.718282 – not
logarithms to the base 10. Repeat the above sequence of actions, but select
the basic variables and the logarithmic transform.
Select the
option.
Your screen should show:

EcoSSe remembers your previous selection. To change it,
you must click on
to redefine your variables. You will again be
asked to select the X co-ordinate and the Y co-ordinate.
For the X co-ordinate check the corresponding
box of the “K” option and then click on “take natural logarithms” in the
dialog.
EcoSSe also
allows a variation of logarithmic transform which includes an “additive
constant”. If you are interested in this option, please refer to the Tutorial
1A on lognormal statistics. Click on
to confirm that you want logarithmic
transform. You can still cancel this option by clicking on
instead of
. For the vertical axis, choose “P” and
logarithmic transformation.
Note the difference in the names supplied for
your variables. Verify your choices as prompted:

A scattergram of these two variables will be
produced, with a table of statistics on the left hand side.

Note that the overall picture is
identical whether you use log10 or natural logarithms. The values will
differ by a factor of loge(10) or log10(e) {2.302585 or 0.434294}, depending on which way round you look at them. The
correlations are identical for both logarithmic transforms.
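
A short check of this claim, as a sketch with made-up sample values (the helper `pearson` and the lists `k` and `p` are hypothetical, not part of EcoSSe): changing the base of the logarithm only rescales every value by a constant, and the correlation coefficient is unaffected by such rescaling.

```python
import math

def pearson(xs, ys):
    """Product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up positive values for illustration only
k = [12.0, 18.0, 25.0, 31.0, 47.0, 96.0]
p = [1.5, 2.2, 4.0, 3.1, 7.5, 12.0]

log10_k = [math.log10(v) for v in k]
log10_p = [math.log10(v) for v in p]
ln_k = [math.log(v) for v in k]
ln_p = [math.log(v) for v in p]

# Each natural logarithm is the base 10 logarithm times loge(10)
factor = ln_k[0] / log10_k[0]

r_log10 = pearson(log10_k, log10_p)
r_ln = pearson(ln_k, ln_p)
print(factor, r_log10, r_ln)
```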
[Three panels of statistics: from original variables K and P; from original variables log10 K and log10 P; from logarithms of original variables K and P]
Looking at descriptive statistics and histograms
To illustrate the use of histograms and
descriptive statistics, we use the “K” variable in the BroomsBarn
data set. Select the
menu and click on the option
:
Choose the “measurement to be analysed” in the
same way as previously:
Click in the box next to “K” and the
transformation options will be offered:
For the moment we will make no transformation of
the values, so click on
. As
usual, you will be asked to confirm your choice of variable:

Click on
.
Various summary statistics will be shown in two dialogs:

The table in the top left hand corner shows the
usual descriptive statistics, with one small exception. The ‘higher order’
statistics – standard deviation, skewness and kurtosis – are divided by (n-1)
and not n, the number of samples. That is:
statistic | formula
Arithmetic mean | = sum of values divided by n
Variance (square of standard deviation) | = sum of each (sample value – average)² / (n-1)
Skewness | = sum of each (sample value – average)³ / (n-1), divided by standard deviation cubed
Kurtosis | = sum of each (sample value – average)⁴ / (n-1), divided by standard deviation to the power 4 (or variance squared)
Coeff. of variation | = standard deviation divided by arithmetic mean
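
The formulas in this table can be written out as a minimal sketch. The function name `descriptive_statistics` is hypothetical and this is not EcoSSe's own code. The (n-1) divisor is used for the variance, skewness and kurtosis as described above, and the coefficient of variation is taken in its usual sense of standard deviation divided by mean.

```python
import math

def descriptive_statistics(values):
    """Mean, variance, skewness, kurtosis and coefficient of variation,
    with the higher order sums divided by (n-1) rather than n."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    sd = math.sqrt(variance)
    # Third and fourth moments, standardised by powers of the standard deviation
    skewness = (sum((v - mean) ** 3 for v in values) / (n - 1)) / sd ** 3
    kurtosis = (sum((v - mean) ** 4 for v in values) / (n - 1)) / sd ** 4
    return {
        "n": n,
        "mean": mean,
        "variance": variance,
        "standard_deviation": sd,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "coeff_of_variation": sd / mean,
    }

# A small symmetric example: the skewness comes out as zero
stats = descriptive_statistics([1, 2, 3, 4, 5])
print(stats)
```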
In an ideal universe, where the population would
follow a Normal distribution, the mean and standard deviation (divided by n-1)
of the samples are ‘best’ estimates for the mean and standard deviation of that
population. The skewness statistic would be:
We standardise by the standard deviation cubed,
to remove the original variability of the samples and to obtain a statistic
which actually reflects shape rather than spread. Similarly
with the kurtosis. An ideal Normal distribution has a kurtosis of 3. A
value less than 3 suggests that the shape of the histogram will be flatter than
the ideal
Note: some software packages subtract 3 from the
kurtosis statistic, so that negative values may be encountered!
The coefficient of variation is also a (more
empirical) measure of skewness BUT only for positive skewness and
only if the values cannot take negative values. This implies that the statistic
is, for example, useless when using a logarithmic transform where values can be
negative.
Defining the necessary parameters for a histogram
A histogram is a graph which shows how the
values vary amongst our samples. The graph shows value along the horizontal
axis, which should (therefore!) reflect the range of our data values. The vertical
axis is, technically, “frequency density”. This is not the actual number of
samples within a defined interval, but the number divided by the width of the
interval. The difference is pretty academic if your histogram intervals are all
the same size.
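
As an illustration of the difference between a count and a frequency density, here is a minimal sketch for equal-width intervals. The function name and the sample values are hypothetical; intervals are assumed to include their lower edge and exclude their upper edge, which may not match the convention used by the package.

```python
def frequency_density(values, start, width, n_intervals):
    """Count samples in each equal-width interval and convert the
    counts to frequency density (count divided by interval width)."""
    counts = [0] * n_intervals
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_intervals:  # values outside the axis are ignored
            counts[index] += 1
    densities = [c / width for c in counts]
    return counts, densities

# Made-up values; intervals are [13,15), [15,17), [17,19), [19,21)
counts, densities = frequency_density([13, 14, 15, 16, 20], start=13, width=2, n_intervals=4)
print(counts)     # [2, 2, 0, 1]
print(densities)  # [1.0, 1.0, 0.0, 0.5]
```

With equal widths the densities are simply the counts divided by a constant, so the shape of the histogram is the same either way.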
The software offers you default parameters for
constructing the histogram, based on a simplistic assumption of basic Normality
of the population. The average value is placed at the centre of the horizontal
axis (values). The number of intervals is calculated as n/10 – or 12 if this
comes out smaller than 12. The width of the intervals is selected to give a
range of around 2 or so standard deviations either side of the average value.
If the first interval falls lower than the lowest sample value, this is
adjusted to be a little more sensible. For the “K” values in the Brooms Barn
data set, the basic statistics convert to histogram parameters as follows:
Of course, there is no guarantee that these
default parameters are at all sensible. For example, a brief inspection of the
Brooms Barn data shows that “K” is only measured in whole numbers. It seems a
little silly, then, to choose a histogram interval of 1.2! If we amend this
width to 1, we will have 43 intervals of 1 added to the lowest interval value
of 14. The highest value shown on the histogram will be 57 – but the data
values go up to 96. For a more sensible interval on this run, choose 2. I have
also adjusted the lowest interval to 13 instead of 14. Our final options look
like:

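
The arithmetic behind these choices can be checked in a few lines. This is only a sketch: the function name is hypothetical, and it assumes that the number of intervals stays at 43 when the width is changed to 2, which the text above does not state.

```python
def histogram_upper_limit(lowest, width, n_intervals):
    """Highest value shown on a histogram with equal-width intervals."""
    return lowest + width * n_intervals

data_max = 96  # largest "K" value quoted in the text

# Width 1 starting at 14: the axis stops well short of the largest value
limit_width_1 = histogram_upper_limit(lowest=14, width=1, n_intervals=43)
print(limit_width_1, limit_width_1 >= data_max)  # 57 False

# Width 2 starting at 13 (assuming 43 intervals are kept)
limit_width_2 = histogram_upper_limit(lowest=13, width=2, n_intervals=43)
print(limit_width_2, limit_width_2 >= data_max)  # 99 True
```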
Accepting these parameters results in a new
menu bar appearing at the top of your screen:
.
Select
and choose
:

This will result in a full screen picture of
the histogram, with the associated statistics still in the top left hand
corner:

We can see that the histogram is skewed towards
the left hand side of the graph, with more values squashed between 12 and 26
than between 26 and 96. This shape gives the positive skewness of just over 2
and a kurtosis around 4 times that of the ideal Normal distribution. If we look
at the logarithms of the sample values, we hope to stretch out the lower end
and squash in the upper end. With this data set, we can look at the column “Log10
K” – alternatively we could use the natural logarithm transform available
within the software.
In any case, we need to start again by selecting
,
then select the
menu and click on the option
. Using the same procedure as before, change from “K” to “log10 K”.
The new summary statistics are shown, as before. Note that the skewness is now
less than 0.4 and the kurtosis is just under 3.6.
The suggested histogram parameters are 43
intervals, starting at 1.2 and with a width of 0.016. The logarithmic (base 10)
values vary from 1.0792 to 1.9823. The default histogram is shown as the first
graph below. Alongside, we show three alternative histograms just changing the
interval width each time.
[Four histograms of “log10 K”: accepting default histogram parameters; choosing an interval width of 0.1; interval width at 0.05; interval width of 0.025]
Of the four above, the interval width at 0.05
seems to compromise between “lots of intervals, lots of detail” and getting a
real idea of what the shape of the population
might be. If we want to do any statistical inference or estimation, it is the
shape of that population we have to predict. In an ideal world, we would want to
have a
menu, select and click on:

You will be offered a set of choices:
For now, just click on
. A
new dialog appears showing the mean and standard deviation as estimated from
your sample data:
.
If you click on
, the
software will superimpose a perfect Normal distribution with this mean and
standard deviation (see upper graph on the next page). Note the χ² (chi-squared) goodness of fit statistic of 16.72 with 10 degrees of
freedom. Checking with any table of χ²
statistics – for example, Table 3 in Practical Geostatistics – shows that, with
10 degrees of freedom, a statistic of 15.99 and over would be encountered 1
time in 10 if this
If you click on
, the
software will find the mean and standard deviation of the Normal distribution
which most closely fits your histogram (see lower graph on the next page). The
software had 5 ‘iterations’ (attempts) at fitting a better model, before coming
up with this solution.

above:
below: best fit (least squares)

The arithmetic mean of the
Bear in mind that we do not expect an exact
match between data and model – unless we have very many samples in our data
set. To use statistical inference, we assume that the samples we have are drawn
from a much larger population “at random and independently”. If this is true,
the χ² goodness of fit statistic should vary around the value of the ‘degrees
of freedom’. A χ² goodness of fit statistic which is too small is just as worrying as
one which is too large, suggesting that sampling has been influenced to produce
an idealised histogram.
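
For readers who want to see how a χ² goodness of fit statistic is built up, here is a minimal sketch comparing histogram counts with a Normal model. It is not EcoSSe's own calculation: the function name is hypothetical, expected counts are rescaled to the range covered by the histogram (an assumption made here for simplicity), and no pooling of sparse intervals is attempted.

```python
from statistics import NormalDist

def chi_squared_statistic(observed, edges, mean, sd):
    """Chi-squared goodness of fit of histogram counts to a Normal model.

    observed: counts per interval; edges: interval boundaries
    (one more entry than observed)."""
    model = NormalDist(mean, sd)
    probs = [model.cdf(hi) - model.cdf(lo) for lo, hi in zip(edges[:-1], edges[1:])]
    covered = sum(probs)  # model probability inside the histogram range
    n = sum(observed)
    chi2 = 0.0
    for count, prob in zip(observed, probs):
        expected = n * prob / covered  # rescaled so expected counts sum to n
        chi2 += (count - expected) ** 2 / expected
    return chi2

# Two intervals placed symmetrically about the model mean
exact_fit = chi_squared_statistic([50, 50], [-1.0, 0.0, 1.0], mean=0.0, sd=1.0)
poorer_fit = chi_squared_statistic([60, 40], [-1.0, 0.0, 1.0], mean=0.0, sd=1.0)
print(exact_fit, poorer_fit)
```

In the usual textbook treatment, the degrees of freedom are the number of intervals, less one, less the number of parameters estimated from the data.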
Another way to illustrate the difference
between data and model is to use a ‘probability plot’. From the model option
bar, select
,
then change the display by using the menu bar:

The display will switch to the following,
remembering the model we have already fitted:

This type of graph was first used in the 1940s
and has a special scale along the horizontal axis. The vertical axis is scaled
to the values of our samples – in this case “log10 K”. The horizontal axis is the
percentage of the sampled values which fell below a given value. This is a “cumulative”
graph rather than an interval one like the histogram. For example, 70% of our
samples have values which lie at or below 1.45:
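
The coordinates for a plot of this kind can be sketched as follows. The function name is hypothetical, and the plotting position (rank - 0.5)/n is one common convention assumed here; the package may use a different one. The special horizontal scale corresponds to spacing the percentages according to the standard Normal distribution, so that a Normal sample plots close to a straight line.

```python
from statistics import NormalDist

def probability_plot_points(values):
    """Return (cumulative percentage, standard Normal score, sample value)
    for each sample, in ascending order of value."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()
    points = []
    for rank, value in enumerate(ordered, start=1):
        fraction = (rank - 0.5) / n  # assumed plotting position
        z = standard_normal.inv_cdf(fraction)  # position on the probability scale
        points.append((100 * fraction, z, value))
    return points

# Made-up values for illustration only
points = probability_plot_points([3.0, 1.0, 2.0])
for percent, z, value in points:
    print(round(percent, 1), round(z, 3), value)
```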