Software Tutorial
Session One – Normal Statistics
The example session with the teaching software PG2000, described below, is
intended to familiarise the user with the package. This documented example
takes you through the following sequence of analyses:
· Summary statistics and scatterplots of the data
· Scatterplot of the data using transforms
· Fitting a Normal distribution to the data
There are many other
facilities within the package, which are given as alternative options on the
menus. To start the tutorial, choose PG2000 from your Start menu or desktop icon.
When you run PG2000, a record is kept of
everything you do in that run. The default name for this file is ghost.lis and
the default location for the file is the folder where your copy of PG2000
is kept. The first dialog you will see is:

You may change the
name of the file, or accept the default. Note you must type in the whole name
including extension, since no default extension is offered in this case. For
example, if you want to call your ghost file “myghost.lis” you need to type the
whole name, not just “myghost”.
If you already have
a file with this name, Windows will issue a warning:

Click on
to
specify a new name or
to
overwrite the previous copy of this file.
Your screen should
now show something like:

The output above is
the opening screen. To proceed to data analysis, use one of the menus at the
top of the Window.

As you can see from
the above I have elected to read in a set of sample data by clicking on the
option
and selecting
from
the menu which appears. PG2000 will remember the last five data files
accessed and include these in your options.
I have selected BROOMSBARN.DAT
for my input data file. This is a set of 436 soil samples taken from a farm in
England. The samples are 40 metres apart. Co-ordinates in the data file are
in units of grid spacing, not metres.

Even if you select a
file from the list of previously analysed data files, PG2000 will ask you to confirm
your choice. This is actually a quick way of getting back to your working
directory, since you can change your choice at this point. Be warned, though,
that if you change which file you want to read it must be the same type of file
– that is, if you are reading a standard Geostokos data file, you cannot change
your mind at this point and read in a GeoEAS type file.

For this example, we
will stick with BROOMSBARN. As your data is read in, it is stored on a
working binary file. A progress bar will indicate how far the process has gone.
When data input is complete, your Window should look like the table above.
The layout of data
files is described in detail in the main PG2000 documentation. The routine which has been
used shows the first 10 lines of your data file so that you can check it is
going in OK.
When the data has
been read in you will see that the previously "greyed out" or
inaccessible options on the main window toolbar will become activated. You can
now select an option. Let us decide upon a statistical analysis. To do this,
click on the
option
on the main toolbar.

If you choose the
option, the software will display and summarize the
data set, giving you an idea of what the data look like in
a simpler form than the full numerical listing.
The screen will
switch to a dialog which will prompt you to choose the two variables for the
axes of your graph.
The active screen in
the top left hand corner contains the variables available for analysis in your
data file. The bottom right box shows the variables already chosen (which at
this point is none).
The
dialog box shows you that you are expected to
select variables to be the X co-ordinate and the Y co-ordinate for your
scattergram. The upper left dialog box
lists the variable names as they appeared in
the data file, and is prompting you to choose the variable which will be the X co-ordinate
on the graph. For this example, let us choose “K” (potassium) for the X
co-ordinate. You need to check the box next to the “K” option.
Upon selecting the
“K” option, a new dialog box will appear asking you whether you wish to
transform the variables to logarithms or rank transforms. In this case we do
not wish to transform so we click on
. The dialog disappears and you will be asked
for the Y co-ordinate:

I selected the “P”
(phosphorus) option by clicking its check box. The transformation dialog again
appears, from which we choose not to transform the variable by clicking on
.
The lower dialog
moves up to the top left and displays your current working variables.

The
and
buttons have now been activated. If you change
your mind at this point, simply click on the
button
and you will be returned to the original dialogs.
Clicking
will
show you your scattergram. The scattergram is scaled to fit the whole of the
display box or area.

Please note that
even though you have chosen 'geographical' variables, the scale chosen is for
the maximum display size. If you want points plotted on a 'geographical' scale
(same for both axes) you must use the post-plotting routine which is available
elsewhere in PG2000.
In the left-hand box
of the graphical display, you will see the summary statistics for both
variables plus the product moment correlation coefficient and the number of
samples for which both variables were available. We can see from this graph
that both variables tend to be “skewed” with a preponderance of lower values
and a long scatter out into higher values.
When the graph is
completed, you can select a new option from the main toolbar. You may wish to
plot another graph in which case you must click on
and
select the
option
again.
Scattergram 2, using transformations
To illustrate the
use of the transformations for the variables, we draw another graph using the
variables on this file “Log10 K” and “Log10 P”. It is obvious that the person
constructing this data file was aware of the “skewness” of the K and P
measurements as illustrated in the previous scattergram. By adding a column of
logarithms, the analyst hopes to make the scatter more symmetric.
Upon selecting the
option your screen should show:

PG2000 will remember your previous selection. Since
you are redefining your variables, you must click on
to
redefine your variables. You will again be asked to select the X co-ordinate
and the Y co-ordinate. For the first variable we simply take logarithms. For
the second we add a constant to the variable so that the transformation
actually becomes
For the X
co-ordinate check the corresponding box of the “Log10 K” option.
The transformation
dialog again appears, from which we choose not to transform the variable by
clicking on
. For the vertical axis, choose “log10 P” and
no transformation.
Verify your choices
as prompted:

A scattergram of
these two variables will be produced, with a table of statistics on the left
hand side.

Of course, we could produce a virtually identical plot
without using the extra columns in the data file, by using the logarithmic
transform available within the software. The major difference is that PG2000 uses natural logarithms – loge or ln where e=2.718282 – not logarithms to the base 10. Repeat the above sequence
of actions, but selecting the basic variables and logarithmic transform.
Select the
option. Your screen should show:

PG2000 remembers your previous selection. Since you
are redefining your variables, you must click on
to
redefine your variables. You will again be asked to select the X co-ordinate
and the Y co-ordinate.
For the X
co-ordinate check the corresponding box of the “K” option and then click on
“take natural logarithms” in the
dialog.
PG2000 also
allows a variation of logarithmic transform which includes an “additive
constant”. If you are interested in this option, please refer to the Tutorial
on lognormal statistics. Click on
to
confirm that you want logarithmic transform. You can still cancel this option
by clicking on
instead of
. For
the vertical axis, choose “P” and logarithmic transformation.
Note the difference
in the names supplied for your variables. Verify your choices as prompted:

A scattergram of
these two variables will be produced, with a table of statistics on the left
hand side.

Note
that the overall picture is identical whether you use log10 or natural
logarithms. The values will differ by a factor of log10(e) or loge(10) (0.434294 or 2.302585 respectively), depending on
which way round you look at them. The correlations are identical for both
logarithmic transforms.
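The relationship between the two kinds of logarithm can be checked with a few lines of Python. This is only an illustrative sketch: the sample values are made up (they are not taken from BROOMSBARN.DAT) and the helper name `pearson` is our own, not part of PG2000.

```python
import math

def pearson(xs, ys):
    """Product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up values for illustration only (assumption, not the Brooms Barn data).
k = [14.0, 18.0, 22.0, 26.0, 35.0, 57.0, 96.0]
p = [1.2, 2.5, 2.1, 4.0, 6.3, 5.8, 12.0]

log10_k = [math.log10(v) for v in k]
log10_p = [math.log10(v) for v in p]
ln_k = [math.log(v) for v in k]
ln_p = [math.log(v) for v in p]

# Every natural logarithm is the base 10 logarithm multiplied by ln(10).
factor = ln_k[0] / log10_k[0]

# Rescaling both axes by a constant leaves the correlation unchanged.
r_log10 = pearson(log10_k, log10_p)
r_ln = pearson(ln_k, ln_p)

print(f"ln(10) = {math.log(10):.6f}, log10(e) = {math.log10(math.e):.6f}")
print(f"factor between the two transforms = {factor:.6f}")
print(f"correlation (log10) = {r_log10:.6f}, correlation (ln) = {r_ln:.6f}")
```

Because one transform is just a constant multiple of the other, only the axis labels and the summary statistics change; the shape of the scattergram does not.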
Statistics from original variables K and P
Statistics from original variables log10 K and log10 P
Statistics from logarithms of original variables K and P
Looking at descriptive statistics and histograms
To illustrate the
use of histograms and descriptive statistics, we use the “K” variable in the
BroomsBarn data set. Select the
menu
and click on the option
:
Choose the
“measurement to be analysed” in the same way as previously:
Click in the box
next to “K” and the transformation options will be offered:
For the moment we
will make no transformation of the values, so click on
. As usual, you will be asked to confirm your
choice of variable:

Click on
. Various summary statistics will be shown in
two dialogs:

The table in the top
left hand corner shows the usual descriptive statistics, with one small
exception. The ‘higher order’ statistics – standard deviation, skewness and
kurtosis – are divided by (n-1) and not n, the number of samples. That is:
statistic | formula
Arithmetic mean | = sum of values divided by n
Variance (square of standard deviation) | = sum of each (sample value – average)² / (n-1)
Skewness | = sum of each (sample value – average)³ / (n-1), divided by standard deviation cubed
Kurtosis | = sum of each (sample value – average)⁴ / (n-1), divided by standard deviation to the power 4 (or variance squared)
Coeff. of variation | = standard deviation divided by arithmetic mean
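For readers who want to check the figures by hand, the (n-1) versions of the higher order statistics can be sketched in Python as follows. The function name `summary_statistics` is our own, and this is a reading of the description above rather than PG2000's actual code.

```python
import math

def summary_statistics(values):
    """Mean, variance, standard deviation, skewness and kurtosis,
    with the higher order statistics divided by (n-1) rather than n."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    variance = sum(d ** 2 for d in deviations) / (n - 1)
    sd = math.sqrt(variance)
    # Third and fourth moments, standardised by powers of the standard deviation.
    skewness = sum(d ** 3 for d in deviations) / (n - 1) / sd ** 3
    kurtosis = sum(d ** 4 for d in deviations) / (n - 1) / sd ** 4
    return {
        "n": n,
        "mean": mean,
        "variance": variance,
        "standard deviation": sd,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }

# A small symmetrical data set (made up for illustration).
stats = summary_statistics([1.0, 2.0, 3.0, 4.0, 5.0])
for name, value in stats.items():
    print(f"{name}: {value}")
```

Other packages may divide by n, or apply further small-sample corrections, so do not expect exact agreement between different software on skewness and kurtosis.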
In an ideal
universe, where the population would follow a Normal distribution, the mean and
standard deviation (divided by n-1) of the samples are ‘best’ estimates for the
mean and standard deviation of that population. The skewness statistic would be:
· zero for a symmetrical data set
· positive if there were more samples in the lower values and a long tail to the high values
· negative if the samples are concentrated in the high values with a long tail to the lower values
We standardise by
the standard deviation cubed, to remove the original variability of the samples
and to obtain a statistic which actually reflects shape rather than spread.
Similarly with the kurtosis. An ideal Normal distribution has a kurtosis of 3.
A value less than 3 suggests that the shape of the histogram will be flatter
than the ideal.
Note: some software packages subtract the 3 from the kurtosis statistic, so
that negative values may be encountered!
The coefficient of
variation is also a (more empirical) measure of skewness BUT only for positive
skewness and only if the values cannot take negative values. This implies that
the statistic is, for example, useless when using a logarithmic transform where
values can be negative.
Defining the necessary parameters for a histogram
A histogram is a
graph which shows how the values vary amongst our samples. The graph shows
value along the horizontal axis, which should (therefore!) reflect the range of
our data values. The vertical axis is, technically, “frequency density”. This
is not the actual number of samples within a defined interval, but the number
divided by the width of the interval. The difference is pretty academic if your
histogram intervals are all the same size.
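The distinction between frequency and frequency density can be sketched in Python. The function name and the convention that each interval includes its lower edge but not its upper edge are assumptions made for this illustration; the tutorial does not say how PG2000 treats values that fall exactly on a boundary.

```python
def frequency_density(values, lower_edge, width, n_intervals):
    """Count the samples in equal-width intervals and divide by the width.

    Interval i covers lower_edge + i*width up to (but not including)
    lower_edge + (i+1)*width. Values outside the histogram are ignored.
    """
    counts = [0] * n_intervals
    for v in values:
        index = int((v - lower_edge) // width)
        if 0 <= index < n_intervals:
            counts[index] += 1
    densities = [c / width for c in counts]
    return counts, densities

# Made-up whole-number values, grouped into intervals of width 2 from 13.
counts, densities = frequency_density([13, 14, 14, 16, 17, 20], 13, 2, 4)
print(counts)     # number of samples in each interval
print(densities)  # the same numbers divided by the interval width
```

With equal intervals the densities are simply the counts divided by a constant, so the shape of the histogram is the same either way.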
The software offers
you default parameters for constructing the histogram, based on a simplistic
assumption of basic Normality of the population. The average value is placed at
the centre of the horizontal axis (values). The number of intervals is
calculated as n/10 – or 12 if this comes out smaller than 12. The width of the
intervals is selected to give a range of around 2 or so standard deviations
either side of the average value. If the first interval falls lower than the
lowest sample value, this is adjusted to be a little more sensible. For the “K”
values in the Brooms Barn data set, the basic statistics convert to histogram
parameters as follows:
Of course, there is
no guarantee that these default parameters are at all sensible. For example, a
brief inspection of the Brooms Barn data shows that “K” is only measured in
whole numbers. It seems a little silly, then, to choose a histogram interval of
1.2! If we amend this width to 1, we will have 43 intervals of 1 added to the
lowest interval value of 14. The highest value shown on the histogram will be
57 – but the data values go up to 96. For a more sensible interval on this run,
choose 2. I have also adjusted the lowest interval to 13 instead of 14. Our
final options look like:
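The arithmetic behind these choices can be checked with a short Python sketch. The function names are our own, and rounding n/10 down to a whole number is an assumption (it matches the 43 intervals suggested for 436 samples).

```python
def default_interval_count(n_samples):
    """n/10 (rounded down, by assumption), or 12 if that comes out smaller."""
    return max(n_samples // 10, 12)

def histogram_upper_limit(lowest, width, n_intervals):
    """Highest value shown on a histogram with equal-width intervals."""
    return lowest + width * n_intervals

n_intervals = default_interval_count(436)                 # Brooms Barn has 436 samples
top_width_1 = histogram_upper_limit(14, 1, n_intervals)   # width 1, starting at 14
top_width_2 = histogram_upper_limit(13, 2, n_intervals)   # width 2, starting at 13

print(n_intervals, top_width_1, top_width_2)
```

With a width of 1 the histogram stops well short of the largest "K" value of 96, whereas a width of 2 starting at 13 takes the axis past it.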

Accepting these
parameters results in a new menu bar appearing at the top of your screen:
. Select
and
choose
:

This will result in
a full screen picture of the histogram, with the associated statistics still in
the top left hand corner:

We can see that the
histogram is skewed towards the left hand side of the graph, with more values
squashed between 12 and 26 than between 26 and 96. This shape gives the
positive skewness of just over 2 and a kurtosis around 4 times that of the
ideal Normal distribution. If we look at the logarithms of the sample values,
we hope to stretch out the lower end and squash in the upper end. With this
data set, we can look at the column “Log10 K” – alternatively we could use the
natural logarithm transform available within the software.
In any case, we need
to start again by selecting
, then select the
menu
and click on the option
. Using the same procedure as before, change
from “K” to “log10 K”. The new summary statistics are shown, as before. Note
that the skewness is now less than 0.4 and the kurtosis is just under 3.6.
The suggested
histogram parameters are 43 intervals, starting at 1.2 and with a width of
0.016. The logarithmic (base 10) values vary from 1.0792 to 1.9823. The default
histogram is shown as the first graph below. Alongside, we show three
alternative histograms just changing the interval width each time.
accepting default histogram parameters | choosing an interval width of 0.1
interval width at 0.05 | interval width of 0.025
Of the four above,
the interval width at 0.05 seems to compromise between “lots of intervals, lots
of detail” and getting a real idea of what the shape of the population might be. If we want to do
any statistical inference or estimation, it is the shape of that population we
have to predict. In an ideal world, we would want to have a
menu,
select and click on:

You will be offered
a set of choices:
For now, just click
on
. A new dialog appears showing the mean and
standard deviation as estimated from your sample data:
.
If you click on
, the software will superimpose a perfect
Normal distribution with this mean and standard deviation (see upper graph on
the next page). Note the χ² (chi-squared) goodness of fit statistic of
16.72 with 10 degrees of freedom. Checking with any table of χ² statistics – for example, Table 3 in
Practical Geostatistics – shows that, with 10 degrees of freedom, a statistic
of 15.99 and over would be encountered 1 time in 10 if the samples really were drawn from a Normal population.
If you click on
, the software will find the mean and
standard deviation of the Normal distribution which most closely fits your
histogram (see lower graph on the next page). The software had 5 ‘iterations’
(attempts) at fitting a better model, before coming up with this solution.

above:
below: best fit (least squares)

The arithmetic mean
of the
Bear in mind, that
we do not expect an exact match between data and model – unless we have very
many samples in our data set. To use statistical inference, we assume that the
samples we have are drawn from a much larger population “at random and
independently”. If this is true, the χ² goodness of fit statistic should vary
around the value of the ‘degrees of freedom’. A χ² goodness of fit statistic which is too
small is just as worrying as one which is too large, suggesting that sampling
has been influenced to produce an idealised histogram.
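To see where a goodness of fit statistic of this kind comes from, here is a minimal Python sketch of the usual chi-squared calculation for a histogram against a Normal model. It is an illustration under stated assumptions, not PG2000's own routine: the function names are ours, intervals are equal in width, and no pooling of sparsely filled tail intervals is done (the tutorial does not say how the software handles those, or how it counts degrees of freedom).

```python
import math

def normal_cdf(x, mean, sd):
    """Proportion of a Normal population lying at or below x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def chi_squared_statistic(counts, lower_edge, width, mean, sd):
    """Sum over intervals of (observed - expected)^2 / expected."""
    n = sum(counts)
    total = 0.0
    for i, observed in enumerate(counts):
        left = lower_edge + i * width
        right = left + width
        # Expected number of samples in this interval under the Normal model.
        expected = n * (normal_cdf(right, mean, sd) - normal_cdf(left, mean, sd))
        total += (observed - expected) ** 2 / expected
    return total

# Toy example: 20 samples in two intervals, compared with a standard Normal.
statistic = chi_squared_statistic([10, 10], -1.0, 1.0, 0.0, 1.0)
print(f"chi-squared = {statistic:.4f}")
```

The resulting number is then compared with the degrees of freedom, as described above: much larger suggests a poor fit, much smaller suggests a histogram that is suspiciously tidy.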
Another way to
illustrate the difference between data and model is to use a ‘probability
plot’. From the model option bar, select
, then change the display by using the menu
bar:

The display will
switch to the following, remembering the model we have already fitted:

This type of graph
was first used in the 1940s and has a special scale along the horizontal axis.
The vertical axis is scaled to the values of our samples – in this case “log10
K”. The horizontal axis is the percentage of the sampled values which fell
below a given value. This is a “cumulative” graph rather than an interval one
like the histogram. For example, 70% of our samples have values which lie at or
below 1.45:

Notice, that if we
read the value from the line (Normal model) rather than the symbol (data) we
get a slightly higher value on the vertical axis. This is the difference
between the model and the data and this is what the software minimises to get
the “best fit”.
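The comparison between the data and the model on a probability plot can be sketched in Python. The function name is hypothetical, and the plotting position used here, (rank - 0.5)/n, is only one common convention chosen for illustration; the tutorial does not state which one PG2000 uses.

```python
from statistics import NormalDist

def probability_plot_points(values, mean, sd):
    """For each sample, return (percentage at or below, data value, model value).

    The model value is what a Normal distribution with the given mean and
    standard deviation predicts at the same cumulative percentage.
    """
    model = NormalDist(mean, sd)
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for rank, value in enumerate(ordered, start=1):
        percent_below = 100.0 * (rank - 0.5) / n   # assumed plotting position
        model_value = model.inv_cdf(percent_below / 100.0)
        points.append((percent_below, value, model_value))
    return points

# Three made-up values compared with a Normal model of mean 2 and sd 1.
points = probability_plot_points([1.0, 2.0, 3.0], 2.0, 1.0)
for percent, data_value, model_value in points:
    print(f"{percent:5.1f}%  data {data_value:.3f}  model {model_value:.3f}")
```

The gap between the data value and the model value at each percentage is the vertical distance between symbol and line on the plot.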
One of the
advantages of the probability plot is that we do not have to group our data
into intervals as we do with the histogram. If we have fewer than 500 samples
(software restriction), we can get far more detail in our probability plot by
posting every sample separately. Click on
, then change the display by using the menu
bar:

This option returns
you to the basic statistical summary and the histogram parameter dialog with
all the original defaults:

If you have not
noticed it before, there is a large ‘button’ which says:
.
You can use this for
any data set with fewer than 500 samples. For larger data sets it is greyed out
and you have to specify histogram intervals. Clicking on this button returns
us to the menu bar:

Notice that the two
histogram options have greyed out because we did not group the data into
intervals.
Selecting
now
results in the graph on the next page. Each symbol now represents one sample,
rather than one histogram interval. Notice the rounding on the original data,
which results in many samples with exactly the same sample value. Notice also
the deviations in the ‘tails’ of the graph:
· at the lower end the measured values are a little higher than ideal, suggesting that there may be problems with measuring low concentrations;
· at the upper end the measured values are also a little higher than ideal, with a noticeable break between 45 and a little over 50 in the original “K” units. Referring back to our scattergrams, you can clearly see this blank space in the graph of “K” versus “P”.

It would definitely
be worth reviewing the data set for those samples with “K” over 45 to check why
there is this gap between the rest of the samples and those ones.
Otherwise, the main
body of the data seems to conform nicely to the Normal distribution and we can
be confident that any statistical inference based on Normality assumptions can
be applied to “log10 K”.
You might want to
try repeating this exercise with the “P” values and with “pH”. Be prepared for
surprises with “pH”!
Finishing up
Clicking on the
button
will pass you back to the main menu. To finish this run of the program, select:

Clicking on this
menu item or on
will
end your run with the software. You will see the closing down dialog box:

The above Tutorial
session should serve only to illustrate a possible use of the various routines
from PG2000.
Try running the program again, choosing your own responses. Try reading in one
of the other data files which are provided, say, samples.dat.
General Notes
There are a few
points which you may have noted in following the Tutorial session above. Most
of the routines communicate between themselves, without you having to worry
about getting the right information from one to the other. For example, after
you read in the complete contents of the data file, the routines ask which of
the variables you actually want to analyse. This information is then stored
internally and may be accessed by any of the other routines. When we went from
plotting graphs of one variable against another to fitting a distribution, the
routines knew that you had selected some variables, but that these were
inappropriate for the new analysis. On the other hand, repeating the
scattergram request, the routine suggested that you could continue to use the
same choice of variables. This is a feature of most of PG2000, in that it will recall
what you chose previously and ask whether this is to change or not.
PG2000 does not distinguish between upper and lower
case letters, so you may type in whatever you find most pleasing. When the
program requires a numerical answer, your input will be checked to make sure
that it is actually a number. If you type in any illegal characters and press
ENTER, the checking routine will filter out the unacceptable characters which
you type. It should be noted that, if the routine is expecting a whole number
then a decimal point is unacceptable. Much of the numerical input is checked
for valid values.
A copy of this run
should have been made on a file called GHOST.LIS unless you changed the name at the beginning
of the run. Send this file to your printer if you want a record of the analysis
or look at it with Wordpad or Notepad.
PG2000 — like any computer software — is not
completely error-free. Neither is it fool-proof. You can always get out of the
software by right clicking on the Taskbar. This will invoke the 'End Task'
facility to close the Window without damaging the rest of your system. If you
cannot figure out what went wrong, note down as much information as you can
about the program you were running, the data you were using and exactly where
it broke down. Contact your supplier locally or Geostokos direct for
assistance, software@kriging.com.
Send us the ghost.lis file and (if you can) the data you were analysing at the time.