Tutorial for preliminary spatial analysis using Practical Geostatistics 2000 teaching software


Software Tutorial --- Spatial Visualisation

One of the most neglected aspects of statistical analysis --- especially of spatial data --- is the purely visual assessment of the sample data. The example session with PG2000 described below is intended to familiarise the user with the package. It illustrates one possible set of analyses which may be carried out, and takes you through the following sequence:

{#} Reading in a data file
{#} Post plotting the sample data
{#} Finding the nearest neighbour distribution for inter-sample distance
{#} Inverse distance interpolation mapping

There are many other facilities within the package, which are given as alternative options on the menus. To start the tutorial, choose PG2000 from your Start menu. See Tutorial One for notes on starting a PG2000 run and specifying the ghost file.

Reading in a data file

As you can see from the above I have elected to read in a set of sample data by clicking on the option and selecting from the menu which appears. PG2000 will remember the last five data files accessed and include these in your options. Three input file types can be read in. I will read in a standard Geostokos data file.

I will select WOLFCAMP.DAT for my input data file. This is a set of 85 samples of hydrogeological data taken from the Wolfcamp aquifer in Northwestern Texas. The co-ordinates are in miles and the other measured variable is the Potentiometric pressure (or head) within boreholes intersecting the aquifer. The units of the variable are in feet above sea level.

The routine which reads in the data shows the first 10 lines of your data file so that you can check it is going in OK. The routine also checks whether we actually had the correct number of samples on the file and informs you if there is any discrepancy.

Even if you select a file from the list of previously analysed data files, PG2000 will ask you to confirm your choice. This is actually a quick way of getting back to your working directory, since you can change your choice at this point. Be warned, though, that if you change which file you want to read it must be the same type of file – that is, if you are reading a standard Geostokos data file, you cannot change your mind at this point and read in a CSV type file.

readat_dialog

For this example, we will stick with WOLFCAMP. As your data is read in, it is stored on a working binary file. A progress bar will indicate how far the process has gone. When data input is complete, your Window should look like the table above.


Displaying the data

When the data has been read in, you will see that the "greyed out" options on the main menu bar will be activated. We use the menu bar to select an option, say:

This time we have chosen to display and summarise the data set in a spatial sense. A post plot is a map showing the locations of the samples. Each sample will be coloured and shaded according to the value of a selected variable. Since we are analysing the Wolfcamp data, choice of variables should prove fairly simple!

The screen will prompt you to choose the three variables for the analysis. You will see two dialog boxes: the one in the top left hand corner lists the variables available for analysis in your data file; the bottom right box shows the variables already chosen (at this point, none!).

The routine needs to have information on the position of the samples and on the value at each sample location. This particular data file only contains three variables. However, PG2000 does not know (as yet) which of these variables is which.

There is a lot of information on the screen. At the bottom of the Window, you see the "status bar" which shows the name of the current data file and the title read from that file. The "already chosen" dialog box shows you that you are expected to select variables to be the "X (east/west) co-ordinate", "Y (north/south) co-ordinate" and "Measurement to be analysed" for your post plot. The upper left dialog box lists the variable names as they appeared in the data file and is prompting you to choose the variable which will be the "X co-ordinate" on the map. For this example, let us choose "Easting" for the X co-ordinate:

wolfcamp_X

We may then choose "Northing" for the Y co-ordinate:

wolfcamp_Y

Finally, we must choose the variable to be analysed and state any relevant transformations to be made. For this data we require no transformation of the variable "Potentiometric Level", so click on .

wolfcamp_value transform_dialog

The dialog now shows the complete set of chosen variables and has moved to the upper left corner. You have the option to change your mind here by clicking on .

acceptable_variables

This choice of variables is acceptable, so click on to proceed. This may seem tedious to you at the moment, but (later) try running the program with another set of data with more variables. Or try a data set where the columns are in a different order. The PG2000 input routine has been written to allow you this flexibility in building your data files.

The software will suggest contour levels for the shaded plot.

contour_levels

These may be altered as you desire. Click on to proceed with the mapping. To plot a map, you have to specify the area which you wish to be mapped. PG2000 will offer a default rectangular area which covers all of the sample locations. You may accept this default or you may prefer a different rectangle or an irregularly shaped area. The last choice can only be made if you have already stored the boundary of the area as a set of vertices of a polygon. The current version of PG2000 can handle polygonal boundaries with up to 500 vertices.

default_boundary
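The default rectangle described above is simply the bounding box of the sample co-ordinates. The following Python sketch shows the idea; the function name `default_rectangle` and the sample co-ordinates are hypothetical and are not taken from PG2000 or from the Wolfcamp file.

```python
def default_rectangle(points):
    """Return (min_x, max_x, min_y, max_y) covering every (x, y) pair.

    Hypothetical helper for illustration; not PG2000's own routine.
    """
    if not points:
        raise ValueError("need at least one sample location")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), max(xs), min(ys), max(ys)


# Made-up co-ordinates, for illustration only.
samples = [(-10.0, 5.0), (20.0, 40.0), (3.5, -2.0)]
print(default_rectangle(samples))
```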

Clicking on the polygonal "radio button" will cause the software to ask you for a file containing the boundary information. You may have up to 500 vertices stored on a file. They may be stored either clockwise or anti-clockwise and the polygon does not have to be "closed". The default name for the boundary line file is that of the original data file plus the extension BLN -- Boundary LiNe.
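To see why the vertex order and the "closing" vertex need not matter, consider a standard even-odd (ray casting) test for whether a point lies inside a polygon. This is a generic sketch in Python, not PG2000's own code; the function name `inside_polygon` is hypothetical.

```python
def inside_polygon(x, y, vertices):
    """Even-odd ray casting test (illustrative, not PG2000's routine).

    The last vertex is joined back to the first, so the polygon need not
    be explicitly closed, and the result does not depend on whether the
    vertices run clockwise or anti-clockwise.
    """
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(inside_polygon(5, 5, square))
print(inside_polygon(5, 5, list(reversed(square))))
print(inside_polygon(15, 5, square))
```

Repeating the first vertex at the end of the list adds only a zero-length edge, which the straddle check skips.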

Accepting this boundary returns us to the area definition dialog.

wolfcamp_boundary

You may respecify the minimum and maximum X and Y values at this point. The defaults given in the dialog are the full extent of the polygonal boundary. However, there are times when you might want to have the estimated grid points on some regular grid starting at a standardised value. For example, changing "Minimum Y value" to 0 would mean that the bottom left hand corner of the grid used would be at X = –135, Y = 0. Click on to accept this boundary definition.

If you only wanted to see a subsection of the data (equivalent to zooming in) you could specify a boundary for a smaller rectangle or polygon. For example:

will read in a boundary from the file county.bln which is a small area within the Wolfcamp study area. The Wolfcamp data set covers the Texas panhandle and a little of New Mexico. The county boundary is a fictitious area of interest in the Deaf Smith area west of Hereford, Texas. The mapping parameters dialog will then look like:

county_boundary_dialog

Clicking on the button will allow PG2000 to plot the sample data.

You can copy the plot with and paste it into another application. Some systems (notably Windows NT) require pressing . This will place a copy of the Window in the clipboard. You can then paste the picture into a word processing application such as Microsoft Word, a spreadsheet application like Lotus or Excel, or many other applications, such as MSPaint. The text information is also copied to the GHOST.LIS file.

Sample layout - nearest neighbour analysis

When analysing spatial data, one of the most important types of information we need is the spacing between the samples. This will help us to choose search radii in estimation routines so as to balance density of sampling against computation time. A large search radius will ensure the inclusion of large numbers of samples. However, if too large a radius is selected, the software will spend more time in eliminating the excess samples than in finding the relevant ones.

The inter-sample distance is also useful in determining the grouping intervals for experimental semi-variogram calculation. If the sampling is extremely irregular, it may be difficult to establish an optimum distance interval empirically.

A third use of "nearest neighbour" analysis is in the identification of duplicate sampling before kriging. The kriging routines in PG2000 assume that you wish estimation to "honour" the sample data. This is difficult to do if you have two samples at the same location! You should also bear in mind the computational efficiency of micro-computers. Most PC software works with an effective precision of around 8 or 9 significant figures. If your co-ordinates are in the millions (such as in the LO system) the computer will not be able to distinguish between samples less than, say, half a metre apart. Whilst PG2000 provides a facility for "stripping" the redundant leading digits, this may still lead to problems.
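The precision problem can be illustrated with a short Python sketch. It assumes that co-ordinates are held in IEEE 754 single precision, which is an assumption made here for illustration and is not stated for PG2000; the co-ordinate values are made up.

```python
import struct


def to_single(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Two hypothetical co-ordinates 0.2 m apart, with values in the millions.
a = 6_000_000.0
b = 6_000_000.2

# In single precision the two locations become indistinguishable.
print(to_single(a) == to_single(b))  # True

# Stripping the redundant leading digits first keeps the difference.
print(to_single(b - 6_000_000.0))
```

The subtraction has to be done before the values are reduced to single precision; once the digits have been lost, stripping cannot bring them back.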

A routine, then, is provided for calculating and storing (on request) nearest neighbour distances between sample locations. Remember that this type of calculation will take roughly twice the time of a corresponding semi-variogram analysis, since it must pair every sample up with every other sample --- both ways. Selecting nearest neighbour analysis:

The routine needs two co-ordinates for the sample locations. If you have not chosen any variables before this, you will have to select "X co-ordinate" and "Y co-ordinate" as described previously. Since we already have these variables selected, we will be offered the choice to keep them:

acceptable_X_and_Y

Click on to continue with these variables. You will be prompted for the "threshold" distance defining when samples are too close together.

For this illustration we have chosen the value of 0.1 mile as our criterion for samples being too close together. If you have also elected to store the results on a file, the default extension for nearest neighbour files will be .NND. Progress bars will indicate how the calculations are going. Don’t get too impatient. This is one of the most time consuming exercises available in PG2000.
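The calculation can be sketched as a brute-force search in Python. This is an illustration of the idea only, not the PG2000 routine; the function names and the co-ordinates are hypothetical.

```python
import math


def nearest_neighbours(points):
    """For each point return (index_of_nearest, distance), by brute force.

    Every sample is paired with every other sample, both ways, which is
    why the cost grows with the square of the number of samples.
    """
    result = []
    for i, (xi, yi) in enumerate(points):
        best_j, best_d = None, math.inf
        for j, (xj, yj) in enumerate(points):
            if i == j:
                continue
            d = math.hypot(xi - xj, yi - yj)
            if d < best_d:
                best_j, best_d = j, d
        result.append((best_j, best_d))
    return result


def too_close(points, threshold):
    """List (sample, neighbour, distance) where the distance is below threshold."""
    return [(i, j, d)
            for i, (j, d) in enumerate(nearest_neighbours(points))
            if d < threshold]


# Made-up locations in miles; the first two are 0.05 mile apart.
locations = [(0.0, 0.0), (0.05, 0.0), (10.0, 10.0), (30.0, 5.0)]
for i, j, d in too_close(locations, threshold=0.1):
    print(f"sample {i} is {d:.2f} mile from sample {j}")
```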

When all the nearest neighbours have been identified --- and stored on file if requested --- various options will be offered.

If you feel you have set the threshold distance for "duplicate" locations too low (or too high!), you have an option to .

If you choose the option a histogram is constructed and displayed showing the distribution of inter-sample distances.

The Window also gives various summary statistics. For example, the average distance to the nearest sample is over 9 miles. However, there are several pairs of samples closer than 1 mile and some where the nearest well is over 20 miles away.

As an illustration, I repeated the analysis with a threshold distance of 1 mile(!). Three samples were found to have nearest neighbours closer than one mile. These are coloured red if you have selected minimal or full graphics.

The buttons at the top of the screen are modified:

Because there are now sample pairs within the defined "too close together" distance, you are offered the option: . A new dialog displaying the relevant information is given:

The grid in the dialog shows which samples (by number) were giving problems and the co-ordinates of the first of these samples. The actual distance between the two is also given, so that you can check whether they are really duplicates or just close together. In this case it is clear that the samples are not (in any sense) duplicates.

Should you really have cause for concern, you may create a new data file with these samples eliminated from the data set. It is, obviously, preferable for you to review your data and find out just why you have duplicated samples. If this is a normal part of your type of data, you will have to do something about this before you try any geostatistical estimation such as kriging. One option available within PG2000 is to "decluster" the data by averaging into small rectangular cells. This option is available on the same menu as the other data manipulation routines.
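Declustering by cell averaging can be sketched as follows in Python. The details are assumptions made for illustration: cells are aligned to the origin, and both the co-ordinates and the measurement are averaged within each cell. PG2000's own routine may differ, and the function name `decluster` is hypothetical.

```python
import math
from collections import defaultdict


def decluster(samples, cell_x, cell_y):
    """Average (x, y, value) samples falling in the same rectangular cell.

    Assumes cells aligned to the origin; returns one averaged sample per
    occupied cell, in sorted cell order.
    """
    cells = defaultdict(list)
    for x, y, value in samples:
        key = (math.floor(x / cell_x), math.floor(y / cell_y))
        cells[key].append((x, y, value))
    averaged = []
    for key in sorted(cells):
        members = cells[key]
        n = len(members)
        averaged.append(tuple(sum(m[k] for m in members) / n for k in range(3)))
    return averaged


# Made-up data: two samples share a 1 x 1 cell, the third is on its own.
data = [(0.1, 0.1, 10.0), (0.3, 0.2, 20.0), (5.0, 5.0, 30.0)]
for x, y, value in decluster(data, 1.0, 1.0):
    print(f"{x:.2f} {y:.2f} {value:.1f}")
```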

If you request an output file with the duplicates eliminated, the software will prompt for a name for the new file whose default extension will be .DAT. Because of this, no default name is provided. This is in order not to overwrite your original data file by mistake. The problem samples will still be on the new data file, but all the measurements will have been replaced by "missing" values. This is a rather unsubtle way of making sure that you do not disrupt your kriging system with duplicate samples.

Mapping with inverse distance weighting

Back to the main menu:

Interpolating a grid of points will produce a sketch map of the sample values. This map reflects the actual values measured at the actual sample locations and uses a weighted average estimator for grid points which have not been sampled. Weights are chosen as follows:

{#} the distance between sample and unsampled point is calculated;

{#} a selected function of that distance is calculated;

{#} weights are distributed between the samples according to this distance function (total weights add up to one).
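The three steps above can be sketched in Python as follows. The weighting function used here is inverse distance raised to a power; the function names and the `power` parameter are hypothetical and the sample values are made up, so treat this as an illustration of the principle, not as PG2000's implementation.

```python
import math


def idw_weights(x, y, locations, power=1.0):
    """Normalised inverse distance weights for a point that is not a sample."""
    raw = [1.0 / math.hypot(x - sx, y - sy) ** power for sx, sy in locations]
    total = sum(raw)
    return [r / total for r in raw]  # weights add up to one


def idw_estimate(x, y, samples, power=1.0):
    """Weighted average of (x, y, value) samples at an unsampled point."""
    for sx, sy, value in samples:
        if sx == x and sy == y:
            return value  # "honour" the sample at a sampled location
    weights = idw_weights(x, y, [(sx, sy) for sx, sy, _ in samples], power)
    return sum(w * value for w, (_, _, value) in zip(weights, samples))


# Made-up samples: two wells 2 miles apart.
wells = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
print(idw_weights(0.5, 0.0, [(0.0, 0.0), (2.0, 0.0)]))
print(idw_estimate(0.5, 0.0, wells))
```

A point 0.5 mile from the first well and 1.5 miles from the second gives the nearer well three times the weight of the farther one.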

PG2000 will remember everything which has been defined during this run. So far, we have defined: which variables we have been analysing and a boundary for the area being estimated:

acceptable_variables

The routine also needs to know whether you want the results stored on a "grid" file:

The default name for a grid file is the original data file name with the extension .GID. PG2000 will suggest contour levels based on the variability of the sample values.

You can change these if you so desire. Alternatively you can run with the default contours and draw prettier maps by reading the grid files back in. Please note that "grid" files are not in the same format as "data" files. If you want to read them back in, you must use the option:

This option may be greyed out in teaching or demo versions of the software.

The software offers several alternative inverse distance weighting functions:

Click on the relevant button to make your choice. I chose simple inverse distance for this illustration.

If you choose or the lower boxes will activate so that you can specify the power you require:

Once you have selected your weighting function, you need to define search parameters and the area which is to be studied. The neighbouring samples will be used to produce an estimate at each unsampled grid point. Before we can go any further, we need to define the "neighbourhood". That is, how far do we want the software to search for samples to be included in the estimation process.

PG2000 cannot guess what an appropriate search radius would be. As a simple default, we choose an area which will contain (on average) 20 samples. This is found by simply dividing the rectangular area around the samples by the number of samples and then multiplying this area by 20. Finding the radius of the circle with this area produces a likely search radius. When the value at a specified grid point is being estimated, all samples within this circle of the point will be used in the estimation process. If there are too many samples within this circle, those closest to the "unsampled" location will be selected.
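The default search radius calculation can be written out as a short Python sketch. The function and parameter names are hypothetical, and the rectangle and sample count in the usage lines are made up.

```python
import math


def default_search_radius(min_x, max_x, min_y, max_y, n_samples, target=20):
    """Radius of the circle expected to hold `target` samples on average."""
    if n_samples <= 0:
        raise ValueError("need at least one sample")
    rectangle_area = (max_x - min_x) * (max_y - min_y)
    area_per_sample = rectangle_area / n_samples
    circle_area = area_per_sample * target
    # Area of a circle is pi * r**2, so r = sqrt(area / pi).
    return math.sqrt(circle_area / math.pi)


# Made-up example: 100 samples in a 100 x 100 rectangle.
print(round(default_search_radius(0.0, 100.0, 0.0, 100.0, 100), 2))
```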

In this run, we already defined a boundary of interest to us. If you wish to change this boundary and, say, look at the whole Wolfcamp area, simply click on .

Once you have chosen the area to be studied, you must define the grid spacing to be used. Points will be calculated at each grid node and represented on the screen as a shaded rectangle of the appropriate size.

Since we have not previously specified a grid spacing or number of grid points, the software defaults to 25 points in the X direction and the same grid spacing in the Y direction. The grid does not have to be square, but the map may look a little strange if it isn’t! We can alter the grid spacing by changing the number in the relevant box:

If you make a change and want to check how many grid points you have before proceeding, click on and the rest of the parameters will be updated. You may also change minimum and maximum X and Y values at this stage. Once you click on the map parameters will be defined.
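The relationship between the area, the grid spacing and the number of grid points can be sketched in Python. This assumes that 25 points in X means 25 grid nodes spanning the X range (24 intervals), which is one plausible reading and not a documented PG2000 rule; the function name `default_grid` is hypothetical.

```python
import math


def default_grid(min_x, max_x, min_y, max_y, nx=25):
    """Return (spacing, nx, ny) for a square-celled grid over the rectangle.

    Assumption: nx nodes span the full X range, and the same spacing is
    reused in Y, with as many nodes as fit between min_y and max_y.
    """
    if nx < 2:
        raise ValueError("need at least two grid points in X")
    spacing = (max_x - min_x) / (nx - 1)
    # The small tolerance guards against floating point round-off.
    ny = math.floor((max_y - min_y) / spacing + 1e-9) + 1
    return spacing, nx, ny


# Made-up rectangle, twice as wide as it is tall.
spacing, nx, ny = default_grid(0.0, 240.0, 0.0, 120.0)
print(spacing, nx, ny, nx * ny)
```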

Interpolating a grid of points produces a sketch map on the screen. The shading information for the contour levels will appear in the left hand box and the map itself in the right. A shaded square will be displayed on the map to show you which point is being estimated in addition to the information in the prompt box. You may copy the screen to your printer at any stage during the estimation process.

The map will not show the sampled locations within the map area, since these are "honoured" by the interpolation process. If you want to see where the data lies, click on the appropriate selection from the options bar at the top of the page:

You can see that in this example only 4 samples lie inside the selected area. The samples outside the area are used in the estimation of all points within the search radius, so that the boundary may be considered "soft".

I opted to plot all the sample locations for this display.

If you do not want the sample locations shown on the map, you can remove them by clicking the appropriate button on the options bar:

This option bar also allows you to return to the main menu options when you are ready.

Finishing the Tutorial

Clicking on this menu item or on will end your run with the software. You will see the closing down dialog box:

closing_down

The above Tutorial session should serve only to illustrate a possible use of the various routines from PG2000. Try running the program again, choosing your own responses. Try looking at reef width instead of grade. This variable has a standard two parameter lognormal distribution. Try reading in one of the other data files which are provided, say, samples.dat.

General Notes

There are a few points which you may have noted in following the Tutorial session above. Most of the routines communicate between themselves, without you having to worry about getting the right information from one to the other. For example, after you read in the complete contents of the data file, the routines ask which of the variables you actually want to analyse. This information is then stored internally and may be accessed by any of the other routines. This is a feature of most of PG2000, in that it will recall what you chose previously and ask whether this is to change or not. You should bear this in mind if you are analysing more than one data file in a single run. In particular, the boundary used in mapping will be remembered. If you change data file, or even which variables you analyse, the boundary will not automatically update.

PG2000 does not distinguish between upper and lower case letters, so you may type in whatever you find most pleasing. When the program requires a numerical answer, your input will be checked to make sure that it is actually a number. If you type in any illegal characters and press ENTER, the checking routine will filter out the unacceptable characters which you type. It should be noted that, if the routine is expecting a whole number then a decimal point is unacceptable. Much of the numerical input is checked for valid values.

A copy of this run should have been made on a file called GHOST.LIS unless you changed the name at the beginning of the run. Send this file to your printer if you want a record of the analysis or look at it with Wordpad or Notepad.

PG2000 — like any computer software — is not completely error-free. Neither is it fool-proof. You can always get out of the software by right clicking on the Taskbar. This will invoke the 'End Task' facility to close the Window without damaging the rest of your system. If you cannot figure out what went wrong, note down as much information as you can about the program you were running, the data you were using and exactly where it broke down. Contact your supplier locally or Geostokos direct for assistance, software@kriging.com. Send us the ghost.lis file and (if you can) the data you were analysing at the time.