These materials were developed by Kenneth E. Foote and Donald J. Huebner, Department of Geography, University of Texas at Austin, 1996. These materials may be used for study, research, and education in not-for-profit applications. If you link to or cite these materials, please credit the authors, Kenneth E. Foote and Donald J. Huebner, The Geographer's Craft Project, Department of Geography, The University of Colorado at Boulder. These materials may not be copied to or issued from another Web server without the authors' express permission. Copyright © 2000 All commercial rights are reserved. If you have comments or suggestions, please contact the author or Kenneth E. Foote at k.foote@colorado.edu .
Examples of data standards:
Documentation is of critical importance in large GIS projects because the dataset will almost certainly outlive the people who created it. That is, a GIS for a municipal, state, or AM/FM (automated mapping/facilities management) application is usually designed to last 50-100 years. The staff who entered the data may have long since retired when a question arises about the characteristics of their work, so written documentation is essential. Some projects actually place information about data quality and quality control directly in a GIS dataset as independent layers. An example of data quality reports is:
Non-spatial attribute data should also be checked against either reality or a source of equal or greater quality. The particular tests employed will, of course, vary with the type of data used and its level of measurement. Indeed, many different tests have been developed to assess the quality of interval, ordinal, and nominal data. Both parametric and nonparametric statistical tests can be employed to compare true values (those observed "on the ground") with those recorded in the dataset.
Cohen's Kappa provides just one example of the tests employed, this one for nominal data. The following example shows how data on land cover stored in a database can be tested against reality.
See Attribute Accuracy and Calculating Cohen's Kappa.
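As a rough illustration of the kind of test described above, the following sketch computes Cohen's Kappa for land-cover classes observed on the ground and the classes recorded in a database. The function name, the class labels, and the ten check sites are hypothetical and are not taken from the linked example.

```python
from collections import Counter

def cohens_kappa(observed, recorded):
    """Cohen's kappa for two equal-length sequences of nominal labels."""
    if len(observed) != len(recorded) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(observed)
    # Proportion of sites where the database agrees with the field check.
    p_o = sum(o == r for o, r in zip(observed, recorded)) / n
    # Agreement expected by chance, from the marginal class frequencies.
    obs_counts = Counter(observed)
    rec_counts = Counter(recorded)
    p_e = sum(obs_counts[c] * rec_counts[c] for c in obs_counts) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical land-cover classes at ten check sites (illustrative only).
ground = ["forest", "forest", "forest", "forest", "crop",
          "crop", "crop", "urban", "urban", "urban"]
database = ["forest", "forest", "forest", "crop", "crop",
            "crop", "urban", "urban", "urban", "forest"]

kappa = cohens_kappa(ground, database)
print(f"kappa = {kappa:.3f}")
```

Here seven of the ten sites agree, but about a third of that agreement would be expected by chance alone, so Kappa is noticeably lower than the raw 70% agreement would suggest.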
This process of checking and calibrating a GIS is often referred to as Sensitivity Analysis. Sensitivity analysis allows the user to test how variations in data and modeling procedure influence a GIS solution. What the user does is vary the inputs of a GIS model, or the procedure itself, to see how each change alters the solution. In this way, the user can judge quite precisely how data quality and error will influence subsequent modeling.
This is quite straightforward with interval/ratio input data. The user tests to see how an incremental change in an input variable changes the output of the system. From this, the user can derive "marginal sensitivity" to an input and establish "marginal weights" to compensate for error.
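A minimal sketch of the idea, assuming a simple numeric model: each input is changed by a small increment while the others are held at their baseline values, and the change in output per unit of input is recorded. The `suitability` model, its coefficients, and the baseline values are invented for illustration.

```python
def marginal_sensitivity(model, inputs, name, step):
    """Change in model output per unit change in one input (finite difference)."""
    bumped = dict(inputs)
    bumped[name] = inputs[name] + step
    return (model(**bumped) - model(**inputs)) / step

# Hypothetical suitability model; the coefficients are illustrative only.
def suitability(slope_pct, distance_km, rainfall_mm):
    return 100.0 - 2.0 * slope_pct - 5.0 * distance_km + 0.01 * rainfall_mm

baseline = {"slope_pct": 8.0, "distance_km": 3.0, "rainfall_mm": 900.0}

# Vary one input at a time, holding the others at their baseline values.
sensitivities = {
    name: marginal_sensitivity(suitability, baseline, name, step=1.0)
    for name in baseline
}
for name, value in sensitivities.items():
    print(f"{name}: {value:+.3f} output units per input unit")
```

In this made-up model the output responds most strongly to distance, so an error of one unit in distance would matter far more than an error of one unit in rainfall.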
But sensitivity analysis can also be applied to nominal (categorical) and ordinal (ranked) input data. In these cases, data may be purposefully misclassified or misranked to see how such errors will change a solution.
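One way to sketch this for nominal data is to switch a chosen fraction of cells to a wrong class and recompute the solution at each error rate. Everything below is an assumption made for illustration: the class list, the cell counts, the error rates, and the stand-in "solution" (the share of cells classed as forest).

```python
import random

def misclassify(cells, classes, rate, seed=0):
    """Return a copy of cells with roughly `rate` of them switched to a wrong class."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    out = []
    for label in cells:
        if rng.random() < rate:
            # Deliberately pick a class other than the recorded one.
            label = rng.choice([c for c in classes if c != label])
        out.append(label)
    return out

def forest_share(cells):
    """Hypothetical 'solution': fraction of cells classed as forest."""
    return cells.count("forest") / len(cells)

classes = ["forest", "crop", "urban"]
cells = ["forest"] * 60 + ["crop"] * 30 + ["urban"] * 10

baseline = forest_share(cells)
# Recompute the solution under increasing, purposeful misclassification.
trials = {
    rate: forest_share(misclassify(cells, classes, rate))
    for rate in (0.0, 0.05, 0.10, 0.20)
}
for rate, share in trials.items():
    print(f"error rate {rate:.2f}: forest share {share:.2f} (baseline {baseline:.2f})")
```

Repeating the run with several seeds would give a spread of outcomes at each error rate rather than a single figure.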
Sensitivity analysis can also be used during system design and development to test the levels of precision and accuracy required to meet system goals. That is, users can experiment with data of differing levels of precision and accuracy to see how they perform. If a test solution is not accurate or precise enough in one pass, the levels can be refined and tested again. Such testing of accuracy and precision is very important in large GIS projects that will generate large quantities of data. It is of little use (and tremendous cost) to gather and store data to levels of accuracy and precision beyond what is needed to meet a particular modeling need.
Sensitivity analysis can also be useful at the design stage in testing the theoretical parameters of a GIS model. It is sometimes the case that a factor, though of seemingly great theoretical importance to a solution, proves to be of little value in solving a particular problem. For example, soil type is certainly important in predicting crop yields but, if soil type varies little in a particular region, it is a waste of time entering it into a dataset designed for this purpose. Users can check on such situations by selectively removing certain data layers from the modeling process. If they make no difference to the solutions, then no further data entry needs to be made.
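The layer-removal check can be sketched as follows, assuming a simple weighted-overlay model in which each site receives a score per layer. The sites, scores, and weights are hypothetical; the soil layer is given the same score everywhere to mimic a region where soil type varies little.

```python
def rank_sites(layers, weights):
    """Rank site names by the weighted sum of their layer scores, best first."""
    sites = next(iter(layers.values())).keys()
    totals = {
        s: sum(weights[name] * layer[s] for name, layer in layers.items())
        for s in sites
    }
    return sorted(totals, key=totals.get, reverse=True)

# Hypothetical layer scores for three sites (illustrative only).
layers = {
    "rainfall": {"A": 0.9, "B": 0.3, "C": 0.6},
    "slope":    {"A": 0.2, "B": 0.9, "C": 0.3},
    "soil":     {"A": 0.7, "B": 0.7, "C": 0.7},  # varies little across the region
}
weights = {"rainfall": 0.4, "slope": 0.4, "soil": 0.2}

full = rank_sites(layers, weights)

# Drop one layer at a time and see whether the ranking survives.
unchanged_without = {}
for dropped in layers:
    reduced = {k: v for k, v in layers.items() if k != dropped}
    unchanged_without[dropped] = rank_sites(reduced, weights) == full

print("full ranking:", full)
print("ranking unchanged without layer:", unchanged_without)
```

In this toy case only the soil layer can be dropped without changing the ranking, which is the signal that further data entry for that layer would add nothing to this particular solution.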
To see how sensitivity analysis might be applied to a problem concerned with upgrading a municipal water system, go to the following section on Sensitivity Analysis.
In closing this example, it is useful to note that the results were reported in terms of ranking. No single solution was optimal in all cases. Picking a single, best solution might be misleading. Instead, the sites are simply ranked by the number of situations in which each comes out ahead.
As examples of what this means, consider:
Population figures are reported in whole numbers (5,421, 10,238, etc.), meaning that calculations can be carried to 1 decimal place (density of 21.5, mortality rate of 10.3).
If forest coverage is measured to the closest 10 meters, then calculations can be rounded to the closest 1 meter.
A second problem is False Certainty, that is, reporting results with a degree of certitude unsupported by the natural variability of the underlying data. Most GIS solutions involve employing a wide range of data layers, each with its own natural dynamics and variability. Combining these layers can exacerbate the problem of arriving at a single, precise solution. Sensitivity analysis (discussed above) helps to indicate how much variation in one data layer will affect a solution. But GIS users should carry this lesson all the way to final solutions. These solutions are likely to be reported in terms of ranges, confidence intervals, or rankings. In some cases, this involves preparing high, low, and mid-range estimates of a solution based upon maximum, minimum, and average values of the data used in a calculation.
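A minimal sketch of high, low, and mid-range estimates, assuming a model whose output rises with every input (otherwise the minimum inputs would not necessarily give the low estimate). The `harvest_tonnes` model and the sample values are invented for illustration.

```python
from statistics import mean

def low_mid_high(model, samples):
    """Evaluate a model at the minimum, average, and maximum of each input.

    Assumes the model output increases with every input.
    """
    low = model(**{k: min(v) for k, v in samples.items()})
    mid = model(**{k: mean(v) for k, v in samples.items()})
    high = model(**{k: max(v) for k, v in samples.items()})
    return low, mid, high

# Hypothetical model: total harvest from cultivated area and yield per hectare.
def harvest_tonnes(area_ha, tonnes_per_ha):
    return area_ha * tonnes_per_ha

# Hypothetical repeated measurements of each input.
samples = {
    "area_ha": [95.0, 100.0, 105.0],
    "tonnes_per_ha": [2.0, 3.0, 4.0],
}

low, mid, high = low_mid_high(harvest_tonnes, samples)
print(f"low {low:.0f} t, mid {mid:.0f} t, high {high:.0f} t")
```

Reporting the result as a range of 190 to 420 tonnes, with a mid-range estimate of 300, conveys the variability in the inputs in a way that a single figure would not.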
You will notice that the case considered above pertaining to an optimal site selection problem reported its results in terms of rankings. Each site was optimal in certain confined situations, but only a couple proved optimal in more than one situation. The results rank the number of times each site came out ahead in terms of total cost.
In situations where statistical analysis is possible, the use of confidence intervals is recommended. Confidence intervals establish the probability of a solution falling within a certain range (e.g., a 95% probability that a solution falls between 100m and 150m).
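As one possible illustration, the sketch below computes a 95% confidence interval for the mean of several repeated solutions. It assumes a normal approximation, which is only one of several ways to build such an interval; with very few values an interval based on the t distribution would be somewhat wider. The five distances are made up.

```python
from statistics import NormalDist, mean, stdev

def confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    # Two-sided critical value, e.g. about 1.96 for a 95% level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(values) / n ** 0.5
    centre = mean(values)
    return centre - half_width, centre + half_width

# Hypothetical distances (in meters) from five repeated model runs.
solutions_m = [110, 120, 125, 130, 140]

lower, upper = confidence_interval(solutions_m)
print(f"95% confidence interval: {lower:.1f} m to {upper:.1f} m")
```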
Last revised on 2002.4.10. k.foote@colorado.edu