Lab 9
Point Patterns.
Useful packages in R for analyzing point patterns include: spatial, splancs, spatstat and spatialkernel.
There are also some useful methods in sp for storing
spatial patterns in a consistent manner and for plotting them.
Second Order Statistics
We will consider spatial analysis for three different processes: the
locations of cell centers in a histological section, the location of
Japanese black pines, and the location of California redwood trees.
Data Input and plotting
##############################
##############################
library(spatstat)
data(japanesepines)
data(redwoodfull)
data(cells)
summary(japanesepines)
summary(redwoodfull)
summary(cells)
# Turn these objects into 'SpatialPoints' objects
spjpines <- as(japanesepines, 'SpatialPoints')
spred <- as(redwoodfull, "SpatialPoints")
spcells <- as(cells, "SpatialPoints")
summary(spjpines)
summary(spred)
summary(spcells)
# Note that while second two are on the unit square,
# the first isn't.
# Let's fix that
library(maptools)
spjpines1 <- elide(spjpines, scale=T, unitsq = T)
summary(spjpines1)
####################################
####################################
At this point, all three datasets are in comparable form. I
want to plot them using the pretty xyplot function, so I need to put
all three datasets into one dataframe:
####################################
####################################
# Put the coordinates in a data frame
dpp<-data.frame(rbind(coordinates(spjpines1),
coordinates(spred),
coordinates(spcells)))
njap<-nrow(coordinates(spjpines1)) # num of pines
nred<-nrow(coordinates(spred)) # num of redwoods
ncells<-nrow(coordinates(spcells)) # num of cells
# Add labels to the dataset
dpp<-cbind(dpp,c(rep("JAPANESE",njap), rep("REDWOOD", nred),
rep("CELLS", ncells)))
names(dpp)<-c("x", "y", "DATASET") # Add column names to the
dataset
library(lattice)
print(xyplot(y~x|DATASET, data=dpp, pch=19, aspect=1,layout=c(3,1)))
####################################
####################################
Compare the appearance of these plots. Do they appear to be
clustered, random, or regular?
Second Order Statistics
Now we get to the heart of spatial analysis: establishing if these
points are consistent with the null of hypothesis of spatial
independence. It is actually quite difficult to test this
hypothesis rigorously: we do not know how to quantify and measure
"spatial independence." The usual method of attack, is to 1)
choose some statistic that we can measure, 2) calculate that statistic
for the data at hand, 3) find out what that distribution of
the statistic looks like under the null hypothesis of "spatial
independence," and 4) find out if our data's statistic is reasonable or
likely under the hypothesis of "spatial independence". The
problem is that for any statistic we choose, some processes that are
not spatially independent may give the exact same values as statistics
that are spatially independent. This is less likely for some
statistics than for others.
The first statistic we will look at is the "G statistic", or the
"nearest neighbor statistic." It measures, for any distance
r, the fraction of data points with a nearest neighbor less than r
units away.
####################################
####################################
r <- seq(0, sqrt(2)/6, by = 0.005) # Set up distances
envjap <- envelope(as(spjpines1, "ppp"), fun=Gest, r=r, nrank=2,
nsim=99)
envred <- envelope(as(spred, "ppp"), fun=Gest, r=r, nrank=2,
nsim=99)
envcells <- envelope(as(spcells, "ppp"), fun=Gest, r=r, nrank=2,
nsim=99)
####################################
####################################
The envelope function in the spatstat library is pretty cool, it
simulates datasets, applies any statistic you give it (in this case the
Gest function), and then returns lower and upper confidence intervals.
Note, the envelope function does not like 'SpatialPoints'
objects, but 'ppp' objects (as defined in the spatstat library).
We will plot these two different ways, one using the observed G
functions vs. distance, and one using the observed G function vs. the
theoretical G function. The theoretical values are
techinically only correct when their are not edge effects, but the
simulations will also be wrong in the same way, so it kinda doesn't
matter.
###################################################
###################################################
print(xyplot(obs~r|DATASET , data=Gresults, type="l", layout=c(3,1),
panel=function(x, y, subscripts)
{
lpolygon(c(x, rev(x)),
c(Gresults$lo[subscripts],
rev(Gresults$hi[subscripts])),
border="gray", fill="gray"
)
llines(x,
y, col="black", lwd=2)
},
main='G function',
xlab='Distance',
ylab='Observed'
))
print(xyplot(obs~theo|DATASET , data=Gresults, type="l", layout=c(3,1),
panel=function(x, y, subscripts)
{
lpolygon(c(x, rev(x)),
c(Gresults$lo[subscripts],
rev(Gresults$hi[subscripts])),
border="gray", fill="gray"
)
llines(x,
y, col="black", lwd=2)
},
main='G function',
xlab='Theoretical',
ylab='Observed'
))
################################################3
#################################################
Explain the distance to the nearest neighbor in these three datasets.
How are these datasets consistent or inconsistent with the
hypothesis of no spatial independence?
A similar statistic is the F statistic. Rather than measuring
the distance from any datum to its nearest neighbor, the F statistic
measure the distance from an arbitrary location (whether there is datum
there or not) to its nearest neighbor. This is sometimes
called the "empty space function"; it measures the average size of the
empty spaces in the dataset.
################################################
#################################################
envjap <- envelope(as(spjpines1, "ppp"), fun=Fest, r=r, nrank=2,
nsim=99)
envred <- envelope(as(spred, "ppp"), fun=Fest, r=r, nrank=2,
nsim=99)
envcells <- envelope(as(spcells, "ppp"), fun=Fest, r=r, nrank=2,
nsim=99)
Fresults <- rbind(envjap, envred, envcells)
Fresults <- cbind(Fresults,
DATASET=rep(c("JAPANESE", "REDWOOD", "CELLS"),
each=length(r)))
print(xyplot(obs~r|DATASET , data=Fresults, type="l", layout=c(3,1),
panel=function(x, y, subscripts)
{
lpolygon(c(x, rev(x)),
c(Fresults$lo[subscripts],
rev(Fresults$hi[subscripts])),
border="gray", fill="gray"
)
llines(x,
y, col="black", lwd=2)
},
main='F function',
xlab='Distance',
ylab='Observed'
))
print(xyplot(obs~theo|DATASET , data=Fresults, type="l", layout=c(3,1),
panel=function(x, y, subscripts)
{
lpolygon(c(x, rev(x)),
c(Fresults$lo[subscripts],
rev(Fresults$hi[subscripts])),
border="gray", fill="gray"
)
llines(x,
y, col="black", lwd=2)
},
main='F function',
xlab='Theoretical',
ylab='Observed'
))
#################################################
#################################################
Explain the
distance to the nearest points in these three datasets. How
are
these datasets consistent or inconsistent with the hypothesis of no
spatial independence?
Finally, we will consider the K statistic. Unlike the the
previous statistics, which just consider the distance to the nearest
point, this statistic measures the distance to all points.
What fraction of points are within distance r=.05? within
distance r=.1? within distance r=.15? etc. There is yet another way to
plot these functions. The second way plots observed -
theoretical vs distance. For the K function, this plotting
method is the most common, since positive values represent positive
clustering, and negative values represent negative clustering (i.e.
regularity). The K function is sometimes helpful to detect
clustering at different scales.
#################################################
#################################################
envjap <- envelope(as(spjpines1, "ppp"), fun=Kest, r=r, nrank=2,
nsim=99)
envred <- envelope(as(spred, "ppp"), fun=Kest, r=r, nrank=2,
nsim=99)
envcells <- envelope(as(spcells, "ppp"), fun=Kest, r=r, nrank=2,
nsim=99)
Kresults <- rbind(envjap, envred, envcells)
Kresults <- cbind(Kresults,
DATASET=rep(c("JAPANESE", "REDWOOD", "CELLS"),
each=length(r)))
print(xyplot(obs~r|DATASET , data=Kresults, type="l", layout=c(3,1),
panel=function(x, y, subscripts)
{
lpolygon(c(x, rev(x)),
c(Kresults$lo[subscripts],
rev(Kresults$hi[subscripts])),
border="gray", fill="gray"
)
llines(x,
y, col="black", lwd=2)
},
main='L function',
xlab='Theoretical',
ylab='Observed'
))
print(xyplot((obs-theo)~r|DATASET , data=Kresults, type="l",
layout=c(3,1),
panel=function(x, y, subscripts)
{
Ktheo<- Kresults$theo[subscripts]
lpolygon(c(x, rev(x)),
c(Kresults$lo[subscripts]-Ktheo,
rev(Kresults$hi[subscripts]-Ktheo)),
border="gray", fill="gray"
)
llines(x,
y, col="black", lwd=2)
},
main='K function',
xlab='Distance',
ylab='Observed-Theoretical',
ylim=c(-.06,.06)
))
#################################################
#################################################
Analysis of case/control Point Patterns
We will consider the analysis of asthma for a region of North Derby.
The data consists of shapefile of point data, roads,
pollution sources, and roads and a boundary for teh study region.
In these data, the cases represent children with asthma, and
the controls represent children randomly selected from birth records.
Analysing clustering of asthma is difficult, since we expect
the locations of all children to be clustered as well.
After unzipping the data, let's do some exploratory plotting.
Note, you may have to change the location of the directories
in the readOGR command to suit where you unzipped the data.
#################################################
#################################################
library(rgdal)
spasthma <- readOGR('.\\north_derby_asthma','spasthma')
spbdry <- readOGR('.\\north_derby_asthma','spbdry')
spsrc <- readOGR('.\\north_derby_asthma','spsrc')
sproads <- readOGR('.\\north_derby_asthma','sproads')
plot(spbdry, axes=TRUE)
plot(sproads, add=TRUE, col='red')
plot(spasthma, add=TRUE, pch=c(4,17)[(spasthma$Asthma == "case") + 1],
cex=c(0.6, 0.75)[(spasthma$Asthma == "case") + 1])
plot(spsrc, pch=22, add=TRUE, cex=1.2, bg="grey60")
#################################################
#################################################
We will now calculate kenel smoothed densities and plot them.
#############################################
#############################################
# Construct a boundary polygon for use
pbdry <- slot(slot(slot(spbdry, "polygons")[[1]],
"Polygons")[[1]], "coords")
bwasthma <- .125
library(maptools)
library(splancs)
# Here is a lot of work to get a grid topology for use in the kernel
functions
sG <- Sobj_SpatialGrid(spbdry, maxDim=50)$SG #Create a grid over
the boundary
gt <- slot(sG, "grid") #extract just the grid topology
summary(gt)
cases<-spasthma[spasthma$Asthma=="case",]
ncases<-nrow(cases)
controls<-spasthma[spasthma$Asthma=="control",]
ncontrols<-nrow(controls)
kcases<-spkernel2d(cases, pbdry, h0=bwasthma, gt)
kcontrols<-spkernel2d(controls, pbdry, h0=bwasthma, gt)
# Turn predictions into a data frame
df0 <- data.frame(kcases=kcases, kcontrols=kcontrols)
# Find out which are NA
idxna <- complete.cases(df0)
spkratio <- SpatialGridDataFrame(gt, data=df0)
#spkratio <- as(spkratio0, "SpatialPixelsDataFrame")
spkratio$kratio <- spkratio$kcases/spkratio$kcontrols
# Turn divide by 0 to NA
is.na(spkratio$kratio) <- !is.finite(spkratio$kratio)
spplot(spkratio,c('kcases','kcontrols'))
spplot(spkratio,'kratio')
#############################################
#############################################
We want to know if the two densities are equal (up to a constant
proportion). Kelsall and Diggle recommend the statistic
T=integral ( ρhat-ρ(x))2dx
where ρ(x) is the actual ratio, and ρhat is the proportion of the population that is affected.
We
will use this statistic to test the hypothesis of constant risk by
simulating the population. We will keep the locations fixed, but
we will randomly relabel points as cases or controls.
#############################################
#############################################
niter <- 99 # number of monte carlo iterations
ratio <- rep(NA, niter) # allocate space for results of ratios
pvaluemap <- rep(0, length(idxna)) # allocate space for pvalues
pvaluemap[!idxna] <- NA
# Allocate space for random labeling ratios
rlabelratio <- matrix(NA, nrow=niter, ncol=sum(idxna))
for(i in 1:niter) # For each monte carlo iteration
{
# perform index relabeling
idxrel <- sample(spasthma$Asthma) == "case" # randomly sample population
# idxrel is TRUE is sample is case, FALSE if control
casesrel <- spasthma[idxrel==T,] # Cases
controlsrel <- spasthma[idxrel==F,] # Controls
# Calculate kernel smooths
kcasesrel <- spkernel2d(casesrel, pbdry, h0=bwasthma, gt)
kcontrolsrel <- spkernel2d(controlsrel, pbdry, h0=bwasthma, gt)
# Calculate ratio
kratiorel <- kcasesrel[idxna]/kcontrolsrel[idxna]
is.na(kratiorel) <- !is.finite(kratiorel)
rlabelratio[i,] <- kratiorel
pvaluemap[idxna] <- pvaluemap[idxna] + (spkratio$kratio[idxna] < kratiorel)
}
# Find the cells that were finite in all of the simulations
idxna2 <- apply(rlabelratio, 2, function(x) all(is.finite(x)))
# calculate map average of rho for each simulation
rhomean <- apply(rlabelratio[, idxna2], 2, mean)
# find area of each cell
c <- prod(slot(gt, "cellsize"))
# Calculate difference between true surface and expected
ratiorho <- c*sum(((spkratio$kratio[idxna])[idxna2]-ncases/ncontrols)^2)
# Calculate difference between simulated surfaces and expected
ratio <- c*apply(rlabelratio[,idxna2], 1,
function(X, rho0 ){sum((X-rho0)^2)}, rho0=ncases/ncontrols
)
# Calculate p value
pvaluerho <- (sum(ratiorho < ratio)+1)/(niter+1)
#############################################
#############################################
Report the p-value for this hypothesis.
Suppose
we had rejected the hypothesis. We might want to know where the
high spots were. For this, we will use the p-value map.
#############################################
#############################################
spkratio$pvaluemap <- (pvaluemap+1)/(niter+1)
imgpvalue <- as.image.SpatialGridDataFrame(spkratio["pvaluemap"])
clpvalue <- contourLines(imgpvalue, levels=c(0,.05, .95, 1))
cl <- ContourLines2SLDF(clpvalue)
###################################################
# Layout the contour lines for later plotting.
lyt05 <- list("sp.lines", cl[cl$level == "0.05",], lwd=2, lty=2, col="grey95") # 5% CI
lyt95 <- list("sp.lines", cl[cl$level == "0.95",], lwd=2, lty=1) # 95% CI
lytb <- list("sp.polygons", spbdry) # boundary
lytp <- list("sp.points", spsrc, cex=0.9, pch=4, col="black", lwd=3) #pollution sources
brks <- quantile(spkratio$kratio[spkratio$kratio>0], seq(0,1,2/10), na.rm=TRUE)
brks[1] <- 0
lbrks <- formatC(brks, 3, 6, "g", " ") # break labels
cols <- brewer.pal(length(brks)-1,'Blues')
colorkey<-list(labels=lbrks,
at=seq(0,1,.2, height=.5))
print(spplot(spkratio, "kratio",
col.regions=cols,
do.log=TRUE,
colorkey=colorkey,
at=brks,
sp.layout=list(lyt05, lyt95, lytb, lytp)
))
#############################################
#############################################