Simple GLM
These data represent the number of new AIDS cases in the Netherlands
during the greater part of the 80s and early 90s. If one
postulates a standard diffusion model for the spread of a disease, then
one would expect the number of new cases to rise exponentially.
We will check that.
y <- c(12,14,33,50,67,74,123,141,165,204,253,246,240) #enter the data by hand
t <- 1:13 #time periods (years after 1980)
plot(t+1980,y,xlab='Year',ylab='New AIDS cases',ylim=c(0,280))
m0 <- glm(y~t,poisson)
summary(m0)
par(mfrow=c(2,2))
plot(m0)
1-pchisq(80.686,11)
The diagnostics show trends in the residuals, let's try a quadratic
function in time. The deviance is also just about 0. These
are not good signs. Let's fit a quadratic time trend.
m1 <- glm(y~t+I(t^2),poisson)
summary(m1)
par(mfrow=c(2,2))
plot(m1)
beta.1 <- summary(m1)$coefficients[2,]
beta.1
ci <- c(beta.1[1]-2*beta.1[2],beta.1[1]+2*beta.1[2])
ci #95% condifence intervals
new.t <- seq(1,13,length=100)
fv <- predict(m1,data.frame(t=new.t),se=TRUE)
plot(t+1980,y,xlab='Year',ylab='New AIDS cases',ylim=c(0,280))
lines(new.t+1980,exp(fv$fit))
lines(new.t+1980,exp(fv$fit+2*fv$se.fit),lty=2)
lines(new.t+1980,exp(fv$fit-2*fv$se.fit),lty=2)
This model appears much better. It also appears that the rate of spread was declining during the tail end of the period.
Suppose one isn't happy with the assumption of a quadratic function in
time. In that case, you might consider splines for time.
Splines are a flexible way to smooth data, and are commonly
used in generalized linear modeling.
library(splines)
m2 <- glm(y~ns(t,df=3),poisson)
par(mfrow=c(2,2))
plot(m2)
fv2 <- predict(m2,data.frame(t=new.t),se=TRUE)
plot(t+1980,y,xlab='Year',ylab='New AIDS cases',ylim=c(0,280))
lines(new.t+1980,exp(fv2$fit))
lines(new.t+1980,exp(fv2$fit+2*fv2$se.fit),lty=2)
lines(new.t+1980,exp(fv2$fit-2*fv2$se.fit),lty=2)
For
larger datasets, I would recommend splines, but not necessarily for
this small dataset. The fit with df=4 is comparable to the quadratic
with df=2, and degrees of freedom are quite precious when n=13.
Spatial Data
The data for this portion come from the gamair package in R.
Since there is a chance that you can't load this
package into R in the labs, I have copied the data to the website here.
The data consist of a survey of the presence or
absence breeding Crested Larks in Portugal. The survey was conducted as
part of an effort to compile the Portugese Atlas of Breeding Birds.
The whole of Portugal was divided into 10km x 10km squares,
each square was further subdivided into 25, 2km x 2km cells, and a
number of cells (usually 6) were selected from each square.
Each selected cell was then surveyed to note wheter the
Crested Lark was breeding within it.
Question: Is this a model based or design based sample? Why
is this approach appropriate for the purpose of constructing a national
atlas?
Plot the raw data:
load('bird.Rdata')
species <- "crestlark"
op<-par(bg="white",mfrow=c(1,1),mar=c(5,5,1,1))
ind <- bird[[species]]==0&!is.na(bird[[species]])
plot(bird$y[ind]/1000,1000-bird$x[ind]/1000,pch=19,cex=.3,col="white",
ylab="km west",xlab="km north",cex.lab=1.4,cex.axis=1.3,type="n",asp=1)
polygon(c(4000,4700,4700,4000),c(250,250,600,600),col="grey",border="black")
points(bird$y[ind]/1000,1000-bird$x[ind]/1000,pch=19,cex=.3,col="white")
ind <- bird[[species]]==1&!is.na(bird[[species]])
with(bird,points(y[ind]/1000,1000-x[ind]/1000,pch=19,cex=.3))
par(op)
Note, there are some holes in the data. At the time
that this example was published in R, the survey was not yet complete.
The rows of interest in the dataset are 'x', and 'y', which are easting
and northing coordinates (in meters), 'crestlark', which is 1
if the bird is present, and 'QUADRICULA', which is a unique identifier
for which 10km x 10km grid the sample belongs to..
We will aggregate the data into these 10km squares so we can construct
a regular raster:
bird$n <- rep(1,length(bird$y)) # a vector of ones, when summed it will give the number of samples in each cell
bird$n[is.na(bird$crestlark)] <- NA
# Aggregate cells by quadricula
ba <- aggregate(data.matrix(bird),by=list(bird$QUADRICULA),
FUN=sum, na.rm=TRUE)
ba$x <- ba$x/25 # aggregate summed the locations also!, undo this
ba$y <- ba$y/25
Plotting this is fairly easy using the sp package. The
problem, is that, while the data are on a grid, not every grid cell has
data, and thus is not represented. I will use the
SpatialPixelsDataFrame function in the sp package which is designed to
deal with such 'sparse' grids.
library(sp)
grid.p <- SpatialPixelsDataFrame(points=cbind(ba$x,ba$y),
data=data.frame(ba$crestlark/ba$n))
spplot(grid.p,col.regions=brewer.pal(7,'Blues'),cuts=6)
The map noisy, and not particularly useful. For the purpose
of mapping, we will construct a smooth surface. We could do
something like indicator kriging on the previous map, but that doesn't
necessarily respec the [0,1] limits on prabability. We will
instead model each raster cell as a binomial probability, with a
spatially varying mean function. We will use the estimated
mean function as our map.
The mean, will be a number between 0 and 1 indicating the
probability that a 2km x 2km cell in that area contains breeding
Crested Larks.
ma <- gam(cbind(crestlark,n-crestlark)~s(x,y,k=100), data=ba,
family=binomial)
The s(x,y,k=100) term fits a smooth curve through the x and y
coordinates. The distribution family is a binomial (the
default link is logit). Also note that the Y variate contains
bot the number of cells with Crested Lark, and the number of
cells without.
fva <- predict(ma,bird,type='response')
grid.fv <- SpatialPixelsDataFrame(points=cbind(bird$x,bird$y),
data=data.frame(fva))
library(RColorBrewer)
spplot(grid.fv,col.regions=brewer.pal(7,'Blues'),cuts=6)
Looks good so far. The difference between modeling space this
way versus the way kriging does it, is that kriging assumes all of the
spatial variation is in the error, whereas we are now assuming all of
the spatial variation is in the systematic component (i.e. in the
mean). We should check the model residuals, to make sure
that there isn't any spatial correlation in the residuals.
library(geoR)
coords<-matrix(0,1004,2); coords[,1] <- ba$x; coords[,2] <- ba$y
gb <- list(data=residuals(ma,type='d'), coords=coords)
plot(variog(gb,max.dist=1e5))
Not bad, it almost looks like pure nugget. In fact, however,
there might be negative autocorrelation at near lags.
Remember, negative correlation shows up as
semivariogram values above the sill. Such short distance
negative autocorrelation is indicative of "oversmoothing" the data.
This suggests that we may be able to get a better fit by
smoothing less, however, oversmoothing is not necessarily a bad thing
when the interest in regional mapping, as opposed to point-wise
prediction.
We should also do standard residual diagnostics:
plot(fitted(ma),residuals(ma))
grid.resid <- SpatialPixelsDataFrame(points=cbind(ba$x,ba$y),
data=data.frame(residuals(ma)))
spplot(grid.resid,col.regions=brewer.pal(7,'BlRd'),cuts=6)
Not bad for a binomial residual plot. There don't appear to
be any overall trends. In general, residual plots for
binomial models are ugly, since only a few actual values are
possible (e.g. 0/2, 1/2,2/2 in a binomial model with n=2).
The map doesn't show much clustering, but that is what we
expected after seeing the variogram.