Simple GLM

These data represent the number of new AIDS cases in the Netherlands during the greater part of the 80s and early 90s.  If one postulates a standard diffusion model for the spread of a disease, then one would expect the number of new cases to rise exponentially.  We will check that.
y <- c(12,14,33,50,67,74,123,141,165,204,253,246,240) #enter the data by hand
t <- 1:13 #time periods (years after 1980)

plot(t+1980,y,xlab='Year',ylab='New AIDS cases',ylim=c(0,280))

m0 <- glm(y~t,poisson)
summary(m0)
par(mfrow=c(2,2))
plot(m0)
1-pchisq(80.686,11)
The diagnostics show trends in the residuals, let's try a quadratic function in time.  The deviance is also just about 0.  These are not good signs.  Let's fit a quadratic time trend.
m1 <- glm(y~t+I(t^2),poisson)
summary(m1)
par(mfrow=c(2,2))
plot(m1)

beta.1 <- summary(m1)$coefficients[2,]
beta.1
ci <- c(beta.1[1]-2*beta.1[2],beta.1[1]+2*beta.1[2])
ci #95% condifence intervals


new.t <- seq(1,13,length=100)
fv <- predict(m1,data.frame(t=new.t),se=TRUE)
plot(t+1980,y,xlab='Year',ylab='New AIDS cases',ylim=c(0,280))
lines(new.t+1980,exp(fv$fit))
lines(new.t+1980,exp(fv$fit+2*fv$se.fit),lty=2)
lines(new.t+1980,exp(fv$fit-2*fv$se.fit),lty=2)
This model appears much better.  It also appears that the rate of spread was declining during the tail end of the period.
Suppose one isn't happy with the assumption of a quadratic function in time.  In that case, you might consider splines for time.  Splines are a flexible way to smooth data, and are commonly used in generalized linear modeling.
library(splines)
m2 <- glm(y~ns(t,df=3),poisson)
par(mfrow=c(2,2))
plot(m2)


fv2 <- predict(m2,data.frame(t=new.t),se=TRUE)
plot(t+1980,y,xlab='Year',ylab='New AIDS cases',ylim=c(0,280))
lines(new.t+1980,exp(fv2$fit))
lines(new.t+1980,exp(fv2$fit+2*fv2$se.fit),lty=2)
lines(new.t+1980,exp(fv2$fit-2*fv2$se.fit),lty=2)

For larger datasets, I would recommend splines, but not necessarily for this small dataset. The fit with df=4 is comparable to the quadratic with df=2, and degrees of freedom are quite precious when n=13.

Spatial Data

The data for this portion come from the gamair package in R.  Since there is a chance that you can't load this package into R in the labs, I have copied the data to the website here.  The data consist of a survey of the presence or absence breeding Crested Larks in Portugal. The survey was conducted as part of an effort to compile the Portugese Atlas of Breeding Birds.  The whole of Portugal was divided into 10km x 10km squares, each square was further subdivided into 25, 2km x 2km cells, and a number of cells (usually 6) were selected from each square.  Each selected cell was then surveyed to note wheter the Crested Lark was breeding within it.


Question: Is this a model based or design based sample?  Why is this approach appropriate for the purpose of constructing a national atlas?

Plot the raw data:
load('bird.Rdata')
species <- "crestlark"
op<-par(bg="white",mfrow=c(1,1),mar=c(5,5,1,1))
ind <- bird[[species]]==0&!is.na(bird[[species]])
plot(bird$y[ind]/1000,1000-bird$x[ind]/1000,pch=19,cex=.3,col="white",
ylab="km west",xlab="km north",cex.lab=1.4,cex.axis=1.3,type="n",asp=1)
polygon(c(4000,4700,4700,4000),c(250,250,600,600),col="grey",border="black")
points(bird$y[ind]/1000,1000-bird$x[ind]/1000,pch=19,cex=.3,col="white")
ind <- bird[[species]]==1&!is.na(bird[[species]])
with(bird,points(y[ind]/1000,1000-x[ind]/1000,pch=19,cex=.3))
par(op)
Note, there are some holes in the data.  At the time that this example was published in R, the survey was not yet complete.

The rows of interest in the dataset are 'x', and 'y', which are easting and northing coordinates (in meters), 'crestlark', which is 1 if the bird is present, and 'QUADRICULA', which is a unique identifier for which 10km x 10km grid the sample belongs to..

We will aggregate the data into these 10km squares so we can construct a regular raster:
bird$n <- rep(1,length(bird$y)) # a vector of ones, when summed it will give the number of samples in each cell
bird$n[is.na(bird$crestlark)] <- NA
# Aggregate cells by quadricula
ba <- aggregate(data.matrix(bird),by=list(bird$QUADRICULA),
FUN=sum, na.rm=TRUE)
ba$x <- ba$x/25 # aggregate summed the locations also!, undo this
ba$y <- ba$y/25



Plotting this is fairly easy using the sp package.  The problem, is that, while the data are on a grid, not every grid cell has data, and thus is not represented.  I will use the SpatialPixelsDataFrame function in the sp package which is designed to deal with such 'sparse' grids.

library(sp)
grid.p <- SpatialPixelsDataFrame(points=cbind(ba$x,ba$y),
data=data.frame(ba$crestlark/ba$n))
spplot(grid.p,col.regions=brewer.pal(7,'Blues'),cuts=6)

The map noisy,  and not particularly useful. For the purpose of mapping, we will construct a smooth surface.  We could do something like indicator kriging on the previous map, but that doesn't necessarily respec the [0,1] limits on prabability.  We will instead model each raster cell as a binomial probability, with a spatially varying mean function.  We will use the estimated mean function as our map. The mean, will be a number between 0 and 1 indicating the probability that a 2km x 2km cell in that area contains breeding Crested Larks.
ma <- gam(cbind(crestlark,n-crestlark)~s(x,y,k=100), data=ba, 
family=binomial)

The s(x,y,k=100) term fits a smooth curve through the x and y coordinates.  The distribution family is a binomial (the default link is logit).  Also note that the Y variate contains bot the number of cells with Crested Lark, and the number of cells without.



fva <- predict(ma,bird,type='response')
grid.fv <- SpatialPixelsDataFrame(points=cbind(bird$x,bird$y),
data=data.frame(fva))
library(RColorBrewer)
spplot(grid.fv,col.regions=brewer.pal(7,'Blues'),cuts=6)

Looks good so far.  The difference between modeling space this way versus the way kriging does it, is that kriging assumes all of the spatial variation is in the error, whereas we are now assuming all of the spatial variation is in the systematic component (i.e. in the mean).  We should check the model residuals, to make sure that  there isn't any spatial correlation in the residuals.

library(geoR)
coords<-matrix(0,1004,2); coords[,1] <- ba$x; coords[,2] <- ba$y
gb <- list(data=residuals(ma,type='d'), coords=coords)
plot(variog(gb,max.dist=1e5))

Not bad, it almost looks like pure nugget.  In fact, however, there might be negative autocorrelation at near lags.  Remember, negative correlation shows up  as semivariogram values above the sill.  Such short distance negative autocorrelation is indicative of "oversmoothing" the data.  This suggests that we may be able to get a better fit by smoothing less, however, oversmoothing is not necessarily a bad thing when the interest in regional mapping, as opposed to point-wise prediction.

We should also do standard residual diagnostics:

plot(fitted(ma),residuals(ma))
grid.resid <- SpatialPixelsDataFrame(points=cbind(ba$x,ba$y),
data=data.frame(residuals(ma)))

spplot(grid.resid,col.regions=brewer.pal(7,'BlRd'),cuts=6)

Not bad for a binomial residual plot.  There don't appear to be any overall trends.  In general, residual plots for binomial models are ugly, since only a few actual values are possible (e.g. 0/2, 1/2,2/2 in a binomial model with n=2).  The map doesn't show much clustering, but that is what we expected after seeing the variogram.