Time Series

The ts data structure

Many time series functions come with the stock installation of R. It may be helpful to store time series data as a ts structure data class in order to take advantage of certain time series plotting and analysis functions. The ts structure stores not only the dataseries, but also the time the datum was taken, as well as the "frequency" of the data.  For example, for monthly data, the frequency would usually be 12, 4 for quarterly data,  24 for hourly data, etc.

Univariate Time Series

Plotting

library(TSA)
data(flow)
frequency(flow) # Obvious functions
start(flow)
end(flow)
cycle(flow) # A useful function to return (in this instance) the month

plot(flow,ylab='flow (cfs)')
title('Monthly flow for the Iowa River')
boxplot(split(flow,cycle(flow)),names=c('J','F','M','A','M','J','J','A','S','O','N','D'))

There are clearly outliers, it is unlikely that a Gaussian distrubution accurately models the flow volumes.  Nonetheless, all of our models require Gaussianity, so we will proceed under the Gaussian assumption.

f1 <- filter(flow,rep(1,5)/5,sides=2) # 1 quarter MA
f2 <- filter(flow,rep(1,13)/13,sides=2) # 1 year MA
f3 <- filter(flow,rep(1,12*5+1)/(12*5+1),sides=2) # 5 year MA
plot(cbind(f1,f2,f3),plot.type='single',col=4:2)

plot(flow, type="p", ylab="flow")
lines(ksmooth(time(flow), flow, "normal", bandwidth=5)) # 50% is within +/- .25*bw

Compare the appearance of the moving average filter (i.e. using a boxcar kernel) and the Gaussian kernel smooth. Explain the cause of the differences for this time series.  

Visually, it also does not appear that there is a trend.  There does appear to be a "long wave" cycle (8 years or so).  Perhaps El Nino?  I would be skeptical of any attempt to model this long cycle (or even test its existence) with such a short span of data.  Note, the length of the data is long for determining autocorrelation at the scale of 1 month, but short for determining autocorrelation at the scale of 8 years.  The notion of "long" is relative.


# Now plot the autocorrelation functions...
par(mfrow=c(2,1))
acf(flow,lag.max=4*12)
pacf(flow,lag.max=4*12)
Comment on the acf and pacf plots.  What do these plots suggest for the temporal structure of the data?

Modeling a univariate time series.


arma10 <- arima(flow,order=c(1,0,0))
arma20 <- arima(flow,season=c(2,0,0))
arma01 <- arima(flow,season=c(0,0,1))
s10arma10 <- arima(flow,order=c(1,0,0),season=c(1,0,0))
# Let's test successively longer seasonal ARs using the AIC criteria up to p=6
aic <- rep(0,6) #Create storage for the aic values
bic <- rep(0,6) #Create storage for bic values
for (p in 1:6){ # This will take a while
mod <- arima(flow,order=c(1,0,0),season=c(p,0,0))
k <- 2+p # num. of parameters (p+AR param + mean)
# aic[p] <- mod$aic This is the stock calculation
aic[p] <- -2*mod$loglik + 2*k # But easy enough to do on your own...
bic[p] <- -2*mod$loglik + k*log(k)
}
plot(aic)
lines(bic)
title(expression("AIC ARMA(1,0)x(p,0)"[12]*""))

Write out the s10arma10 model by hand, including the estimated coefficient values.

At this point, I am not at all convinced that a seasonal ARMA is the right thing to do.  (The AIC goes up at 7.  I don't know if there are multiple minima, one would try this out past 7 to find out, but that just takes too long for class.)  The results are telling us that what happens this month is directly related to what happened six years ago on this this month.   Do you really believe that they are directly related?  I don't.  Rethinking the model, let's try to calculate a monthy effect, and de-trend the data that way.

monthly.mean<-tapply(flow[1:576],cycle(flow)[1:576],mean)
plot(monthly.mean)
new.data <- data.frame(flow=as.vector(flow),mean=monthly.mean[cycle(flow)])
new.data$demean <- new.data$flow - new.data$mean
flow2 <- ts(new.data,start=start(flow),frequency=12)
par(mfrow=c(2,1))
acf(flow2[,'demean'],main='demeaned Flow',lag.max=4*12)
pacf(flow2[,'demean'],main='demeaned Flow',lag.max=4*12)
This is better, I don't see any reason for seasonal effects after detrending the data. Let's do arma on the detrended data.

arma10 <- arima(flow2[,'demean'],order=c(1,0,0))
arma20 <- arima(flow2[,'demean'],order=c(2,0,0))
arma01 <- arima(flow2[,'demean'],order=c(0,0,1))
arma11 <- arima(flow2[,'demean'],order=c(1,0,1))
arma21 <- arima(flow2[,'demean'],order=c(2,0,1))
Compare the aic values for these models.  Which model would you choose based on AIC alone?

FWIW, I don't like the AIC results here.  They are not intuitive.  I have trouble explaining why the AR coefficients should go up before they go down. Sometime, you have to ignore the automatic procedures if you can't explain the results they are giving you.  I would stick with the ARMA(1,0), or perhaps the ARMA(1,1).  But probably, the ARMA(1,0), while the ARMA(1,1) has a lower AIC value, the MA coefficient is not significantly different from 0.

Forecasting

Forecasting is easy enough...

s20arma10 <- arima(flow,order=c(1,0,0),season=c(2,0,0))
sarma.pred <- predict(s20arma10,n.ahead=24)
plot(window(flow,start=2000),xlim=c(2000,2009),ylim=c(0,30000))
lines(sarma.pred$pred)
lines(sarma.pred$pred+1.96*sarma.pred$se,lty=2)
lines(sarma.pred$pred-1.96*sarma.pred$se,lty=2)

Clearly, the seasonal model is inadequate. Let's try forecasting the demeaned data instead...

arma10 <- arima(flow2[,'demean'],order=c(1,0,0))
arma.pred <- predict(arma10,n.ahead=24)
# This predicts deviations from montly mean, so we need to add back the monthly mean
mean.pred <- monthly.mean[cycle(arma.pred$pred)]
arma.pred$pred <- arma.pred$pred + mean.pred
plot(window(flow,start=2000),xlim=c(2000,2009),ylim=c(0,30000))
lines(arma.pred$pred)
lines(arma.pred$pred+1.96*arma.pred$se,lty=2)
lines(arma.pred$pred-1.96*arma.pred$se,lty=2)

Thie predictions are better, but the standard errors are clearly insufficient.  Also, this is slightly inaccurate... we have not considered the error in estimating the monthly means.  This should be added to the standard errors, making them even wider!  The best way to do this is with a full GLS model, where the months are independent variables.

flow.dat <- data.frame(flow=as.vector(flow)) #Make a new data frame with the flow data
flow.dat$month <- factor(cycle(flow.dat)) # Add the month as a factor
# It is very important that the month is categorical, and not numeric! Why?
library(nlme)
gls.mod <- gls(flow~-1+month,correlation=corARMA(p=1), method="ML",data=flow.dat)
summary(gls.mod)
plot(window(flow,start=2000),xlim=c(2000,2009),ylim=c(0,30000))
lines(arma.pred$pred)
lines(arma.pred$pred+1.96*arma.pred$se+1.96*1080,lty=2)
lines(arma.pred$pred-1.96*arma.pred$se-1.96*1080,lty=2)

w = filter(resid(gls.mod), filter=c(1,-0.7617653,), sides=1) # get resids
w <- w[-1] # The first is NA, remove it
# This resids should be white noise...
acf(w)
p <- Box.test(w,12,type='Ljung')
# The null hypothesis is: at least one of the autocorrelation coefficients (out to 12) is not 0.
# The df is wrong, since it does not know these are filtered residuals
pchisq(p$statistic, 11, lower=FALSE)


Compare the coefficients from gls.pred with the monthly means, and the estimated correlation coefficient with that from the arima of the detrended data.  The models are practically identical.  However, the standard errors for the coefficients in the gls model can be trusted.

What would happen if the month were treated as numeric in the linear regression?

Why is the seasonal component weaker in the Seasonal ARMA predictions than the regression predictions?

On your own

Graphically show that the data are heteroskesdic. (Hint, modify the monthly mean calculation above to calculate monthly standad deviation). Analyze the flow data, modeling the data after a log transform.  Be sure to do all of your calculations on the log data, and make back transformation of the predictions and standard errors the very last thing you do before plotting.  Give only enough evidence to explain your final model choice.

Aside: Transformation of the variable is one way to deal with heteroskedasticity.  Another method is what are called autoregressive, conditional heteroskedasticity models (ARCH).  These models allow you to describe heteroskedasticity as a function of covariates.  For example, it would regress not only the mean of the flow on the month, but it would also regress the variance of the flow on the month.  When might this be useful?  When the pattern of variance is different from the pattern of the mean.  For example, suppose that the spring and fall had roughly the same mean flow, but spring flows were more variable.  Then one might consider an ARCH model in this instance.  ARCH models are standard practice now for some economic series -- I have seen some ARCH applications in the physical sciences, but I do not know if they are common.