Time Series
The ts data structure
Many time series functions come with the stock installation of R. It
may be helpful to store time series data as a ts structure data class
in order to take advantage of certain time series plotting and analysis
functions. The ts structure stores not only the dataseries, but also
the time the datum was taken, as well as the "frequency" of the data.
For example, for monthly data, the frequency would usually be
12, 4 for quarterly data, 24 for hourly data, etc.
Univariate Time Series
Plotting
library(TSA)
data(flow)
frequency(flow) # Obvious functions
start(flow)
end(flow)
cycle(flow) # A useful function to return (in this instance) the month
plot(flow,ylab='flow (cfs)')
title('Monthly flow for the Iowa River')
boxplot(split(flow,cycle(flow)),names=c('J','F','M','A','M','J','J','A','S','O','N','D'))
There are clearly outliers, it is unlikely that a Gaussian distrubution
accurately models the flow volumes. Nonetheless, all of our
models require Gaussianity, so we will proceed under the Gaussian
assumption.
f1 <- filter(flow,rep(1,5)/5,sides=2) # 1 quarter MA
f2 <- filter(flow,rep(1,13)/13,sides=2) # 1 year MA
f3 <- filter(flow,rep(1,12*5+1)/(12*5+1),sides=2) # 5 year MA
plot(cbind(f1,f2,f3),plot.type='single',col=4:2)
plot(flow, type="p", ylab="flow")
lines(ksmooth(time(flow), flow, "normal", bandwidth=5)) # 50% is within +/- .25*bw
Compare the appearance of the moving average filter (i.e. using a
boxcar kernel) and the Gaussian kernel smooth. Explain the cause of the
differences for this time series.
Visually, it also does not
appear that there is a trend. There does appear to be a "long
wave" cycle (8 years or so). Perhaps El Nino? I
would be skeptical of any attempt to model this long cycle (or even
test its existence) with such a short span of data. Note, the
length of the data is long for determining autocorrelation at the scale
of 1 month, but short for determining autocorrelation at the scale of 8
years. The notion of "long" is relative.
# Now plot the autocorrelation functions...
par(mfrow=c(2,1))
acf(flow,lag.max=4*12)
pacf(flow,lag.max=4*12)
Comment on the acf and pacf plots. What do these plots
suggest for the temporal structure of the data?
Modeling a univariate time series.
arma10 <- arima(flow,order=c(1,0,0))
arma20 <- arima(flow,season=c(2,0,0))
arma01 <- arima(flow,season=c(0,0,1))
s10arma10 <- arima(flow,order=c(1,0,0),season=c(1,0,0))
# Let's test successively longer seasonal ARs using the AIC criteria up to p=6
aic <- rep(0,6) #Create storage for the aic values
bic <- rep(0,6) #Create storage for bic values
for (p in 1:6){ # This will take a while
mod <- arima(flow,order=c(1,0,0),season=c(p,0,0))
k <- 2+p # num. of parameters (p+AR param + mean)
# aic[p] <- mod$aic This is the stock calculation
aic[p] <- -2*mod$loglik + 2*k # But easy enough to do on your own...
bic[p] <- -2*mod$loglik + k*log(k)
}
plot(aic)
lines(bic)
title(expression("AIC ARMA(1,0)x(p,0)"[12]*""))
Write out the s10arma10 model by hand, including the estimated
coefficient values.
At this point, I am not at all convinced that a seasonal ARMA is the
right thing to do. (The AIC goes up at 7. I don't
know if there are multiple minima, one would try this out past 7 to
find out, but that just takes too long for class.) The
results are telling us that what happens this month is directly related
to what happened six years ago on this this month.
Do you really believe that they are directly related? I
don't. Rethinking the model, let's try to
calculate a monthy effect, and de-trend the data that way.
monthly.mean<-tapply(flow[1:576],cycle(flow)[1:576],mean)
plot(monthly.mean)
new.data <- data.frame(flow=as.vector(flow),mean=monthly.mean[cycle(flow)])
new.data$demean <- new.data$flow - new.data$mean
flow2 <- ts(new.data,start=start(flow),frequency=12)
par(mfrow=c(2,1))
acf(flow2[,'demean'],main='demeaned Flow',lag.max=4*12)
pacf(flow2[,'demean'],main='demeaned Flow',lag.max=4*12)
This is better, I don't see any reason for seasonal effects after
detrending the data. Let's do arma on the detrended data.
arma10 <- arima(flow2[,'demean'],order=c(1,0,0))
arma20 <- arima(flow2[,'demean'],order=c(2,0,0))
arma01 <- arima(flow2[,'demean'],order=c(0,0,1))
arma11 <- arima(flow2[,'demean'],order=c(1,0,1))
arma21 <- arima(flow2[,'demean'],order=c(2,0,1))
Compare the aic values for these models. Which model would
you choose based on AIC alone?
FWIW, I don't like the AIC results here. They are not
intuitive. I have trouble explaining why the AR coefficients
should go up before they go down. Sometime, you have to ignore the
automatic procedures if you can't explain the results they are giving
you. I would stick with the ARMA(1,0), or perhaps the
ARMA(1,1). But probably, the ARMA(1,0), while the ARMA(1,1)
has a lower AIC value, the MA coefficient is not significantly
different from 0.
Forecasting
Forecasting is easy enough...
s20arma10 <- arima(flow,order=c(1,0,0),season=c(2,0,0))
sarma.pred <- predict(s20arma10,n.ahead=24)
plot(window(flow,start=2000),xlim=c(2000,2009),ylim=c(0,30000))
lines(sarma.pred$pred)
lines(sarma.pred$pred+1.96*sarma.pred$se,lty=2)
lines(sarma.pred$pred-1.96*sarma.pred$se,lty=2)
Clearly, the seasonal model is inadequate. Let's try forecasting the
demeaned data instead...
arma10 <- arima(flow2[,'demean'],order=c(1,0,0))
arma.pred <- predict(arma10,n.ahead=24)
# This predicts deviations from montly mean, so we need to add back the monthly mean
mean.pred <- monthly.mean[cycle(arma.pred$pred)]
arma.pred$pred <- arma.pred$pred + mean.pred
plot(window(flow,start=2000),xlim=c(2000,2009),ylim=c(0,30000))
lines(arma.pred$pred)
lines(arma.pred$pred+1.96*arma.pred$se,lty=2)
lines(arma.pred$pred-1.96*arma.pred$se,lty=2)
Thie predictions are better, but the standard errors are clearly
insufficient. Also, this is slightly inaccurate... we have
not considered the error in estimating the monthly means.
This should be added to the standard errors, making them even
wider! The best way to do this is with a full GLS model,
where the months are independent variables.
flow.dat <- data.frame(flow=as.vector(flow)) #Make a new data frame with the flow data
flow.dat$month <- factor(cycle(flow.dat)) # Add the month as a factor
# It is very important that the month is categorical, and not numeric! Why?
library(nlme)
gls.mod <- gls(flow~-1+month,correlation=corARMA(p=1), method="ML",data=flow.dat)
summary(gls.mod)
plot(window(flow,start=2000),xlim=c(2000,2009),ylim=c(0,30000))
lines(arma.pred$pred)
lines(arma.pred$pred+1.96*arma.pred$se+1.96*1080,lty=2)
lines(arma.pred$pred-1.96*arma.pred$se-1.96*1080,lty=2)
w = filter(resid(gls.mod), filter=c(1,-0.7617653,), sides=1) # get resids
w <- w[-1] # The first is NA, remove it
# This resids should be white noise...
acf(w)
p <- Box.test(w,12,type='Ljung')
# The null hypothesis is: at least one of the autocorrelation coefficients (out to 12) is not 0.
# The df is wrong, since it does not know these are filtered residuals
pchisq(p$statistic, 11, lower=FALSE)
Compare the coefficients from gls.pred with the monthly means, and the
estimated correlation coefficient with that from the arima of the
detrended data. The models are practically identical.
However, the standard errors for the coefficients in the gls
model can be trusted.
What would happen if the month were treated as numeric in the linear
regression?
Why is the seasonal component weaker in the Seasonal ARMA predictions than the regression predictions?
On your own
Graphically show that the data are heteroskesdic. (Hint, modify the
monthly mean calculation above to calculate monthly standad deviation).
Analyze the flow data, modeling the data after a log transform.
Be sure to do all of your calculations on the log data, and
make back transformation of the predictions and standard errors the
very last thing you do before plotting. Give only enough
evidence to explain your final model choice.
Aside: Transformation of the variable is one way to deal with
heteroskedasticity. Another method is what are called
autoregressive, conditional heteroskedasticity models (ARCH).
These models allow you to describe heteroskedasticity as a
function of covariates. For example, it would regress not
only the mean of the flow on the month, but it would also regress the
variance of the flow on the month. When might this be useful?
When the pattern of variance is different from the pattern of
the mean. For example, suppose that the spring and fall had
roughly the same mean flow, but spring flows were more variable.
Then one might consider an ARCH model in this instance.
ARCH models are standard practice now for some economic
series -- I have seen some ARCH applications in the physical sciences,
but I do not know if they are common.