Published: Sept. 4, 2018

Partnership with Zillow offers unprecedented ability to track where people have lived since 1810


Take a plot of land—maybe the one your house, your school or your business sits on—and imagine the way it’s changed over the last 200 years. From the very first settlers and their primitive structures to the modern homes and office complexes that likely exist there now, that one small piece of land has evolved with the times.

Two researchers at the University of Colorado Boulder are exploring human settlement and urbanization patterns in the United States between 1810 and 2015 using a groundbreaking new dataset from Zillow, the online real estate and rental marketplace. A paper describing the initial data products the researchers created using Zillow’s information was published today in the open-access online journal Scientific Data.

While the raw data themselves are proprietary to Zillow, the CU researchers are able to share the so-called data derivatives they have created with the research community and the public. 

These and future data products will almost certainly serve as a launch pad for a slew of novel research projects related to natural hazards, land-use changes, ecology, demography, urban geography and more, said Stefan Leyk, a CU Boulder associate professor of geography who co-authored the paper, titled “HISDAC-US, historical settlement data compilation for the conterminous United States over 200 years,” with geography graduate student Johannes Uhl.

“This is unique data that simply never existed before in this dimension,” Leyk said. “We can go back more than 200 years across most of the United States to understand where people have settled at which point in time. That’s something we simply have never, ever seen before. But before we can start on any research projects, we have to actually write data products that we can use for research that will come in the near future.”

Somewhat on a whim, Leyk and Uhl reached out to Zillow roughly two years ago to ask whether the company would collaborate with them by sharing its massive cache of property data. Though they didn’t know exactly what the data looked like, they were intrigued and excited about what they might discover.

After working together on an agreement about how the information could be used and shared, Leyk and Uhl set to work sifting through Zillow’s Transaction and Assessment Dataset—or ZTRAX for short—which contained more than 374 million data records. 

In essence, Zillow had been collecting property records from as many U.S. counties as possible, dating back to the earliest structure built on each parcel of land. Zillow created its database with information from a major third-party data provider and from an internal company initiative called County Direct, which is gathering data from assessors’ and recorders’ offices across the country.

Leyk had once attempted a similar undertaking himself but found it extraordinarily time- and labor-intensive. With more than 3,100 counties in the United States, assembling Zillow’s dataset was “a tremendous effort,” Leyk said.

For its part, Zillow understands the value of partnering with academic researchers to help comb through and analyze its massive collection of information.

“Zillow has a huge treasure trove of really fascinating data, and there’s a lot of important research we can do with it,” said Sarah Mikhitarian, senior economist at Zillow. “Several members of our economic research team come from an academic background and are interested in the type of research that some of these other organizations are pursuing. We don’t always have the time and resources to do it, though, so it’s great to collaborate with outside researchers who do.”

After designing a data structure and extraction workflow, Leyk and Uhl sorted the data into 250-by-250-meter plots of land. They also sorted the data over time, looking at each plot at five-year intervals between 1810 and 2015.

With this information, they were able to sum up how much indoor building area accumulated on each plot in a given year, which indicates how intensely the land has been developed. The researchers also determined the year of the first settlement for each plot.
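For readers curious how such a gridding step might look in practice, here is a minimal Python sketch. It is not the researchers' actual workflow: the record layout (projected coordinates in meters, a year built and an indoor area), the function names and the sample values are all assumptions made for illustration. It assigns each building record to a 250-meter cell, then reports the earliest built year per cell and the indoor building area accumulated by each five-year step.

```python
from collections import defaultdict

CELL_SIZE_M = 250               # plot size described in the article
YEARS = range(1810, 2016, 5)    # five-year steps from 1810 through 2015


def cell_of(x_m, y_m, cell_size=CELL_SIZE_M):
    """Map projected coordinates (meters) to a (column, row) cell index."""
    return (int(x_m // cell_size), int(y_m // cell_size))


def summarize(records):
    """Aggregate building records per grid cell.

    `records` is an iterable of (x_m, y_m, year_built, indoor_area_m2)
    tuples. This schema is hypothetical, not the ZTRAX record format.
    Returns (first_year, building_area), where building_area[cell][step]
    is the indoor area of all buildings built in or before `step`.
    """
    by_cell = defaultdict(list)
    for x_m, y_m, year_built, area in records:
        by_cell[cell_of(x_m, y_m)].append((year_built, area))

    first_year = {}
    building_area = {}
    for cell, buildings in by_cell.items():
        # Earliest built year recorded in the cell.
        first_year[cell] = min(year for year, _ in buildings)
        # Cumulative indoor area at each five-year step.
        building_area[cell] = {
            step: sum(area for year, area in buildings if year <= step)
            for step in YEARS
        }
    return first_year, building_area


# Made-up sample records: two buildings share one cell, a third sits next door.
sample = [
    (100.0, 120.0, 1852, 140.0),
    (240.0, 10.0, 1901, 200.0),
    (300.0, 10.0, 1978, 90.0),
]
first_year, building_area = summarize(sample)
print(first_year[(0, 0)], building_area[(0, 0)][1905])
```

A real pipeline would also need to handle records with missing built years or coordinates and would likely write the results to raster files, which this sketch leaves out.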

“We really can understand, in incredible detail, how did we occupy the landscape? What are the potential impacts because we settled in certain regions? What happened to wetlands and hydrological systems?” Leyk said, noting that the data could prove useful for interdisciplinary research related to fire- and flood-risk modeling, for example.

The data products created by Leyk and Uhl are now part of the public domain and are accessible to other researchers through the Harvard Dataverse, an open-source data repository. 

The CU researchers funded this initial project with a seed grant from the CU Population Center. Now that they have arranged the Zillow data into useful formats, they plan to go after larger federal grants from the National Science Foundation or National Institutes of Health for further research. 

Both Leyk and Uhl said they have been impressed with Zillow’s willingness to collaborate with academia and make this extremely valuable data accessible to the world.

“This project shows the benefits of collaboration between industry and research institutions,” Uhl said.