Cannabis data lacking, but machine learning could help fill the gap

THC and CBD.

Anyone who has used, sold, studied or even read much about marijuana likely recognizes these acronyms as active ingredients in the plant.

But beyond intoxicating tetrahydrocannabinol (THC) and therapeutic cannabidiol (CBD), there exists a diverse array of chemicals believed to quietly interact – a phenomenon known as the ‘entourage effect’ – influencing how each unique cannabis strain makes people feel.

To date, the cannabis industry has collected remarkably little data about those lesser-known compounds, new University of Colorado Boulder research shows. But that same study, published this month in the journal PLOS ONE, suggests that a surprising scientific field could play an integral role in filling the knowledge gap.

“This paper provides a very early example of how applying advanced data science techniques could give us new insight into how this plant works,” said senior author Brian Keegan, an assistant professor in the Department of Information Science.

A problem of missing data

Ask a dispensary bud tender for advice and it’s not uncommon for them to make generalizations, recommending, for instance, Cannabis sativa varieties for an energetic high, or Cannabis indica for a relaxing effect.

Variety names like Girl Scout Cookies or Gorilla Glue give the impression of standardization – buy it in one place and you’ll get the same product as if you buy it elsewhere, many assume.

Biologist Daniela Vergara studies the genetics of cannabis.

But that’s often not the case, says study first author Daniela Vergara, a research associate in the Department of Ecology and Evolutionary Biology.

Different flavonoids and terpenes can make seemingly similar varieties taste and smell different, and secondary cannabinoids may influence whether it’s relaxing or stimulating, sedating or creativity-inspiring.

The only way to truly know what’s in a variety is to measure the chemicals.

“But because regulations only require reporting on a few compounds like THC and CBD, there’s very little data being collected on these other compounds or how they interact,” said Vergara. “We’re not getting the whole picture.”

With medical or recreational marijuana now legal in 39 states, and sales in Colorado alone topping $1.7 billion in 2019, filling those knowledge gaps is more important than ever, potentially leading to product standardization or new therapies based on the entourage effect, the authors said.

In hopes of getting the full picture on the plant, Vergara teamed up with Keegan to analyze a dataset of more than 17,600 cultivars of cannabis flower, supplied by one of the country’s largest cannabis testing companies, over eight years.

When assessing how much data was available on seven different cannabinoids, the researchers found – not surprisingly – that only 1.4% of cultivars were missing data about THC and 38% were missing data about CBD. Only 153 samples contained data on all seven cannabinoids, and some were almost never measured.

For instance, only 597 samples, fewer than 4%, contained information about CBDV (cannabidivarin), a non-psychoactive compound believed to quell seizures. And 62% of samples were missing data about CBN (cannabinol), a compound often recommended for sleep.
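
To make the bookkeeping behind those percentages concrete, here is a minimal sketch of how the share of missing measurements per compound might be tallied. The four-row table, the `samples` name and both helper functions are invented for illustration; they are not the study's data or code.

```python
# Toy lab results: None marks a compound that was not measured.
# These values are invented for illustration, not taken from the study.
samples = [
    {"THC": 21.0, "CBD": 0.1, "CBN": None, "CBDV": None},
    {"THC": 18.5, "CBD": None, "CBN": 0.2, "CBDV": None},
    {"THC": 0.4, "CBD": 12.0, "CBN": None, "CBDV": 0.3},
    {"THC": None, "CBD": 0.2, "CBN": None, "CBDV": None},
]

COMPOUNDS = ["THC", "CBD", "CBN", "CBDV"]


def missing_share(rows, compound):
    """Fraction of rows with no measurement for the given compound."""
    missing = sum(1 for row in rows if row.get(compound) is None)
    return missing / len(rows)


def complete_rows(rows, compounds):
    """Count rows in which every listed compound was measured."""
    return sum(
        all(row.get(c) is not None for c in compounds) for row in rows
    )


shares = {c: missing_share(samples, c) for c in COMPOUNDS}
n_complete = complete_rows(samples, COMPOUNDS)

for compound, share in shares.items():
    print(f"{compound}: {share:.0%} of samples missing")
print(f"Samples with every compound measured: {n_complete}")
```

Even in this tiny table, no sample has all four compounds measured, which mirrors on a small scale how few complete profiles the researchers found.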

Enter machine learning.

“We thought that data science methods could help with what is fundamentally a missing data problem,” said Keegan. “Could we use the data we have about the chemical profiles of some strains to impute, or guess, the values of those where we have no data?”
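
The idea of imputing a missing value from similar samples can be sketched in a few lines. The example below uses nearest-neighbor imputation, chosen here only as a simple stand-in; the article does not say which imputation method the researchers used, and the data, the `knn_impute` function and its parameters are all hypothetical.

```python
import math

# Invented cannabinoid percentages; None marks a missing CBN measurement.
rows = [
    {"THC": 20.0, "CBD": 0.1, "CBN": 0.30},
    {"THC": 22.0, "CBD": 0.2, "CBN": 0.50},
    {"THC": 1.0, "CBD": 12.0, "CBN": 0.05},
    {"THC": 21.0, "CBD": 0.1, "CBN": None},
]


def knn_impute(rows, target, k=2):
    """Fill missing `target` values with the mean of the k most similar rows.

    Similarity is the root-mean-square difference over the other compounds
    that were measured in both rows. Returns new dicts; input is not modified.
    """
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        new_row = dict(row)
        if row[target] is None:
            scored = []
            for donor in donors:
                shared = [
                    key for key in row
                    if key != target
                    and row[key] is not None
                    and donor[key] is not None
                ]
                if not shared:
                    continue  # nothing to compare on, skip this donor
                dist = math.sqrt(
                    sum((row[key] - donor[key]) ** 2 for key in shared)
                    / len(shared)
                )
                scored.append((dist, donor[target]))
            scored.sort(key=lambda pair: pair[0])
            nearest = [value for _, value in scored[:k]]
            if nearest:
                new_row[target] = sum(nearest) / len(nearest)
        filled.append(new_row)
    return filled


filled = knn_impute(rows, target="CBN", k=2)
print(filled[3])
```

Here the last sample resembles the two high-THC samples, so its CBN value is estimated as the average of theirs. A guess like this is only as good as the complete samples it borrows from, which is why the size of the shared dataset matters.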

The trouble with names

Using algorithms and statistical methods, the team set out to uncover hidden patterns found in the data. Quickly, they learned that one of their key assumptions was wrong.

In the plant, THCA and CBDA (acidic forms of the cannabinoids that convert to THC and CBD with heat) both compete for the same precursor molecule, cannabigerolic acid (CBGA). So the researchers assumed strains high in THC would be low in CBD, or vice versa.

“It didn’t turn out that way,” said Keegan, noting that some strains were high in both. “This suggests we don’t know as much about these chemical pathways as we thought we did.”

Using a method called dimensionality reduction, they were able to cluster strains into four distinct categories based on chemical properties, each of which corresponded with different use cases (medicinal, recreational, combined, industrial).
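
As a rough illustration of that workflow, the sketch below projects each chemical profile onto its first principal component (one common form of dimensionality reduction) and then groups the projected values with a tiny k-means routine. The article does not name the specific algorithms the team used, so both choices are assumptions, as are the toy profiles and every function name. The toy data has only two groups, not the four the researchers reported.

```python
import math

# Invented profiles: [THC %, CBD %, CBG %] for six hypothetical samples.
profiles = [
    [22.0, 0.1, 0.5],
    [20.0, 0.2, 0.4],
    [24.0, 0.1, 0.6],
    [1.0, 14.0, 0.2],
    [0.5, 12.0, 0.1],
    [0.8, 15.0, 0.3],
]


def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(cov, iterations=200):
    """Leading eigenvector of a covariance matrix via power iteration."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def kmeans_1d(values, k=2, iterations=20):
    """Tiny 1-D k-means (k >= 2); centers start evenly spaced from min to max."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: abs(v - centers[c])) for v in values
        ]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


centered = center(profiles)
axis = top_component(covariance(centered))
# Project every profile onto the single axis that captures the most variance.
scores = [sum(x * a for x, a in zip(row, axis)) for row in centered]
labels = kmeans_1d(scores, k=2)
print(labels)
```

The grouping depends only on measured chemistry, not on what a sample is called, which is how two samples sharing a name can land in different clusters.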

Curiously, some varieties with the same name showed up in different clusters.

“This study reaffirms the misnaming of Cannabis varieties by the industry,” the authors noted. “Strain name is not indicative of potency or overall chemical makeup.”

Filling the blanks

Going forward, Keegan will continue using machine learning to fill gaps in the data. But to do it right requires widespread cannabis industry collaboration.

Data scientist Brian Keegan is applying machine learning to fill in gaps in understanding about cannabis.

“If more people would share more of their data, we could make better inferences about how these different cannabinoids work or interact with each other,” he said.

He envisions a day when custom products could be developed for medical use based on the complex entourage effect of interacting compounds. Dispensary customers could review an ingredient panel, much like the nutrition facts panel on food, before buying. And names would mean something.

“Machine learning has played a huge role in shaping other industries, from Facebook and Twitter to Target,” said Vergara. “It can help fill in the blanks for the cannabis industry as well.”