Disclaimer:

This project is only a demonstration of a use case for classification models and machine learning in data science, applied to often-overlooked data. The subject matter of the underlying data is not a promotion of cannabis use, and Rakeeb does not condone its use in any way, shape, or form.

 

KNN Classification Model of Cannabis Sativa/Indica Strains

 

Abstract

As cannabis has become fully legalized in 18 US states, and authorized for medical use in many others, several companies have arisen aiming to cultivate the crop. An important aspect of assessing different strains of cannabis is their common classification as “sativa,” “indica,” or “hybrid” (a cross-strain mix of indica and sativa). Allegedly, each archetype of strain is characterized not only by its THC-content levels, but also by its effects on the user after consumption.

Thus, this project aims to build a K-Nearest Neighbors (KNN) classification model that uses THC-level data and user-reported sensation data to classify cannabis strains into 1 of 3 categories: hybrid, indica, or sativa.

The final KNN model ultimately had an accuracy rate of ~59%. Confusion matrix analysis shows that the model was decent at classifying hybrid strains, but faltered at determining indica and sativa strains. Further analysis suggests that a likely reason for these results was a lack of data for indica and sativa strains, demonstrating the importance of sample size in data science along with the limitations of classification models.

 
 

Introduction

Many states in recent years have begun full legalization of cannabis for both medical and recreational use, and the industry has naturally seen several companies open to begin the cultivation and sale of cannabis. In this burgeoning industry, much like in the alcohol industry, many firms aim to distinguish themselves from competitors by promoting their own variations and strains of cannabis that allegedly have different effects on users. Moreover, because cannabis is a highly regulated product, many legal states require firms to report data on their strains to government authorities in order to maintain legal operating status.

In fact, an important and popular method of classifying cannabis plant strains is into 1 of 3 categories: “sativa,” “indica,” or “hybrid” (a cross-strain mix of sativa and indica). Supposedly, each variation has characteristically different THC levels (THC is an active chemical in cannabis that induces various psychological and physiological effects) and effects on users post-consumption. This raises a question: could one build a model that takes such data and classifies a given cannabis strain? Such a model could offer another avenue to accurately determining strain types, along with identifying newly developed strains in the industry.

This endeavor is a natural use case for a K-Nearest Neighbors (KNN) classification model. A KNN model takes in numeric independent variables and aims to classify each data point into a category (a categorical dependent variable). Moreover, an important trait of KNN is that it employs supervised machine learning: the model assumes that the valid categories are given or already known from the get-go, which is the case for cannabis strain types.


The data for this model comes from a public Kaggle dataset (source at the end of this page) on various strains, with fields ranging from name, strain type, and THC level to user-reported ratings of various psychological effects.

And so, the plan for this project is as follows:

  • Use dimensionality reduction (through Principal Component Analysis) to combat the curse of dimensionality and multicollinearity issues in the original dataset.

  • Use the newly transformed data and reduced variables to train a KNN model.

  • Determine an optimal number of neighbors (K) for the model, assessed by the Receiver Operating Characteristic (ROC) curve’s Area-Under-Curve (AUC) score.

  • Assess the updated model by its accuracy and confusion matrix.

 
 

Data Exploration and Cleaning

As with any data science project, the first task is to investigate the dataset and clean it up.

The cannabis dataset has a raw sample size of 4,762 strains and 63 columns, or variables. Each strain entry contains variables such as: name of strain, type (its indica, sativa, or hybrid designation), THC level (on a 0-100% scale), and a slew of user-reported ratings on various effects post-consumption. Examples of such effects are relaxation, dry mouth, paranoia, energy, etc., each rated on a numeric 0-100 scale.
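
For illustration, a first look at the data might resemble the following Python sketch (the filename and column names here are assumptions, not the dataset’s exact schema):

    import pandas as pd

    # Load the raw Kaggle export (filename assumed for illustration).
    df = pd.read_csv("cannabis.csv")

    print(df.shape)                   # roughly (4762, 63), per the description above
    print(df["Type"].value_counts())  # counts of hybrid/indica/sativa labels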


On further reflection, there are several concerns with feeding the raw data straight into a KNN model. First is the curse of dimensionality. There are dozens of independent variables (the user-reported effects), and any model given so many variables will invariably fit the data better. But we cannot be certain whether the model is fitting to actual, meaningful signal or to noise in the dataset. In other words, there is a major concern that the model will overfit the data with so many independent variables. Moreover, KNN models in particular call for a relatively small number of input variables, both to prevent such overfitting and because distances become less informative in high-dimensional spaces.

Next, we have concerns about multicollinearity, which is when several input variables are highly correlated with one another. One can imagine this being the case in this dataset. Take, for example, the two variables paranoia and anxiety: if a user reports being paranoid after consuming a certain strain, it is not hard to imagine that they would also say the strain made them anxious. The two concepts are closely related, and would be better described by a reduced set of variables that are orthogonal (uncorrelated) to one another, preventing that multicollinearity.
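
One quick way to confirm this concern is to scan the pairwise correlations among the effect ratings. A sketch, assuming those ratings have been pulled into a numeric DataFrame named effects:

    import numpy as np

    # `effects` is assumed to hold only the numeric effect-rating columns.
    corr = effects.corr().abs()

    # Keep the upper triangle (each pair counted once, diagonal excluded)
    # and count how many pairs are strongly correlated.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    print((upper > 0.8).sum().sum(), "pairs of effects with |r| > 0.8")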


In light of all of this, the dataset is a prime candidate for dimensionality reduction prior to model construction. Using one such technique, principal component analysis (PCA), we can combat the issues of multicollinearity and the curse of dimensionality by transforming the original dataset into a reduced set of new variables that still retain the underlying meaning captured in the original data.

Since dimensionality reduction was covered in a previous project, an abridged version is presented here, along with the final results that will be fed into the KNN model.

 
 

PCA of Cannabis Dataset (Abridged)



Prior to running the PCA, the columns containing user-reported effect data were separated from the other variables (namely THC level). The reason for this is that the THC-level data already appears orthogonal to the “sensation” variables (it is, after all, a purely empirical measure of the THC content of the strain).

This leaves us with 28 sensation variables, with ratings on effects such as, but not limited to, “happiness,” “anxiety,” “hunger,” and “stress.” After normalizing the numeric data into Z-scores, the data was fed into PCA to generate the eigenvalue matrix, the loadings matrix, and the transformed data on a new numeric scale.
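
A minimal sketch of this step with scikit-learn, reusing the effects DataFrame assumed earlier (the original toolchain may differ):

    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Normalize each sensation variable to a Z-score, then fit the PCA.
    z = StandardScaler().fit_transform(effects)
    pca = PCA().fit(z)

    eigenvalues = pca.explained_variance_  # one eigenvalue per component (scree plot)
    loadings = pca.components_             # rows = PCs, columns = original variables
    scores = pca.transform(z)              # the data re-expressed in PC coordinates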


Scree plot analysis using Horn’s criterion reveals that 5 principal components are meaningful in this dataset.

Figure 1- Scree Plot + Noise Distribution Bar Chart; any PCs that exceed the noise distribution (visualized as the green splotch across the graph) are considered meaningful
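
Horn’s criterion (also called parallel analysis) builds that noise distribution explicitly: run the same PCA on random data of the same shape many times over, and keep only the real components whose eigenvalues beat the noise. A sketch, continuing from the PCA code above:

    import numpy as np
    from sklearn.decomposition import PCA

    n_obs, n_vars = z.shape
    rng = np.random.default_rng(0)

    # Eigenvalues from PCA fitted to pure standard-normal noise, repeated 100 times.
    noise_eigs = np.array([
        PCA().fit(rng.standard_normal((n_obs, n_vars))).explained_variance_
        for _ in range(100)
    ])

    # A component is meaningful if its eigenvalue beats the noise's 95th percentile.
    threshold = np.percentile(noise_eigs, 95, axis=0)
    n_meaningful = int((eigenvalues > threshold).sum())  # 5, per the scree plot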

Now we must relabel each of these PCs by analyzing their respective rows of the loadings matrix, along with which factors from the original dataset their eigenvectors “point” toward. An example loading plot, for Principal Component 1, is below:

Figure 2- Loading Matrix Plot of Principal Component 1; note how the eigenvector shows a stronger, positive affinity for certain factors over others. This is how we determine what to relabel the chosen PCs

In the end, based on their loading matrix analysis, the 5 new variables, or principal components, are labeled as follows:

  • PC 1 = Bliss (related to the variables relaxation and dry mouth)

  • PC 2 = Excitement (related to the variables lack-of-hunger and energetic)

  • PC 3 = Flow (related to the variables focus and dry eyes)

  • PC 4 = Fulfillment (related to the variables happiness and headaches)

  • PC 5 = Mania (related to the variables nausea, energetic, and creativity)


Now, with our reduced set of variables relabeled, we can adjoin the new data in a single dataset to the original columns we retained: name of strain, type, and THC level. A snippet of the data is below and exemplifies what we will work with moving forward:

Figure 3- Snippet of the new data frame with the post-dimensionality-reduction variables included


In the end, we now have a dataset with a sample size of 1,603 entries, 6 independent-variable columns (THC level, Bliss, Excitement, Flow, Fulfillment, and Mania), 1 dependent-variable column (strain type), and the name of each strain. The dataset is now ready for KNN modeling.
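
As a sketch of that adjoining step (column names assumed; the drop from 4,762 to 1,603 rows presumably reflects entries removed during cleaning):

    import pandas as pd

    # Wrap the first 5 PC score columns in a DataFrame with their new labels.
    pc_labels = ["Bliss", "Excitement", "Flow", "Fulfillment", "Mania"]
    pcs = pd.DataFrame(scores[:, :5], columns=pc_labels, index=effects.index)

    # Re-attach the retained columns (names assumed) and drop incomplete rows.
    model_df = pd.concat([df[["Name", "Type", "THC-level"]], pcs], axis=1).dropna()
    print(len(model_df))  # 1,603 entries, per the text above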

 
 

KNN Model Part 1: KNN Overview + Generating Model for Arbitrary K

Now that we have our dataset ready, we can finally start building our classification model. We will treat THC level and the 5 reduced “sensation” variables as our inputs in order to classify a given strain data point into 1 of 3 categories: hybrid, indica, or sativa.

Given that our dependent variable’s categories are already defined, we can use a supervised classification model here, and KNN should suffice. KNN is a supervised, “lazy” machine learning algorithm that classifies a data point by calculating its distance in coordinate space to its K nearest “neighbors” and taking a vote among them.

As a result, one of the most important parameters to set in a KNN model is the value of K: how many neighboring data points should be considered when classifying a given point. Moreover, K should be an odd number, as a data point’s classification can be indeterminate with an even value of K; with K = 4, for instance, the four nearest neighbors could split evenly between two classes, leaving the point exactly on the frontier of 2 or more classifications.


For a first pass, we will use an arbitrary value of K for the sake of demonstration. After splitting our dataset into training and test sets (at a composition of 70%:30% training:test data) and using a K of 5, our KNN model returns the following summary and confusion matrix:

Figure 4- Summary of KNN model where K = 5

Figure 5- Confusion Matrix of KNN model where K = 5. Columns left to right are: Hybrid, Indica, and Sativa; rows top to bottom are in the same order
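
For reference, a minimal sketch of this first pass with scikit-learn, carrying over the column names assumed above (the original code may differ):

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    X = model_df[["THC-level", "Bliss", "Excitement", "Flow", "Fulfillment", "Mania"]]
    y = model_df["Type"]

    # 70:30 train:test split, then fit a KNN classifier with K = 5.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                        random_state=0)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

    pred = knn.predict(X_test)
    print(accuracy_score(y_test, pred))    # ~0.56, per the results below
    print(confusion_matrix(y_test, pred))  # rows/columns ordered hybrid, indica, sativa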


Furthermore, this model has a reported accuracy rate of ~56%. In other words, the model correctly identified the type of 56% of the strains in the test set.

Now the question remains: how do we determine whether or not these results were “optimal”?

 
 

KNN Model Part 2: K-Neighbors Selection by AUC Score

One robust method to determine the quality of a classification model is to use the area under the curve (AUC) of the Receiver Operating Characteristic (ROC) curve. The ROC curve plots a model’s sensitivity, or true positive rate, as a function of its false positive rate. The central idea behind this metric is that one would like to minimize the number of false positives a classifier calls while maximizing true positive calls. The final score is determined by simply calculating the area under the curve this function forms.

Thus, this AUC score can be calculated across many values of K, and we can then simply select the odd value of K at which the score is maximized.
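
A sketch of that sweep, continuing from the model code above (one-vs-rest, macro-averaged AUC is used here, as the original averaging scheme is not stated):

    from sklearn.metrics import roc_auc_score
    from sklearn.neighbors import KNeighborsClassifier

    auc_by_k = {}
    for k in range(1, 42):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        proba = knn.predict_proba(X_test)
        # Multiclass AUC via one-vs-rest, macro-averaged over the 3 classes.
        auc_by_k[k] = roc_auc_score(y_test, proba, multi_class="ovr")

    # Best odd K (even values are excluded to avoid tied votes).
    best_k = max((k for k in auc_by_k if k % 2 == 1), key=auc_by_k.get)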


Running this analysis for K values ranging from 1-41 (41 ends the range because a rule of thumb caps K near the square root of the sample size, and here n = 1,603, whose square root is ≈ 40), we get the following plot:

Figure 6- AUC scores for the KNN model with K ranging from 1-41

From the plot, we can clearly see that our original, arbitrarily chosen value of K = 5 was not an optimal choice; several other values of K generated a better AUC score. In fact, the AUC score was maximized at K = 18, where the AUC score = 0.5870. However, as mentioned before, we cannot use an even value for K, so we go with the next-best odd value of K. Here, that was determined to be K = 17, with an AUC score of 0.5820.

 
 

KNN Model Part 3: Final Model Creation + Assessment

Now that we have determined an optimized value of K at 17, we can rerun our model with the same parameters to get our final results:

Figure 7- Summary of KNN model where K = 17

Figure 8- Confusion Matrix of KNN model where K = 17. Columns left to right are: Hybrid, Indica, and Sativa; rows top to bottom are in the same order


This time, the accuracy rate was calculated to be ~59%.

We can see immediately, just based on the accuracy rate, that this model performed a bit better than the one with the arbitrarily chosen K. However, a 59% accuracy rate is still mediocre, and a closer look at both the summary and the confusion matrix can give some more qualitative insight into this model’s performance.

The summary of the model reports that the KNN model correctly identified 90% of hybrid strains, but was terrible at identifying indicas and sativas, at rates of 20% and a disconcerting 1%, respectively. Indeed, the confusion matrix shows that the model incorrectly categorized most of the data points as hybrids.
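
Those per-class rates are recall values, which scikit-learn’s classification report computes directly from the test set, along with the support column discussed next. A sketch:

    from sklearn.metrics import classification_report
    from sklearn.neighbors import KNeighborsClassifier

    knn = KNeighborsClassifier(n_neighbors=17).fit(X_train, y_train)
    print(classification_report(y_test, knn.predict(X_test)))
    # The recall column corresponds to the ~90% / 20% / 1% rates above; the
    # support column shows how many test strains fall into each class.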

Additionally, the support column, or the sample size of each category in the test set, shows an uneven distribution of hybrid data points to indicas to sativas, with hybrids overrepresented. And since the training set had a similar composition, this leads us to think that there simply was not enough representation of indica and sativa strains in the overall dataset.

As a result, the model was likely trained very well at identifying hybrid strains, owing to the fact that they were the most frequent strain type in the dataset. The underrepresentation of indicas and sativas, however, left the model less able to identify them simply because it did not have enough information, so it lumped them in as hybrids, dragging the overall accuracy of the model down with it.

 
 

Concluding Remarks

Through this project, we have taken a dataset of cannabis strains, with their associated physiological/psychological effects and THC levels, and created a rudimentary classification model to categorize data points into hybrid, indica, or sativa classes. After reducing the original dataset through PCA, and using AUC scores to determine an optimal value for K, our final KNN model returned an accuracy rate of 59%. The mediocre performance of this model is likely due to the overrepresentation of hybrid strains in the dataset, while indica and sativa strains were underrepresented, making the model more likely to lump indicas and sativas in as hybrids. Indeed, the model correctly identified 90% of hybrid strains, but only 20% of indicas and 1% of sativas.

In light of this, the most natural next step would be simply to feed more indica and sativa strain data into the model, in order to train it better at identifying them. Moreover, this point illuminates one of the most crucial aspects of data science: the sheer importance of proper sampling. In this project, the underrepresentation of a certain segment of the data left the KNN model particularly bad at identifying it. The most robust remedy would have been to make the sample more representative from the start; no amount of fiddling after the fact can change the fact that the model simply needed more information.


However, the nature of the dataset’s underlying topic, cannabis, poses some issues for future sampling, namely the obvious fact that cannabis remains illegal or highly restricted in several US states. Collecting data on such a topic would likely be a difficult endeavor. This unique sampling challenge applies not only to severely restricted commodities, but also to other areas of interest: from people with rare diseases, to authoritarian countries that heavily censor internal affairs, to rare archaeological artifacts. Such constraints pose limitations for any model or data science analysis, and are issues a budding industry such as the cannabis industry will encounter in analytics.


Data Source:

Rosa, Gustavo. Leafly: Cannabis Strains Meta-Data. Version 2. https://www.kaggle.com/datasets/gthrosa/leafly-cannabis-strains-metadata.