Deadly or Delicious?

Sarah Stevens
5 min read · May 5, 2021

Identifying poisonous mushrooms with machine learning.

For those into mushroom hunting, knowing the difference between a poisonous and an edible variety is essential. But what if you’re just getting into the hunt yourself? Parsing through each ‘shroom’s properties leaves plenty of room for error — and this isn’t something you want to take your chances with! Luckily, machine learning provides a reliable alternative.

We will seek to answer the following questions with our analysis:

  1. Can we distinguish between the 23 mushroom species based on similar traits?
  2. Are there certain traits that are more important for identifying edibleness than others?
  3. How accurately can the model distinguish between poisonous and edible mushrooms using the traits in the dataset?

Data Collection & Prep

Data for this exploration comes from the Kaggle Mushroom Classification dataset, originally sourced from the UCI Machine Learning repository. Over 8,000 samples are included from 23 species of gilled mushrooms, and each is coded as either edible or poisonous. Characteristics such as gill size and color, habitat, and cap shape and color are included.

All data in this set is categorical, and was originally coded as such with letters. Since our models require numerical data (even if categorical), the letters were converted to numbers corresponding to their position in the alphabet.
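The letter-to-number conversion can be sketched as follows. This is a minimal illustration, not the original preprocessing script; the column names and sample values are assumptions based on the dataset description.

```python
import pandas as pd

# Stand-in for a few rows of the Kaggle Mushroom Classification data,
# which codes every categorical value as a single letter.
df = pd.DataFrame({
    "class": ["e", "p", "e"],       # e = edible, p = poisonous
    "cap-shape": ["x", "b", "x"],   # x = convex, b = bell
})

# Map each letter to its position in the alphabet (a=1, b=2, ..., z=26),
# giving the numeric coding the models require.
encoded = df.apply(lambda col: col.map(lambda c: ord(c) - ord("a") + 1))
```

Note that this coding imposes an arbitrary ordering on the categories; tree-based models handle that gracefully, while distance-based ones like K-Means treat the numbers as magnitudes.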

(1) Distinguishing Between Mushroom Species

Given that the data does not provide the true species of a sample but does tell us that there are 23 species present in the dataset, can we determine that there are in fact 23 species here? The answer is yes!

Using an unsupervised learning technique like K-Means, we can check whether clusters are present in our data. Looping through different values of k and evaluating the sum of squared errors (SSE) for each, we see the following results.
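The k-loop described above can be sketched like this. Random data stands in for the encoded mushroom features here, and the range of k values is an assumption, so the resulting SSE curve is illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((200, 5))  # stand-in for the numerically encoded features

# Fit K-Means for each candidate k and record the SSE (sklearn calls it inertia_:
# the sum of squared distances from each point to its cluster center).
sse = {}
for k in range(2, 31):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_
```

Plotting `sse` against k produces the elbow curve: the point where the decline flattens suggests a natural cluster count.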

Finding the optimal k value for the K-Means algorithm. K=23 and on look pretty good!

Normally, this plot would yield a more defined “elbow” in the curve, indicating the point where additional values of k do not provide significant additional model performance. But, we don’t see such a defined point here. There is still a gradual decline in the SSE from k = 23 and on, but it is much less steep than the earlier values of k.

So, although not perfectly clear, we can say that the model would be able to distinguish between species of mushroom — at least into 23 different species, and perhaps into additional sub-species as well.

(2) Identifying Important Features for Edible vs. Poisonous Mushrooms

Because both the predictor variables and the output variable are categorical, a chi-squared (chi2) test was used to determine which variables carried the most weight in the model. Below, we see that four variables stand out in particular. Two more could be considered runners-up, but the rest pale in comparison.

Variables and their corresponding chi2 scores. The higher the better!

Linking the feature numbers to their variable names, we see the following results. Features 10 and 18 are the aforementioned runners-up.

  • Feature 3: bruises
  • Feature 6: gill-spacing
  • Feature 7: gill-size
  • Feature 8: gill-color
  • Feature 10: stalk-root
  • Feature 18: ring-type
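The chi-squared scoring can be reproduced with scikit-learn’s `SelectKBest`, sketched below on synthetic data (the real features and labels would come from the encoded dataset, so the scores themselves are placeholders).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(1, 13, size=(300, 22))  # 22 encoded categorical features (chi2 needs non-negative values)
y = rng.integers(0, 2, size=300)         # 0 = edible, 1 = poisonous

# Score every feature against the class label and keep the four highest.
selector = SelectKBest(score_func=chi2, k=4).fit(X, y)
scores = selector.scores_                 # one chi2 score per feature
top4 = np.argsort(scores)[-4:][::-1]      # indices of the four strongest features
```

On the real data, the four indices returned would correspond to bruises, gill-spacing, gill-size, and gill-color.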

So, yes, we can confidently say that there are certain mushroom characteristics that are more significant in distinguishing between edible vs. poisonous mushrooms. Gills seem to be particularly important in doing so!

(3) (Accurately) Classifying Poisonous & Edible Mushrooms

To do this, several approaches were explored — logistic regression, KNN, and random forest models. All models were run using the full set of predictor variables, only the top four most important variables, and the top four with the two runner-up variables. We would expect to see the models using the top four to have the lowest performance but still be fairly close to the performance of the all-variable model since these are the most significant in the dataset.
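A minimal version of that model comparison might look like the following. The data here is synthetic (with the target tied to one feature so the models have something to learn), and the hyperparameters are assumptions rather than the ones used in the original analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.integers(1, 13, size=(500, 22)).astype(float)
y = (X[:, 7] > 6).astype(int)  # synthetic target driven by a "gill"-like feature

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit each classifier on the same split and compare test accuracy.
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
accuracies = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
              for name, m in models.items()}
```

Rerunning the dictionary comprehension with `X[:, top_features]` in place of `X` gives the reduced-variable variants described above.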

Here’s how each model performed using the different sets of variables.

Model performance with different sets of predictor variables.

As expected, the simpler model (logistic regression) did not perform as well as the more advanced classifiers — yielding only 95.5% accuracy even when using all predictors. Usually, that’s pretty good, but we don’t want to take any chances here! That 4.5% error could mean the difference between a delicious and a deadly bite. We also see that the models using only the top four or six variables had a lower accuracy rate, as we expected.

Both the KNN and random forest models performed nearly identically across all variations of predictor variables. With only the top four, each achieved an accuracy rate of ~95%; adding in the two runners-up raised that to 97.6%; and finally, when all features were taken into account, the models classified every sample perfectly.

So, yes, we definitely can use machine learning to accurately determine if a mushroom is poisonous or edible based on its characteristics!

Just 23 Species?

Revisiting our analysis of whether we can see the 23 species within the data itself, two K-Means models were run and evaluated. One model used k = 23, representing the true number of species, and another used k = 30, which had the lowest SSE among all k values tested.
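That head-to-head comparison can be sketched as below, again with random stand-in data for the top-four encoded features, so the actual inertia values are illustrative rather than the ones from the analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((400, 4))  # stand-in for the top-four encoded features

# Fit K-Means at the true species count (23) and at the SSE-minimizing k (30),
# then compare their SSE (inertia_) directly.
inertia = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in (23, 30)}
```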

K-Means performance with different k values and variables.

We see that the top-four-variable model significantly outperformed both the top-six and all-variable models, which makes sense since the additional variables would introduce unwanted variance into the data. Even so, k = 30 significantly outperformed the true-species value of k as well.

This could be explained as starting to overfit the data, which is likely since nearly a third more clusters were added. Or, it could indicate that among some of these species there are actually sub-species present in the dataset.

For the code used in this analysis, visit my GitHub repository.
