[Dr.Lib] Data Mining: KNN and Decision Tree

Recently I've wanted to write a technical post in English.
It would have been about a fully modular ASP.NET assembly, but I'll wait until .NET 2.1's release because it fixes a critical issue related to my project.
So I decided to write down my Data Mining course's laboratory report instead, as a reminder and a note.


I use Anaconda3 as my Python 3 environment. Here is my conda environment.yml for each experiment.
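In case the attachment goes missing, it was roughly this minimal sketch (the exact package list and versions here are assumptions, not the original file):

```yaml
name: datamining
channels:
  - defaults
dependencies:
  - python=3.6
  - numpy
  - scipy
  - pandas
  - scikit-learn
  - graphviz        # the dot binary, used to render trees
  - python-graphviz # the Python bindings
```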


Train decision trees and KNN in scikit-learn on the two datasets "mushroom" and "segment", then analyze their performance and characteristics based on the results.

Dataset "mushroom"

M. Original Data

The mushroom.dat is a space-separated matrix ready for numpy.loadtxt.
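Loading it is a one-liner. A self-contained sketch (the two-row stand-in and the assumption that the last column is the class label are mine):

```python
import io

import numpy as np

# mushroom.dat is one sample per row, space-separated numbers;
# this tiny stand-in keeps the example runnable without the file.
sample = io.StringIO("25 1 3 7\n26 0 3 7\n")
data = np.loadtxt(sample)  # for the real file: np.loadtxt("mushroom.dat")

# Assumption: the last column is the class label (edible/poisonous).
X, y = data[:, :-1], data[:, -1]
```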

M. Decision Tree

Done. Err… let's see what we got first.
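The "done" step was roughly this (a reconstruction, not the original script; the synthetic stand-in data and the rendering call in the comment are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Stand-in for np.loadtxt("mushroom.dat"): integer codes plus a label.
rng = np.random.RandomState(0)
X = rng.randint(0, 10, size=(100, 2)).astype(float)
y = (X[:, 0] > 4).astype(int)

clf = DecisionTreeClassifier().fit(X, y)

# export_graphviz emits DOT source; piping it through the graphviz
# package, e.g. graphviz.Source(dot).render("mushroom", format="pdf"),
# produces the mushroom.pdf mentioned below.
dot = export_graphviz(clf, out_file=None)
```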

We got a mushroom.pdf in our working directory.
It's a graph describing the structure of the tree we built, but we can hardly extract useful information from it, as there are only numbers instead of pretty names.
Also, the data are interpreted as continuous numeric variables when they are actually categorical.
The mushroom.dat we used above has already been pre-processed, so I downloaded the raw CSV from https://www.kaggle.com/uciml/mushroom-classification and built a name-number mapping using this script.
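The idea of the script is simple enough to sketch (this is a reconstruction; the two-row stand-in DataFrame and the globally increasing counter are assumptions, though the counter would explain why odor codes land in the 25–33 range):

```python
import pandas as pd

# Stand-in for the raw Kaggle CSV, i.e. pd.read_csv("mushrooms.csv").
raw = pd.DataFrame({"class": ["p", "e"], "odor": ["p", "a"]})

# Give every (column, category) pair a unique number, mirroring the
# numeric encoding used in mushroom.dat.
mapping, counter = {}, 0
for col in raw.columns:
    for cat in sorted(raw[col].unique()):
        mapping[(col, cat)] = counter
        counter += 1

# Apply the mapping column by column to get the numeric matrix.
encoded = raw.apply(lambda s: s.map(lambda v: mapping[(s.name, v)]))
```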

Now we have a categorical data set with meaningful values.
But before we feed them to the DecisionTreeClassifier…
Uh, DecisionTreeClassifier does not handle categorical data: "the decision trees implemented in scikit-learn uses only numerical features and these features are interpreted always as continuous numeric variables" (there is a PR in sklearn for this, but it is not merged yet).

Back to two paragraphs above, where I said "The mushroom.dat we used above has already been pre-processed": I realized that our teacher encoded the data numerically, so we don't need to worry about DecisionTreeClassifier rejecting the CSV. Some people on Stack Overflow also suggest OneHotEncoder for categorical data. Will the performance differ if we apply a OneHotEncoder to the features?
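The one-hot variant can be produced straight from the numeric matrix (a sketch; the three-row example matrix is an assumption):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Numeric encoding: each column holds integer category codes.
X_num = np.array([[25, 1], [26, 0], [28, 1]], dtype=float)

# One-hot encoding: one binary column per (feature, category) pair.
# Column 0 has 3 distinct codes and column 1 has 2, so we get 5 columns.
X_hot = OneHotEncoder().fit_transform(X_num).toarray()
```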

M. DTC: Accuracy

We use random sampling to evaluate these two pre-processing methods by the Decision Tree Classifier's accuracy. Here is the full Python script for this step.

For reproducibility, I generated five random numbers as random_state inputs. By default, train_test_split uses test_size = 0.25. Let's run test() to see the scores for the five rounds and the two encodings.

…Perfect scores for all the cases. Actually, even with test_size = 0.75 most scores are still 1.00000, and we can still get a highly accurate model with test_size = 0.95.
There is not much difference between the two encodings, though OneHot may perform worse than Numeric with a small training set.

M. DTC: Visualize and Analyze

Having verified the effectiveness of our trained model, let's go back to the visualized tree. Now that we have the names of the features, we can get more out of the graph.
Using random_state = 23 and the entire set as training data, we get this picture:
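Passing the recovered names to export_graphviz is what makes the graph readable; a sketch (the single-feature stand-in data and the label strings are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Stand-in data: one "odor" column with codes in the 25-33 range.
rng = np.random.RandomState(23)
X = rng.randint(25, 34, size=(100, 1)).astype(float)
y = (X[:, 0] >= 29).astype(int)

clf = DecisionTreeClassifier(random_state=23).fit(X, y)

# feature_names/class_names replace the bare indices in the graph.
dot = export_graphviz(
    clf, out_file=None,
    feature_names=["odor"],
    class_names=["edible", "poisonous"],
    filled=True)
```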

We can quickly see that odor plays a significant role in telling the poisonous mushrooms from the edible ones, as it occupies the first two nodes.
For the odor feature, we have this mapping:

Yup, we can say that all the mushrooms that smell terrible are poisonous.
And for the nose-friendly mushrooms, there is a small chance you still get a poisonous one. We can perform a similar analysis on the other features.

M. K-Nearest Neighbors

For KNN, we can reuse most of the code from the decision tree. I tidied up the code a little; here is the full script:

M. KNN: Performance and Accuracy

Running test_knn(), we instantly find the difference: it runs slower and is less accurate than the decision tree.

And with a smaller training set, the scores drop quickly. In contrast to DTC, OneHot outperforms Numeric in most cases. But the scores are still good enough, as most of them are higher than 0.99.

M. Further Thoughts

Why does the decision tree work so well, and better than the KNN approach?
We already know that odor plays a significant part in the classification. With DTC, the top two nodes use the mushroom's odor to quickly distinguish most poisonous ones, but for KNN no feature counts more than any other, and the distance doesn't make much sense here because the features are inherently categorical. This also explains why OneHot encoding fits KNN well.
Here is another thing about the numeric encoding that caught my attention: although the odor feature is categorical, the numeric encoding coincidentally placed the poisonous smells together.

odor <= 25.5: (25, 'pungent') is poisonous
25.5 < odor <= 28.5: (26, 'almond'), (27, 'anise'), (28, 'none') are mostly edible (actually all 26s and 27s are edible)
odor > 28.5: (29, 'foul'), (30, 'creosote'), (31, 'fishy'), (32, 'spicy'), (33, 'musty') are all poisonous

What if we mix them up a bit? I modified the data a little in mushroom.obf.dat:

(25, 'pungent') is poisonous
(26, 'foul') is poisonous
(27, 'almond') is edible
(28, 'creosote') (29, 'fishy') are poisonous
(30, 'none') is mostly edible
(31, 'spicy') is poisonous
(32, 'anise') is edible
(33, 'musty') is poisonous

Although the dataset is identical to the non-obfuscated one, it generates a very different decision tree.

The final script with all the data and reports is packed here.

Dataset "segment"

S. Original Data

There are two ARFF-formatted .txt files, "segment-train.txt" and "segment-test.txt". We can load them with scipy.io.arff and then convert them to a pandas DataFrame.
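A self-contained sketch of the loading step (the two-attribute ARFF snippet stands in for open("segment-train.txt"); note that loadarff returns nominal values as bytes, so they need decoding):

```python
import io

import pandas as pd
from scipy.io import arff

# Minimal ARFF snippet standing in for segment-train.txt.
sample = io.StringIO(
    "@relation segment\n"
    "@attribute width numeric\n"
    "@attribute class {brickface,sky}\n"
    "@data\n"
    "1.0,sky\n"
    "2.0,brickface\n")

data, meta = arff.loadarff(sample)
df = pd.DataFrame(data)

# Nominal attributes come back as bytes; decode them to str.
df["class"] = df["class"].str.decode("utf-8")
```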

S. Decision Tree

We can easily reuse the code above.

S. DTC: Performance and Accuracy

Unlike on dataset "mushroom", DTC's average F1 scores are merely 0.96. That means we need to optimize our decision tree to achieve better performance.
At first glance at the visualized tree, I immediately spotted some clues of overfitting. Let's run a test against the training set.

Yep, full marks. We've got a slightly overfitting tree. Let's follow the tips from http://scikit-learn.org/stable/modules/tree.html#tips-on-practical-use and seek better performance.

After some rounds of testing, I got a better score with (min_samples_split=10, max_depth=9, criterion='entropy', splitter='best').
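For reference, the winning configuration looks like this (the cross-validated demo on synthetic data is an assumption; the segment features themselves are not reproduced here):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    min_samples_split=10,  # don't split nodes with fewer than 10 samples
    max_depth=9,           # cap tree depth to curb overfitting
    criterion="entropy",   # information gain instead of Gini impurity
    splitter="best",       # the default: pick the best split, not random
)

# Demo fit on synthetic data standing in for the segment features.
rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
print(cross_val_score(clf, X, y, cv=5).mean())
```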

S. DTC: Visualize

S. K-Nearest Neighbors

S. KNN: Performance and Accuracy

I don't know why normalizing the data leads to worse results… And the best k for this dataset is 1…
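The comparison behind that observation looked roughly like this (a sketch; the choice of StandardScaler, the k values, and the synthetic demo data are assumptions):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the segment features.
rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = (X[:, 0] > 0.5).astype(int)

# Compare raw vs. normalized features across a few values of k.
for k in (1, 3, 5):
    raw = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    scaled = cross_val_score(
        make_pipeline(StandardScaler(),
                      KNeighborsClassifier(n_neighbors=k)),
        X, y, cv=5)
    print(k, raw.mean(), scaled.mean())
```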
And the final script is here: segment.full.zip

CC BY-SA 4.0 [Dr.Lib] Data Mining: KNN and Decision Tree by Liqueur Librazy is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

