Introduction to Decision Trees

Decision Trees

Decision trees are fundamental machine learning models used for both classification and regression tasks.

Classification Task Example

  • Predicting whether an email is spam or not spam.

Regression Task Example

  • Predicting the price of a house based on features like size, location, and number of rooms.

Structure of a Decision Tree

Nodes

  • Each internal node represents a test on a feature in the dataset.

Decision Nodes

  • These nodes split into branches based on the value of a feature.

Leaf Nodes

  • Terminal nodes that provide the final output or prediction.
  • In classification, each leaf node corresponds to a class label (e.g., Yes/No).
  • In regression, each leaf node represents a numerical value (e.g., predicted price).
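
To make this structure concrete, here is a minimal Python sketch of decision and leaf nodes; the class and field names are illustrative choices for this sketch, not taken from any particular library:

```python
class LeafNode:
    """Terminal node: holds the final prediction (a class label or a number)."""
    def __init__(self, prediction):
        self.prediction = prediction


class DecisionNode:
    """Internal node: tests one feature and routes a sample down a branch."""
    def __init__(self, feature, branches):
        self.feature = feature      # name of the feature to test, e.g. "Weather"
        self.branches = branches    # maps a feature value to a child node


def predict(node, sample):
    """Follow decision nodes until a leaf is reached, then return its prediction."""
    while isinstance(node, DecisionNode):
        node = node.branches[sample[node.feature]]
    return node.prediction


# A hand-built tree matching the Weather/Play example in the next section:
tree = DecisionNode("Weather", {"Sunny": LeafNode("Yes"), "Rainy": LeafNode("No")})
print(predict(tree, {"Weather": "Sunny"}))   # -> Yes
```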

Example

Features and Class Labels

  • Feature: Weather (Sunny, Rainy).
  • Class Label: Play (Yes, No).

Decision Tree Structure

  • Root Node: Weather.
    • If Sunny → Yes (Play).
    • If Rainy → No (Don’t Play).

This example shows how features (Weather) and class labels (Play) are used to make predictions in a decision tree.
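
As a quick sanity check, the same toy problem can be learned with scikit-learn's DecisionTreeClassifier; encoding Sunny as 0 and Rainy as 1 is an arbitrary choice made for this sketch:

```python
from sklearn.tree import DecisionTreeClassifier

# One feature (Weather), encoded numerically: Sunny -> 0, Rainy -> 1
X = [[0], [0], [1], [1]]          # Weather for four observations
y = ["Yes", "Yes", "No", "No"]    # Play labels

clf = DecisionTreeClassifier()
clf.fit(X, y)

print(clf.predict([[0]]))  # ['Yes'] -- Sunny -> Play
print(clf.predict([[1]]))  # ['No']  -- Rainy -> Don't Play
```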

Splitting Criteria in Decision Trees

The splitting criterion decides how the data is divided at each decision node in a decision tree. The goal is to make each resulting group as pure as possible (for classification) or as accurately predictable as possible (for regression).

1. For Classification Tasks

  • Gini Impurity:
    • Measures how mixed the classes are in a group.
    • A lower Gini value means better separation of classes.
  • Entropy (Information Gain):
    • Measures the disorder or randomness in the data.
    • A higher Information Gain means the split is better at reducing uncertainty.
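
Both measures are straightforward to compute from class counts. Here is a small sketch (the function names are my own, not from a library):

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy: -sum of p * log2(p) over class proportions p."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


labels = ["Yes", "Yes", "No", "No"]
print(gini(labels))      # 0.5 -> classes perfectly mixed
print(entropy(labels))   # 1.0 bit -> maximum uncertainty for two classes
print(information_gain(labels, [["Yes", "Yes"], ["No", "No"]]))  # 1.0 -> pure split
```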

2. For Regression Tasks

  • Mean Squared Error (MSE):
    • Measures the average squared difference between predicted and actual values.
    • Lower MSE means a more accurate prediction.
  • Mean Absolute Error (MAE):
    • Measures the average absolute difference between predicted and actual values.
    • Lower MAE means less error in predictions.
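
Here is a minimal sketch of both measures, assuming the leaf predicts the mean of its target values (the standard choice for variance-reduction splits); the example prices are hypothetical:

```python
def mse(values):
    """Mean squared error around the leaf's prediction (the mean)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def mae(values):
    """Mean absolute error around the leaf's prediction (here also the mean)."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


prices = [200, 220, 400, 420]   # hypothetical house prices in one node
print(mse(prices))  # 10100.0 -> large spread, so a split should reduce this
print(mae(prices))  # 100.0
```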

Example of Splitting Criteria in Decision Trees

Let’s take a simple dataset to explain Gini Impurity, Entropy, Mean Squared Error (MSE), and Mean Absolute Error (MAE).

Dataset: a small sample in which the class label Play is evenly split between Yes and No (50% each), as in the Weather example above.

Summary

  • Gini Impurity: Measures how mixed the classes are; lower values indicate better splits.
  • Entropy: Measures the disorder in the dataset; higher values indicate more uncertainty.
  • MSE: Measures prediction accuracy by averaging squared errors; lower values indicate better splits.
  • MAE: Measures prediction accuracy by averaging absolute errors; lower values indicate better splits.

The Gini Impurity value of 0.5 indicates that the classes are perfectly mixed (50% Yes and 50% No), which occurs when the class distribution is balanced. A Gini Impurity of 0 represents a pure node (all samples belong to one class), while 0.5 is the maximum for a two-class problem, reached when the classes are evenly distributed.
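
The arithmetic behind that 0.5 is easy to verify directly; the following sketch assumes the balanced Yes/No labels described above:

```python
from collections import Counter

play = ["Yes", "Yes", "No", "No"]   # balanced labels: 50% Yes, 50% No

n = len(play)
proportions = [count / n for count in Counter(play).values()]  # [0.5, 0.5]
gini_root = 1.0 - sum(p ** 2 for p in proportions)
print(gini_root)   # 0.5 -> maximum impurity for two evenly mixed classes

# After a perfect split (e.g. on Weather), each branch is pure and Gini drops to 0:
for branch in (["Yes", "Yes"], ["No", "No"]):
    k = len(branch)
    print(1.0 - sum((c / k) ** 2 for c in Counter(branch).values()))  # 0.0
```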
