Decision Trees
1. Decision Trees Cheat Sheet
High-Level Overview:
Decision Trees are supervised learning models that use a series of if-then splits to partition data into increasingly homogeneous subsets. Each split is based on a feature value, leading to a tree-like structure where leaf nodes provide a prediction (for classification or regression).
They are intuitive, easy to interpret, and can handle both classification (e.g., predicting if a customer will buy a product) and regression tasks (e.g., estimating sales revenue). Decision Trees often work well with tabular data—data arranged in rows and columns where each row represents an entity (like a customer) and each column represents a feature (like age, income, or region).
Key Concepts:
- Nodes: Points in the tree where a decision (split) is made based on a feature value.
- Branches: Paths extending from a node, representing different outcomes of a split.
- Leaves: Terminal nodes that provide a final prediction (class label or numeric value).
- Splitting Criterion: A measure (like Gini impurity, entropy for classification, or variance reduction for regression) that guides how to choose the best feature/value split.
How They Work:
- Start at the Root: The entire dataset (e.g., a spreadsheet of customers with columns for age, income, spending score) is at the root node.
- Choose a Split: Find a feature and threshold that best separates the data into more “pure” subsets (a small search sketch follows this list). For classification, purity might mean one subset has mostly "buy" customers and another mostly "not buy" customers.
- Split the Data: Partition the dataset into subsets based on that feature threshold, sending some rows down one branch and the rest down another.
- Repeat Recursively: Continue splitting child nodes until reaching stopping criteria, such as a maximum depth or a minimum number of samples per leaf.
- Leaf Nodes: Contain final predictions, such as a single class label for classification or an average target value for regression.
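The "choose a split" step is a greedy search: try candidate thresholds, score each by how pure the resulting subsets would be, and keep the best one. Below is a minimal sketch of that search for a single numeric feature using Gini impurity (defined formally in the next section). The data is invented, and gini and best_threshold are illustrative helper names, not library functions.
import numpy as np
def gini(labels):
    # Gini impurity of a set of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))
def best_threshold(feature_values, labels):
    # Try each observed value as a threshold and keep the one that gives
    # the lowest weighted Gini impurity across the two child subsets.
    best_t, best_score = None, float("inf")
    for t in np.unique(feature_values):
        left, right = labels[feature_values <= t], labels[feature_values > t]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score
# Toy usage: find the income threshold that best separates buyers from non-buyers
income = np.array([20, 35, 50, 80, 90, 120])
bought = np.array(["no", "no", "no", "yes", "yes", "yes"])
print(best_threshold(income, bought))  # income <= 50 yields two pure subsets
A real tree builder repeats this search over every feature at every node and picks the overall best split.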
Common Criteria for Classification Splits:
Gini Impurity:
\[ \text{Gini}(D) = \sum_{c=1}^{C} p(c)(1 - p(c)) \]
Entropy:
\[ \text{Entropy}(D) = -\sum_{c=1}^{C} p(c)\log_2 p(c) \]
Here, \( p(c) \) is the fraction of items in dataset \( D \) belonging to class \( c \). Lower impurity or entropy means more purity (fewer mixed classes).
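As a quick numeric check of both formulas, here is a small sketch computing the two measures for a made-up label set that is 80% one class and 20% the other:
import numpy as np
labels = np.array(["buy", "buy", "buy", "buy", "not_buy"])  # 80% / 20% split
_, counts = np.unique(labels, return_counts=True)
p = counts / counts.sum()
gini = np.sum(p * (1 - p))          # 0.8*0.2 + 0.2*0.8 = 0.32
entropy = -np.sum(p * np.log2(p))   # roughly 0.72 bits
print(gini, entropy)
A perfectly pure set gives 0 for both measures; a 50/50 split gives the maximum (0.5 for Gini, 1 bit for entropy).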
Regression Trees:
For regression tasks, decision trees try to minimize variance within subsets: \[ \text{Variance}(D) = \frac{1}{|D|}\sum_{x \in D}(y_x - \bar{y})^2 \] Splits are chosen to create subsets where target values (e.g., house prices) vary less.
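A minimal regression sketch, assuming scikit-learn's DecisionTreeRegressor and made-up house-price data:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
# Feature: house size in square meters; target: price in thousands (synthetic numbers)
X = np.array([[50], [60], [80], [100], [120], [150]])
y = np.array([150, 180, 240, 310, 360, 450])
reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)
# Each leaf predicts the mean target value of the training rows that reach it
print(reg.predict([[90]]))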
What Kind of Datasets Are Decision Trees Good For?
Decision Trees are versatile and can work well on:
- Tabular Data: Where each row is an entity (e.g., a transaction) and columns are features (e.g., product category, time of day, payment method). For example, an online retailer's sales record dataset where each row represents a purchase, and columns are features like product_price, user_age, user_location.
- Real-World Structured Problems:
- Customer Churn Prediction: Predicting whether a telecom customer will leave or stay, using features like monthly charges, contract length, and internet service type.
- Medical Decision Support: Using patient data (age, blood pressure, cholesterol, smoker/non-smoker) to decide if a patient is at high risk of heart disease.
- Quality Control in Manufacturing: Classifying whether a product is defective or not based on features like material density, production line speed, and humidity levels in the factory.
Detailed Example (Toy Classification):
Suppose you want to predict whether a credit card transaction is fraudulent. Your dataset might have features:
- Transaction Amount
- Time of Day
- Location_Match (does the location match the cardholder’s usual location?)
A trained tree might first split on Location_Match:
- If Location_Match = No → follow this branch, which carries a higher chance of fraud
- If Location_Match = Yes → follow the other branch
Within a branch (say, the higher-risk one), the tree can split again on Transaction Amount:
- High-amount transactions might be very suspicious (a fraud leaf)
- Moderate amounts might lead to further splitting on Time of Day
The same logic is written out as code right after this example.
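Written out as explicit if-then logic, that hypothetical tree might look like the sketch below. The 500 threshold, the "night" rule, and the branch layout are invented for illustration; a real tree learns them from data.
def classify_transaction(amount, time_of_day, location_match):
    # Hand-written version of the hypothetical tree described above
    if not location_match:            # Location_Match = No branch (higher risk)
        if amount > 500:              # high amounts look very suspicious
            return "Fraud"
        if time_of_day == "night":    # moderate amounts: split further on time of day
            return "Fraud"
        return "NoFraud"
    return "NoFraud"                  # Location_Match = Yes branch
print(classify_transaction(700, "night", False))  # -> Fraud
print(classify_transaction(40, "morning", True))  # -> NoFraud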
Step-by-Step Summary:
- Collect & Label Data: You have examples with known targets (class labels or numeric values).
- Select a Split Criterion: Gini or entropy for classification, variance reduction for regression.
- Build the Tree: Recursively split the data into subsets aiming for more purity.
- Stop & Predict: Stop when no improvement can be made or you hit stopping criteria (e.g., max depth), and leaf nodes provide final predictions.
- Use the Tree: For a new instance, follow the decision path down the tree to a leaf, giving your prediction.
ML Engineer Perspective:
- Decision Trees are easy to interpret; you can explain them to non-technical stakeholders by showing the if-then splits.
- No feature scaling or heavy preprocessing required; trees split on numeric and categorical features alike, though scikit-learn's implementation expects categorical features to be encoded as numbers first.
- They can overfit if allowed to grow too deep; pruning or setting a max depth can help (a short sketch follows this list).
- A good starting model for many tabular datasets: if you have a CSV file of entities (rows) and features (columns), a decision tree can quickly provide a baseline model that’s understandable.
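As a rough sketch of how depth and leaf-size limits are set in scikit-learn, using synthetic data; the specific parameter values here are arbitrary, not recommendations:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Synthetic tabular data standing in for a CSV of entities (rows) and features (columns)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# An unconstrained tree can memorize the training set (overfit)
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Limiting depth and leaf size (or pruning via ccp_alpha) restrains the tree
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=0).fit(X_train, y_train)
print("unconstrained train/test accuracy:", deep.score(X_train, y_train), deep.score(X_test, y_test))
print("constrained train/test accuracy:", shallow.score(X_train, y_train), shallow.score(X_test, y_test))
Comparing training and test accuracy for the two models is a simple way to see how much the unconstrained tree overfits.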
Code Example (Python with scikit-learn):
import numpy as np
from sklearn.tree import DecisionTreeClassifier
# Example: Classifying if a transaction is fraudulent
# Features: [Transaction_Amount, TimeOfDay_Morning(0/1), Location_Match(0/1)]
X = np.array([
[600, 1, 0], # Fraudulent
[20, 1, 1], # Not Fraudulent
[50, 0, 1], # Not Fraudulent
[700, 0, 0], # Fraudulent
[300, 1, 1], # Not Fraudulent
[100, 1, 1] # Not Fraudulent
])
y = np.array(['Fraud', 'NoFraud', 'NoFraud', 'Fraud', 'NoFraud', 'NoFraud'])
model = DecisionTreeClassifier(random_state=42)
model.fit(X, y)
# Predict if a new transaction (amount=450, morning=1, location_match=0) is fraudulent
new_transaction = np.array([[450, 1, 0]])
prediction = model.predict(new_transaction)
print("Predicted Class:", prediction[0])
This example uses a small synthetic dataset of transactions; the decision tree learns splits that separate the fraudulent examples from the legitimate ones.
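To see exactly which splits were learned, scikit-learn can print the tree as if-then rules. This short continuation reuses the model fitted in the example above:
from sklearn.tree import export_text
# Print the learned if-then rules with the feature names from the example
rules = export_text(model, feature_names=["Transaction_Amount", "TimeOfDay_Morning", "Location_Match"])
print(rules)
Printed rules like these are what make decision trees easy to explain to non-technical stakeholders.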
Key Takeaways:
- Decision Trees split data using features to create a branching structure leading to a final prediction.
- They are intuitive, handle various types of features (numeric, categorical), and require little preprocessing.
- They can overfit if not controlled (prune, limit depth), but provide an easy-to-understand baseline model.
- Suitable for a wide range of tabular, real-world problems: churn prediction, fraud detection, medical risk assessment, and more.