Tree Methods

Regression Models

Decision Tree Regressor

Recursive binary splitting based on MSE reduction

Original MSE from all Y values
For each feature
For each possible split (based on X values)
Split Y values into two groups
Calculate mean and MSE for each group
Combine MSEs with weighted average
Information Gain = Original MSE - Weighted MSE
Track split with highest gain for this feature
Select best split across all features
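The split search above can be sketched in a few lines of Python. This is a minimal illustration working on NumPy arrays, not a library implementation; the helper names (mse, best_split) and the single-feature interface are choices made here for clarity.

```python
import numpy as np

def mse(y):
    """Mean squared error of a group around its own mean (0 for an empty group by convention)."""
    return 0.0 if len(y) == 0 else float(np.mean((y - y.mean()) ** 2))

def best_split(x, y):
    """Scan candidate thresholds on one feature; return the threshold with the highest gain."""
    parent_mse = mse(y)
    best_gain, best_threshold = 0.0, None
    for t in np.unique(x):                          # candidate thresholds come from the X values
        left, right = y[x < t], y[x >= t]
        weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
        gain = parent_mse - weighted                # information gain = original MSE - weighted MSE
        if gain > best_gain:
            best_gain, best_threshold = gain, t
    return best_threshold, best_gain
```

For a multi-feature dataset the same scan is simply repeated per feature, keeping the overall best (feature, threshold) pair.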

After finding the best split:

Create the split
Use the best split to divide the data into two child nodes
Recursively repeat the process
On each child node:
• For the left child, find the best split among all its data points
• For the right child, find the best split among all its data points
• Each child becomes a new decision node with its own best split
Continue until stopping criteria are met
• Maximum depth reached
• Minimum samples per leaf reached
• Information gain below threshold
• Node becomes "pure enough"
Create leaf nodes
When splitting stops, the node becomes a leaf that predicts the mean of all Y values in that node
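A compact recursive builder, reusing best_split from the sketch above; the stopping criteria mirror the list above, with illustrative hyperparameter names:

```python
def build_tree(x, y, depth=0, max_depth=3, min_samples_leaf=1, min_gain=1e-7):
    """Grow the tree recursively until a stopping criterion is hit."""
    threshold, gain = best_split(x, y)
    # Stopping criteria: depth limit, too few samples, or not enough gain -> make a leaf
    if depth >= max_depth or len(y) <= min_samples_leaf or threshold is None or gain < min_gain:
        return {"leaf": True, "value": float(np.mean(y))}   # leaf predicts the mean of its Y values
    left, right = x < threshold, x >= threshold
    return {
        "leaf": False,
        "threshold": float(threshold),
        "left": build_tree(x[left], y[left], depth + 1, max_depth, min_samples_leaf, min_gain),
        "right": build_tree(x[right], y[right], depth + 1, max_depth, min_samples_leaf, min_gain),
    }
```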

Prediction Example:

New data point: X₁=4, X₂=7
To predict, start at the root and follow each node's split condition (left if the feature value is below the threshold, right otherwise) until a leaf is reached; the leaf's stored mean of Y is the prediction.
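As a sketch, using the single-feature toy data from the worked example below (so only X₁ = 4 comes into play) and the helpers defined above:

```python
def predict_one(node, x_value):
    """Walk from the root to a leaf and return the leaf's mean."""
    while not node["leaf"]:
        node = node["left"] if x_value < node["threshold"] else node["right"]
    return node["value"]

tree = build_tree(np.array([1, 3, 5, 7, 9]), np.array([10.0, 20.0, 30.0, 40.0, 50.0]))
print(predict_one(tree, 4))   # 20.0 -> the mean of the leaf reached via X < 5, then X >= 3
```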

Decision Tree Split Analysis - Regressor

Original MSE

Feature X values: [1, 3, 5, 7, 9]

All Y values: [10, 20, 30, 40, 50]

MSE = ((10-30)² + (20-30)² + (30-30)² + (40-30)² + (50-30)²) / 5 = 200

Split 1: X < 1 vs X ≥ 1
Group 1 (X < 1): X = [], Y = [] (empty) → MSE = 0 (undefined for an empty group; taken as 0 by convention)
Group 2 (X ≥ 1): X = [1, 3, 5, 7, 9], Y = [10, 20, 30, 40, 50] → MSE = 200
Weighted MSE = (0/5) × 0 + (5/5) × 200 = 200
Information Gain = 200 - 200 = 0
Split 2: X < 3 vs X ≥ 3
Group 1 (X < 3): X = [1], Y = [10] → MSE = 0
Group 2 (X ≥ 3): X = [3, 5, 7, 9], Y = [20, 30, 40, 50] → MSE = 125
Weighted MSE = (1/5) × 0 + (4/5) × 125 = 100
Information Gain = 200 - 100 = 100
Split 3: X < 5 vs X ≥ 5
Group 1 (X < 5): X = [1, 3], Y = [10, 20] → MSE = 25
Group 2 (X ≥ 5): X = [5, 7, 9], Y = [30, 40, 50] → MSE = 66.67
Weighted MSE = (2/5) × 25 + (3/5) × 66.67 = 10 + 40 = 50
Information Gain = 200 - 50 = 150
Split 4: X < 7 vs X ≥ 7
Group 1 (X < 7): X = [1, 3, 5], Y = [10, 20, 30] → MSE = 66.67
Group 2 (X ≥ 7): X = [7, 9], Y = [40, 50] → MSE = 25
Weighted MSE = (3/5) × 66.67 + (2/5) × 25 = 40 + 10 = 50
Information Gain = 200 - 50 = 150
Split 5: X < 9 vs X ≥ 9
Group 1 (X < 9): X = [1, 3, 5, 7], Y = [10, 20, 30, 40] → MSE = 125
Group 2 (X ≥ 9): X = [9], Y = [50] → MSE = 0
Weighted MSE = (4/5) × 125 + (1/5) × 0 = 100
Information Gain = 200 - 100 = 100

Result

Splits 3 and 4 tie for highest Information Gain (150).
The algorithm would choose one of them (often the first encountered).
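The table above can be reproduced with the mse helper from the earlier sketch:

```python
x = np.array([1, 3, 5, 7, 9])
y = np.array([10, 20, 30, 40, 50])
parent = mse(y)                                     # 200.0
for t in x:
    left, right = y[x < t], y[x >= t]
    weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
    print(f"X < {t}: weighted MSE = {weighted:.2f}, gain = {parent - weighted:.2f}")
# Prints gains of 0, 100, 150, 150, 100 -- matching Splits 1-5 above
```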

Random Forest Regressor

Bootstrap aggregating with feature randomness

Create N bootstrap samples (with replacement)
For each bootstrap sample (in parallel)
Original MSE from Y values in this sample
For each node split
Randomly select subset of features (√n_features)
For each selected feature
For each possible split (based on X values)
Split Y values into two groups
Calculate mean and MSE for each group
Combine MSEs with weighted average
Information Gain = Original MSE - Weighted MSE
Track split with highest gain for selected features
Select best split and continue building tree
Average predictions from all N trees
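A minimal bagging sketch built on the earlier build_tree/predict_one helpers. Feature subsampling (the √n_features step) is omitted only because the toy data has a single feature; rng, fit_forest, and forest_predict are names chosen here:

```python
rng = np.random.default_rng(0)

def fit_forest(x, y, n_trees=10):
    """Train each tree on a bootstrap resample (sampling rows with replacement)."""
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample indices
        forest.append(build_tree(x[idx], y[idx]))
    return forest

def forest_predict(forest, x_value):
    """Average the predictions of all trees."""
    return float(np.mean([predict_one(tree, x_value) for tree in forest]))
```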

XGBoost Regressor

Gradient boosting with sequential error correction

Initialize predictions (mean of Y values)
For each tree (sequentially)
Calculate residuals = Y - current_predictions
Original MSE from residuals (not original Y)
For each feature
For each possible split (based on X values)
Split residuals into two groups
Calculate mean and MSE for each group
Combine MSEs with weighted average
Information Gain = Residual MSE - Weighted MSE
Track split with highest gain for this feature
Select best split across all features
Update predictions += learning_rate × tree_prediction
Final prediction = sum of all tree contributions
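A sketch of the residual-fitting loop, again reusing the earlier helpers. This is plain gradient boosting for squared error; XGBoost itself adds second-order gradients, regularized gain, and column subsampling, which are omitted here:

```python
def fit_boosted(x, y, n_trees=50, learning_rate=0.1):
    """Each new tree is fit to the residuals of the current ensemble prediction."""
    base = float(np.mean(y))                         # initial prediction: mean of Y
    trees, pred = [], np.full(len(y), base)
    for _ in range(n_trees):
        residuals = y - pred                         # what the ensemble still gets wrong
        tree = build_tree(x, residuals, max_depth=2)
        trees.append(tree)
        pred = pred + learning_rate * np.array([predict_one(tree, v) for v in x])
    return base, trees

def boosted_predict(base, trees, x_value, learning_rate=0.1):
    """Final prediction = base value + sum of scaled tree contributions."""
    return base + learning_rate * sum(predict_one(tree, x_value) for tree in trees)
```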
[Figure: Decision Tree Visualization]

Classification Models

Decision Tree Classifier

Recursive binary splitting based on Gini impurity reduction

Data Preprocessing:

Convert continuous Y values [10, 20, 30, 40, 50] into categorical classes: ["low", "low", "medium", "high", "high"]
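One way to do this binning with pandas; the cut points (≤25 = low, 25–35 = medium, >35 = high) are an assumption chosen only to reproduce the labels used here:

```python
import pandas as pd

y = [10, 20, 30, 40, 50]
# Bin edges are illustrative, picked to yield the class labels used in this section
classes = pd.cut(y, bins=[0, 25, 35, 60], labels=["low", "medium", "high"])
print(list(classes))   # ['low', 'low', 'medium', 'high', 'high']
```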

Original Gini impurity from all Y classes
For each feature
For each possible split (based on X values)
Split Y classes into two groups
Calculate class frequencies and Gini impurity for each group
Combine Gini impurities with weighted average
Information Gain = Original Gini - Weighted Gini
Track split with highest gain for this feature
Select best split across all features
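The same split scan as the regression sketch, now scored with Gini impurity; gini and best_split_gini are illustrative helper names:

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum(p_k^2); 0 for an empty group by convention."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split_gini(x, labels):
    """Pick the threshold whose weighted Gini gives the largest impurity reduction."""
    parent = gini(labels)
    best_gain, best_threshold = 0.0, None
    for t in sorted(set(x)):
        left = [lab for xi, lab in zip(x, labels) if xi < t]
        right = [lab for xi, lab in zip(x, labels) if xi >= t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if parent - weighted > best_gain:
            best_gain, best_threshold = parent - weighted, t
    return best_threshold, best_gain
```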

After finding the best split:

Create the split
Use the best split to divide the data into two child nodes
Recursively repeat the process
On each child node:
• For the left child, find the best split among all its data points
• For the right child, find the best split among all its data points
• Each child becomes a new decision node with its own best split
Continue until stopping criteria are met
• Maximum depth reached
• Minimum samples per leaf reached
• Information gain below threshold
• Node becomes "pure enough"
Create leaf nodes
When splitting stops, the node becomes a leaf that predicts the most frequent class of all Y values in that node

Prediction Example:

New data point: X₁=4, X₂=7
To predict, traverse the tree from the root, following each node's split condition until a leaf is reached; the leaf's most frequent class is the prediction.
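As a sketch using scikit-learn on the single-feature toy data from the worked example below (so only X₁ = 4 is used); the tiny dataset is only there to make the snippet runnable:

```python
from sklearn.tree import DecisionTreeClassifier

X = [[1], [3], [5], [7], [9]]                        # the single feature from the worked example
y = ["low", "low", "medium", "high", "high"]
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(clf.predict([[4]]))                            # ['low'] -- the majority class of the leaf X=4 lands in
```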

Decision Tree Split Analysis - Classifier

Data Preprocessing

Convert continuous Y values [10, 20, 30, 40, 50] into categorical classes: ["low", "low", "medium", "high", "high"]

Original Gini Impurity

Feature X values: [1, 3, 5, 7, 9]

All Y classes: ["low", "low", "medium", "high", "high"]

Class frequencies: "low"=2, "medium"=1, "high"=2

Class probabilities: P("low")=2/5, P("medium")=1/5, P("high")=2/5

Gini = 1 - (2/5)² - (1/5)² - (2/5)² = 1 - 0.16 - 0.04 - 0.16 = 0.64

Split 1: X < 1 vs X ≥ 1
Group 1 (X < 1): X = [], Y = [] (empty) → Gini = 0 (undefined for an empty group; taken as 0 by convention)
Group 2 (X ≥ 1): X = [1, 3, 5, 7, 9], Y = ["low", "low", "medium", "high", "high"] → Gini = 0.64
Weighted Gini = (0/5) × 0 + (5/5) × 0.64 = 0.64
Information Gain = 0.64 - 0.64 = 0
Split 2: X < 3 vs X ≥ 3
Group 1 (X < 3): X = [1], Y = ["low"] → Gini = 0
Group 2 (X ≥ 3): X = [3, 5, 7, 9], Y = ["low", "medium", "high", "high"] → Gini = 0.625
Weighted Gini = (1/5) × 0 + (4/5) × 0.625 = 0.5
Information Gain = 0.64 - 0.5 = 0.14
Split 3: X < 5 vs X ≥ 5
Group 1 (X < 5): X = [1, 3], Y = ["low", "low"] → Gini = 0
Group 2 (X ≥ 5): X = [5, 7, 9], Y = ["medium", "high", "high"] → Gini = 0.444
Weighted Gini = (2/5) × 0 + (3/5) × 0.444 = 0.267
Information Gain = 0.64 - 0.267 = 0.373
Split 4: X < 7 vs X ≥ 7
Group 1 (X < 7): X = [1, 3, 5], Y = ["low", "low", "medium"] → Gini = 0.444
Group 2 (X ≥ 7): X = [7, 9], Y = ["high", "high"] → Gini = 0
Weighted Gini = (3/5) × 0.444 + (2/5) × 0 = 0.267
Information Gain = 0.64 - 0.267 = 0.373
Split 5: X < 9 vs X ≥ 9
Group 1 (X < 9): X = [1, 3, 5, 7], Y = ["low", "low", "medium", "high"] → Gini = 0.625
Group 2 (X ≥ 9): X = [9], Y = ["high"] → Gini = 0
Weighted Gini = (4/5) × 0.625 + (1/5) × 0 = 0.5
Information Gain = 0.64 - 0.5 = 0.14

Result

Splits 3 and 4 tie for highest Information Gain (0.373).
The algorithm would choose one of them (often the first encountered).
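The table above can be reproduced with the gini helper from the earlier sketch:

```python
x = [1, 3, 5, 7, 9]
labels = ["low", "low", "medium", "high", "high"]
parent = gini(labels)                                # 0.64
for t in x:
    left = [lab for xi, lab in zip(x, labels) if xi < t]
    right = [lab for xi, lab in zip(x, labels) if xi >= t]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    print(f"X < {t}: weighted Gini = {weighted:.3f}, gain = {parent - weighted:.3f}")
# Prints gains of 0.000, 0.140, 0.373, 0.373, 0.140 -- matching Splits 1-5 above
```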

Random Forest Classifier

Bootstrap aggregating with majority voting

Create N bootstrap samples (with replacement)
For each bootstrap sample (in parallel)
Original Gini Impurity from Y values in this sample
For each node split
Randomly select subset of features (√n_features)
For each selected feature
For each possible split (based on X values)
Split Y values into two groups
Calculate class counts and Gini for each group
Combine Gini scores with weighted average
Information Gain = Original Gini - Weighted Gini
Track split with highest gain for selected features
Select best split and continue building tree
Majority vote from all N trees
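A usage sketch with scikit-learn; n_estimators is the number of bootstrapped trees and max_features="sqrt" is the √n_features rule from the steps above. The toy data is only there to make the snippet runnable:

```python
from sklearn.ensemble import RandomForestClassifier

X = [[1], [3], [5], [7], [9]]
y = ["low", "low", "medium", "high", "high"]
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0).fit(X, y)
print(clf.predict([[4]]))        # each tree votes; the majority class is returned
```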

XGBoost Classifier

Gradient boosting for classification tasks

Initialize predictions (class probabilities)
For each tree (sequentially)
Calculate probability residuals (gradients)
Original impurity from residuals (not original Y)
For each feature
For each possible split (based on X values)
Split residuals into two groups
Calculate gain metric for each group
Combine scores with weighted average
Information Gain = Residual impurity - Weighted impurity
Track split with highest gain for this feature
Select best split across all features
Update probabilities += learning_rate × tree_prediction
Final prediction = class with highest probability
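A usage sketch, assuming the xgboost package is installed. Its scikit-learn wrapper expects integer class labels, so low/medium/high are encoded as 0/1/2 here:

```python
from xgboost import XGBClassifier

X = [[1], [3], [5], [7], [9]]
y = [0, 0, 1, 2, 2]              # low=0, medium=1, high=2
clf = XGBClassifier(n_estimators=50, learning_rate=0.1, max_depth=2).fit(X, y)
print(clf.predict_proba([[4]]))  # per-class probabilities; predict() returns the argmax class
```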
[Figure: Decision Tree Classifier Visualization]