• Original MSE from all Y values
• For each feature:
  • For each possible split (based on X values):
    • Split Y values into two groups
    • Calculate mean and MSE for each group
    • Combine MSEs with weighted average
    • Information Gain = Original MSE - Weighted MSE
  • Track split with highest gain for this feature
• Select best split across all features (see the code sketch below)
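A minimal Python sketch of this split search, assuming the data arrives as a 2-D NumPy feature array X and a target vector y; the helper names mse and find_best_split are illustrative, not part of any particular library.

```python
import numpy as np

def mse(y):
    """Mean squared error of y around its own mean (0 for an empty group)."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) > 0 else 0.0

def find_best_split(X, y):
    """Return (feature_index, threshold, gain) for the split with the highest
    information gain over this node's data."""
    parent_mse = mse(y)
    best = None
    for feature in range(X.shape[1]):                 # for each feature
        for threshold in np.unique(X[:, feature]):    # for each candidate split
            mask = X[:, feature] < threshold
            left, right = y[mask], y[~mask]           # split Y into two groups
            # combine the two group MSEs with a weighted average
            weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
            gain = parent_mse - weighted              # information gain
            if best is None or gain > best[2]:
                best = (feature, threshold, gain)
    return best
```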
After finding the best split:
Create the split: use the best split to divide the data into two child nodes.
Recursively repeat the process on each child node:
• For the left child, find the best split among all its data points
• For the right child, find the best split among all its data points
• Each child becomes a new decision node with its own best split
Continue until a stopping criterion is met:
• Maximum depth reached
• Minimum samples per leaf reached
• Information gain below threshold
• Node becomes "pure enough"
Create leaf nodes: when splitting stops, the node becomes a leaf that predicts the mean of all Y values in that node (see the sketch below).
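A sketch of this recursive growth under the same assumptions, reusing the find_best_split and mse helpers from the sketch above; the dict-based node layout and the stopping parameters (max_depth, min_samples_leaf, min_gain) are illustrative choices.

```python
def build_tree(X, y, depth=0, max_depth=3, min_samples_leaf=1, min_gain=1e-9):
    """Grow a tree as nested dicts; leaves store the mean of their Y values."""
    split = find_best_split(X, y)
    # stopping criteria: depth limit, too few samples, or negligible gain
    if (depth >= max_depth or len(y) <= min_samples_leaf
            or split is None or split[2] <= min_gain):
        return {"leaf": True, "value": float(y.mean())}
    feature, threshold, _ = split
    mask = X[:, feature] < threshold
    child_args = dict(max_depth=max_depth, min_samples_leaf=min_samples_leaf, min_gain=min_gain)
    return {
        "leaf": False, "feature": feature, "threshold": threshold,
        "left": build_tree(X[mask], y[mask], depth + 1, **child_args),
        "right": build_tree(X[~mask], y[~mask], depth + 1, **child_args),
    }
```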
Prediction Example:
New data point: X₁=4, X₂=7
Tree navigation:
• Root: "X₁ < 5?" → YES (4 < 5) → go left
• Left node: "X₂ < 6?" → NO (7 ≥ 6) → go right
• Right node: LEAF → contains training Y values [15, 25, 35]
• Prediction: Y = 25 (their mean)
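A sketch of that traversal using the dict-based nodes from the build_tree sketch; predict_one is an illustrative name, and the toy tree below mirrors the walkthrough (the leaf values other than 25 are arbitrary placeholders).

```python
def predict_one(node, x):
    """Walk from the root to a leaf and return the leaf's stored mean."""
    while not node["leaf"]:
        # go left when the feature value is below the node's threshold
        node = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
    return node["value"]

toy_tree = {
    "leaf": False, "feature": 0, "threshold": 5,        # root: "X1 < 5?"
    "left": {
        "leaf": False, "feature": 1, "threshold": 6,    # left node: "X2 < 6?"
        "left": {"leaf": True, "value": 10.0},          # placeholder leaf
        "right": {"leaf": True, "value": 25.0},         # mean of [15, 25, 35]
    },
    "right": {"leaf": True, "value": 40.0},             # placeholder leaf
}
print(predict_one(toy_tree, [4, 7]))                    # 25.0
```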
Decision Tree Split Analysis - Regressor
Original MSE
Feature X values: [1, 3, 5, 7, 9]
All Y values: [10, 20, 30, 40, 50], mean = 30
MSE = ((10-30)² + (20-30)² + (30-30)² + (40-30)² + (50-30)²) / 5 = 200
Split 1: X < 1 vs X ≥ 1
Group 1 (X < 1): X = [], Y = [] (empty) → MSE undefined (taken as 0 by convention)
Group 2 (X ≥ 1): X = [1, 3, 5, 7, 9], Y = [10, 20, 30, 40, 50] → MSE = 200
Weighted MSE = (0/5) × 0 + (5/5) × 200 = 200
Information Gain = 200 - 200 = 0
Split 2: X < 3 vs X ≥ 3
Group 1 (X < 3): X = [1], Y = [10] → MSE = 0
Group 2 (X ≥ 3): X = [3, 5, 7, 9], Y = [20, 30, 40, 50] → MSE = 125
Weighted MSE = (1/5) × 0 + (4/5) × 125 = 100
Information Gain = 200 - 100 = 100
Split 3: X < 5 vs X ≥ 5
Group 1 (X < 5): X = [1, 3], Y = [10, 20] → MSE = 25
Group 2 (X ≥ 5): X = [5, 7, 9], Y = [30, 40, 50] → MSE = 66.67
Weighted MSE = (2/5) × 25 + (3/5) × 66.67 = 10 + 40 = 50
Information Gain = 200 - 50 = 150
Split 4: X < 7 vs X ≥ 7
Group 1 (X < 7): X = [1, 3, 5], Y = [10, 20, 30] → MSE = 66.67
Group 2 (X ≥ 7): X = [7, 9], Y = [40, 50] → MSE = 25
Weighted MSE = (3/5) × 66.67 + (2/5) × 25 = 40 + 10 = 50
Information Gain = 200 - 50 = 150
Split 5: X < 9 vs X ≥ 9
Group 1 (X < 9): X = [1, 3, 5, 7], Y = [10, 20, 30, 40] → MSE = 125
Group 2 (X ≥ 9): X = [9], Y = [50] → MSE = 0
Weighted MSE = (4/5) × 125 + (1/5) × 0 = 100
Information Gain = 200 - 100 = 100
Result
Splits 3 and 4 tie for highest Information Gain (150).
The algorithm would choose one of them (often the first encountered).
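The numbers above can be reproduced with a few lines of NumPy; this check loops over the same five thresholds and prints each weighted MSE and information gain.

```python
import numpy as np

X = np.array([1, 3, 5, 7, 9], dtype=float)
y = np.array([10, 20, 30, 40, 50], dtype=float)

def mse(v):
    return float(np.mean((v - v.mean()) ** 2)) if len(v) else 0.0

parent_mse = mse(y)                                   # 200.0
for t in X:
    left, right = y[X < t], y[X >= t]
    weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
    print(f"X < {t:.0f}: weighted MSE = {weighted:.2f}, gain = {parent_mse - weighted:.2f}")
# gains: 0, 100, 150, 150, 100 -- the splits at X < 5 and X < 7 tie, as above
```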
Random Forest Regressor:
• Create N bootstrap samples (with replacement)
• For each bootstrap sample (in parallel):
  • Original MSE from Y values in this sample
  • For each node split:
    • Randomly select subset of features (√n_features)
    • For each selected feature:
      • For each possible split (based on X values):
        • Split Y values into two groups
        • Calculate mean and MSE for each group
        • Combine MSEs with weighted average
        • Information Gain = Original MSE - Weighted MSE
    • Track split with highest gain among the selected features
    • Select best split and continue building the tree
• Average predictions from all N trees (see the code sketch below)
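A minimal sketch of this procedure, using scikit-learn's DecisionTreeRegressor with max_features="sqrt" as the base learner so each split considers a random feature subset; the function names and parameter values here are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_random_forest(X, y, n_trees=100, random_state=0):
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))         # bootstrap sample, with replacement
        tree = DecisionTreeRegressor(max_features="sqrt")  # random feature subset per split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_random_forest(trees, X):
    # average the predictions from all N trees
    return np.mean([tree.predict(X) for tree in trees], axis=0)
```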
Gradient Boosting Regressor:
• Initialize predictions (mean of Y values)
• For each tree (sequentially):
  • Calculate residuals = Y - current_predictions
  • Original MSE from residuals (not original Y)
  • For each feature:
    • For each possible split (based on X values):
      • Split residuals into two groups
      • Calculate mean and MSE for each group
      • Combine MSEs with weighted average
      • Information Gain = Residual MSE - Weighted MSE
    • Track split with highest gain for this feature
  • Select best split across all features
  • Update predictions += learning_rate × tree_prediction
• Final prediction = initial mean + sum of all scaled tree contributions (see the code sketch below)
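A minimal sketch of the sequential procedure, again with scikit-learn decision trees as base learners; n_trees, learning_rate, and max_depth are illustrative values, and the function names are not from any library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    base = float(np.mean(y))                  # initialize predictions with the mean of Y
    predictions = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - predictions           # residuals = Y - current predictions
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                # each tree is fit to the residuals, not to Y
        predictions = predictions + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict_gradient_boosting(base, trees, X, learning_rate=0.1):
    # final prediction = initial mean + sum of scaled tree contributions
    return base + learning_rate * np.sum([tree.predict(X) for tree in trees], axis=0)
```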