Decision Tree Regressor

Recursive binary splitting based on MSE reduction

Compute the original MSE from all Y values
For each feature
For each possible split (based on X values)
Split Y values into two groups
Calculate mean and MSE for each group
Combine MSEs with weighted average
Information Gain = Original MSE - Weighted MSE
Track split with highest gain for this feature
Select best split across all features
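A minimal sketch of this split search (numpy; names like find_best_split are illustrative, not from any library):

import numpy as np

def mse(y):
    # MSE of a group around its own mean; an empty group is treated as 0 by convention
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def find_best_split(X, y):
    # X: (n_samples, n_features) array, y: (n_samples,) array
    n, root_mse = len(y), mse(y)
    best = {"gain": 0.0, "feature": None, "threshold": None}
    for f in range(X.shape[1]):                      # for each feature
        for t in np.unique(X[:, f]):                 # for each candidate split value
            left, right = y[X[:, f] < t], y[X[:, f] >= t]
            if len(left) == 0 or len(right) == 0:    # skip splits that leave a group empty
                continue
            weighted = len(left) / n * mse(left) + len(right) / n * mse(right)
            gain = root_mse - weighted               # "Information Gain" = MSE reduction
            if gain > best["gain"]:                  # track the best split seen so far
                best = {"gain": gain, "feature": f, "threshold": float(t)}
    return best

On the worked example further down (X = [1, 3, 5, 7, 9], Y = [10, 20, 30, 40, 50]) this returns feature 0, threshold 5, gain 150, i.e. Split 3 below.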

After finding the best split:

Create the split
Use the best split to divide the data into two child nodes
Recursively repeat the process
On each child node:
• For the left child, find the best split among all its data points
• For the right child, find the best split among all its data points
• Each child becomes a new decision node with its own best split
Continue until stopping criteria are met
• Maximum depth reached
• Minimum samples per leaf reached
• Information gain below threshold
• Node is "pure enough" (Y variance near zero)
Create leaf nodes
When splitting stops, the node becomes a leaf that predicts the mean of all Y values in that node
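A recursive sketch of this build loop, reusing mse() and find_best_split() from the sketch above (the stopping thresholds are illustrative defaults, and a simplified minimum-samples check stands in for a per-leaf minimum):

def build_tree(X, y, depth=0, max_depth=3, min_samples_split=2, min_gain=1e-7):
    best = find_best_split(X, y)
    # Stopping criteria: depth limit, too few samples, or gain below threshold
    if (depth >= max_depth or len(y) < min_samples_split
            or best["feature"] is None or best["gain"] < min_gain):
        return {"leaf": True, "value": float(y.mean())}   # leaf predicts the mean of its Y values
    go_left = X[:, best["feature"]] < best["threshold"]
    return {
        "leaf": False,
        "feature": best["feature"],
        "threshold": best["threshold"],
        "left": build_tree(X[go_left], y[go_left], depth + 1, max_depth, min_samples_split, min_gain),
        "right": build_tree(X[~go_left], y[~go_left], depth + 1, max_depth, min_samples_split, min_gain),
    }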

Prediction Example:

New data point: X₁ = 4, X₂ = 7. The point is routed left or right at each decision node according to its feature values until it reaches a leaf, and the prediction is that leaf's mean Y value.
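Routing a new point through a trained tree (a sketch for the dictionary nodes produced by build_tree above; a point such as X₁ = 4, X₂ = 7 assumes the tree was trained on two features):

import numpy as np

def predict_one(node, x):
    # Walk from the root: go left when the feature value is below the node's threshold
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
    return node["value"]   # mean Y of the training points that ended up in this leaf

# e.g. tree = build_tree(X, y); predict_one(tree, np.array([4.0, 7.0]))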

Decision Tree Split Analysis - Regressor

Original MSE

Feature X values: [1, 3, 5, 7, 9]

All Y values: [10, 20, 30, 40, 50]

Mean = 30, so MSE = ((10-30)² + (20-30)² + (30-30)² + (40-30)² + (50-30)²) / 5 = 1000 / 5 = 200
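A quick check of that arithmetic (numpy):

import numpy as np
Y = np.array([10, 20, 30, 40, 50])
print(np.mean((Y - Y.mean()) ** 2))   # 200.0 -- squared deviations from the mean (30), averaged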

Split 1: X < 1 vs X ≥ 1
Group 1 (X < 1): X = [], Y = [] (empty) → MSE = 0 (undefined for an empty group; taken as 0 by convention)
Group 2 (X ≥ 1): X = [1, 3, 5, 7, 9], Y = [10, 20, 30, 40, 50] → MSE = 200
Weighted MSE = (0/5) × 0 + (5/5) × 200 = 200
Information Gain = 200 - 200 = 0
Split 2: X < 3 vs X ≥ 3
Group 1 (X < 3): X = [1], Y = [10] → MSE = 0
Group 2 (X ≥ 3): X = [3, 5, 7, 9], Y = [20, 30, 40, 50] → MSE = 125
Weighted MSE = (1/5) × 0 + (4/5) × 125 = 100
Information Gain = 200 - 100 = 100
Split 3: X < 5 vs X ≥ 5
Group 1 (X < 5): X = [1, 3], Y = [10, 20] → MSE = 25
Group 2 (X ≥ 5): X = [5, 7, 9], Y = [30, 40, 50] → MSE = 66.67
Weighted MSE = (2/5) × 25 + (3/5) × 66.67 = 10 + 40 = 50
Information Gain = 200 - 50 = 150
Split 4: X < 7 vs X ≥ 7
Group 1 (X < 7): X = [1, 3, 5], Y = [10, 20, 30] → MSE = 66.67
Group 2 (X ≥ 7): X = [7, 9], Y = [40, 50] → MSE = 25
Weighted MSE = (3/5) × 66.67 + (2/5) × 25 = 40 + 10 = 50
Information Gain = 200 - 50 = 150
Split 5: X < 9 vs X ≥ 9
Group 1 (X < 9): X = [1, 3, 5, 7], Y = [10, 20, 30, 40] → MSE = 125
Group 2 (X ≥ 9): X = [9], Y = [50] → MSE = 0
Weighted MSE = (4/5) × 125 + (1/5) × 0 = 100
Information Gain = 200 - 100 = 100

Result

Splits 3 and 4 tie for highest Information Gain (150).
The algorithm would choose one of them (often the first encountered).
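All five candidate splits can be reproduced in a few lines (a sketch; thresholds are taken directly at each X value, matching the table above):

import numpy as np

X = np.array([1, 3, 5, 7, 9])
Y = np.array([10, 20, 30, 40, 50])
mse = lambda v: np.mean((v - v.mean()) ** 2) if len(v) else 0.0
root_mse = mse(Y)   # 200.0

for t in X:
    left, right = Y[X < t], Y[X >= t]
    weighted = len(left) / len(Y) * mse(left) + len(right) / len(Y) * mse(right)
    print(f"X < {t}: weighted MSE = {weighted:.2f}, gain = {root_mse - weighted:.2f}")
# gains: 0, 100, 150, 150, 100 -- splits 3 and 4 tie, and the first (X < 5) would be kept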

Random Forest Regressor

Bootstrap aggregating with feature randomness

Create N bootstrap samples (with replacement)
For each bootstrap sample (in parallel)
Compute the original MSE from the Y values in this sample
For each node split
Randomly select a subset of features (commonly √n_features)
For each selected feature
For each possible split (based on X values)
Split Y values into two groups
Calculate mean and MSE for each group
Combine MSEs with weighted average
Information Gain = Original MSE - Weighted MSE
Track split with highest gain for selected features
Select best split and continue building tree
Average predictions from all N trees
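A bagging sketch built on scikit-learn's DecisionTreeRegressor (the toy X/Y from the worked example above; max_features="sqrt" supplies the per-split feature subsampling, and the number of trees and random seed are arbitrary choices):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.array([[1.0], [3.0], [5.0], [7.0], [9.0]])
Y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

trees = []
for _ in range(100):                                    # N bootstrap samples
    idx = rng.integers(0, len(Y), size=len(Y))          # draw rows with replacement
    tree = DecisionTreeRegressor(max_features="sqrt")   # random feature subset at each split
    trees.append(tree.fit(X[idx], Y[idx]))

x_new = np.array([[4.0]])
print(np.mean([t.predict(x_new)[0] for t in trees]))    # average the N tree predictions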

XGBoost Regressor

Gradient boosting with sequential error correction

Initialize predictions (mean of Y values)
For each tree (sequentially)
Calculate residuals = Y - current_predictions
Compute the original MSE from the residuals (not the original Y)
For each feature
For each possible split (based on X values)
Split residuals into two groups
Calculate mean and MSE for each group
Combine MSEs with weighted average
Information Gain = Residual MSE - Weighted MSE
Track split with highest gain for this feature
Select best split across all features
Update predictions += learning_rate × tree_prediction
Final prediction = initial mean + sum of all (learning_rate × tree) contributions
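A plain squared-error boosting sketch in the spirit of this outline (scikit-learn trees as weak learners; real XGBoost additionally uses second-order gradients and a regularized gain, and the learning rate, depth, and round count here are arbitrary):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [3.0], [5.0], [7.0], [9.0]])
Y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

learning_rate, n_rounds = 0.3, 50
pred = np.full(len(Y), Y.mean())                # initialize predictions with the mean of Y
trees = []
for _ in range(n_rounds):
    residuals = Y - pred                        # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)   # fit a small tree to the residuals
    pred += learning_rate * tree.predict(X)     # shrink each tree's correction
    trees.append(tree)

def predict(x_new):
    # final prediction = initial mean + sum of learning_rate × tree contributions
    return Y.mean() + learning_rate * sum(t.predict(x_new) for t in trees)

print(predict(np.array([[4.0]])))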
Decision Tree Visualization