• Original MSE from all Y values
• For each feature:
  • For each possible split (based on X values):
    • Split Y values into two groups
    • Calculate mean and MSE for each group
    • Combine MSEs with weighted average
    • Information Gain = Original MSE - Weighted MSE
  • Track split with highest gain for this feature
• Select best split across all features (see the code sketch below)
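A minimal Python sketch of this split search, assuming the data arrives as a 2-D NumPy feature array X and a target vector y; the helper names mse and find_best_split are illustrative, not part of any particular library.

```python
import numpy as np

def mse(y):
    """Mean squared error of y around its own mean (0 for an empty group)."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) > 0 else 0.0

def find_best_split(X, y):
    """Return (feature_index, threshold, gain) for the split with the highest
    information gain over this node's data."""
    parent_mse = mse(y)
    best = None
    for feature in range(X.shape[1]):                 # for each feature
        for threshold in np.unique(X[:, feature]):    # for each candidate split
            mask = X[:, feature] < threshold
            left, right = y[mask], y[~mask]           # split Y into two groups
            # combine the two group MSEs with a weighted average
            weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
            gain = parent_mse - weighted              # information gain
            if best is None or gain > best[2]:
                best = (feature, threshold, gain)
    return best
```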
After finding the best split:
Create the split: use the best split to divide the data into two child nodes.
Recursively repeat the process on each child node:
• For the left child, find the best split among all its data points
• For the right child, find the best split among all its data points
• Each child becomes a new decision node with its own best split
Continue until a stopping criterion is met:
• Maximum depth reached
• Minimum samples per leaf reached
• Information gain below threshold
• Node becomes "pure enough"
Create leaf nodes: when splitting stops, the node becomes a leaf that predicts the mean of all Y values in that node (see the sketch below).
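A sketch of this recursive growth under the same assumptions, reusing the find_best_split and mse helpers from the sketch above; the dict-based node layout and the stopping parameters (max_depth, min_samples_leaf, min_gain) are illustrative choices.

```python
def build_tree(X, y, depth=0, max_depth=3, min_samples_leaf=1, min_gain=1e-9):
    """Grow a tree as nested dicts; leaves store the mean of their Y values."""
    split = find_best_split(X, y)
    # stopping criteria: depth limit, too few samples, or negligible gain
    if (depth >= max_depth or len(y) <= min_samples_leaf
            or split is None or split[2] <= min_gain):
        return {"leaf": True, "value": float(y.mean())}
    feature, threshold, _ = split
    mask = X[:, feature] < threshold
    child_args = dict(max_depth=max_depth, min_samples_leaf=min_samples_leaf, min_gain=min_gain)
    return {
        "leaf": False, "feature": feature, "threshold": threshold,
        "left": build_tree(X[mask], y[mask], depth + 1, **child_args),
        "right": build_tree(X[~mask], y[~mask], depth + 1, **child_args),
    }
```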
Prediction Example:
New data point: X₁=4, X₂=7
Tree navigation:
• Root: "X₁ < 5?" → YES (4 < 5) → go left
• Left node: "X₂ < 6?" → NO (7 ≥ 6) → go right
• Right node: LEAF → contains training Y values [15, 25, 35]
• Prediction: Y = 25 (their mean)
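A sketch of that traversal using the dict-based nodes from the build_tree sketch; predict_one is an illustrative name, and the toy tree below mirrors the walkthrough (the leaf values other than 25 are arbitrary placeholders).

```python
def predict_one(node, x):
    """Walk from the root to a leaf and return the leaf's stored mean."""
    while not node["leaf"]:
        # go left when the feature value is below the node's threshold
        node = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
    return node["value"]

toy_tree = {
    "leaf": False, "feature": 0, "threshold": 5,        # root: "X1 < 5?"
    "left": {
        "leaf": False, "feature": 1, "threshold": 6,    # left node: "X2 < 6?"
        "left": {"leaf": True, "value": 10.0},          # placeholder leaf
        "right": {"leaf": True, "value": 25.0},         # mean of [15, 25, 35]
    },
    "right": {"leaf": True, "value": 40.0},             # placeholder leaf
}
print(predict_one(toy_tree, [4, 7]))                    # 25.0
```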
Decision Tree Split Analysis - Regressor
Original MSE
Feature X values: [1, 3, 5, 7, 9]
All Y values: [10, 20, 30, 40, 50], mean = 30
MSE = ((10-30)² + (20-30)² + (30-30)² + (40-30)² + (50-30)²) / 5 = 200
Split 1: X < 1 vs X ≥ 1
Group 1 (X < 1): X = [], Y = [] (empty) → MSE undefined (taken as 0 by convention)
Group 2 (X ≥ 1): X = [1, 3, 5, 7, 9], Y = [10, 20, 30, 40, 50] → MSE = 200
Weighted MSE = (0/5) × 0 + (5/5) × 200 = 200
Information Gain = 200 - 200 = 0
Split 2: X < 3 vs X ≥ 3
Group 1 (X < 3): X = [1], Y = [10] → MSE = 0
Group 2 (X ≥ 3): X = [3, 5, 7, 9], Y = [20, 30, 40, 50] → MSE = 125
Weighted MSE = (1/5) × 0 + (4/5) × 125 = 100
Information Gain = 200 - 100 = 100
Split 3: X < 5 vs X ≥ 5
Group 1 (X < 5): X = [1, 3], Y = [10, 20] → MSE = 25
Group 2 (X ≥ 5): X = [5, 7, 9], Y = [30, 40, 50] → MSE = 66.67
Weighted MSE = (2/5) × 25 + (3/5) × 66.67 = 10 + 40 = 50
Information Gain = 200 - 50 = 150
Split 4: X < 7 vs X ≥ 7
Group 1 (X < 7): X = [1, 3, 5], Y = [10, 20, 30] → MSE = 66.67
Group 2 (X ≥ 7): X = [7, 9], Y = [40, 50] → MSE = 25
Weighted MSE = (3/5) × 66.67 + (2/5) × 25 = 40 + 10 = 50
Information Gain = 200 - 50 = 150
Split 5: X < 9 vs X ≥ 9
Group 1 (X < 9): X = [1, 3, 5, 7], Y = [10, 20, 30, 40] → MSE = 125
Group 2 (X ≥ 9): X = [9], Y = [50] → MSE = 0
Weighted MSE = (4/5) × 125 + (1/5) × 0 = 100
Information Gain = 200 - 100 = 100
Result
Splits 3 and 4 tie for highest Information Gain (150).
The algorithm would choose one of them (often the first encountered).
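The numbers above can be reproduced with a few lines of NumPy; this check loops over the same five thresholds and prints each weighted MSE and information gain.

```python
import numpy as np

X = np.array([1, 3, 5, 7, 9], dtype=float)
y = np.array([10, 20, 30, 40, 50], dtype=float)

def mse(v):
    return float(np.mean((v - v.mean()) ** 2)) if len(v) else 0.0

parent_mse = mse(y)                                   # 200.0
for t in X:
    left, right = y[X < t], y[X >= t]
    weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
    print(f"X < {t:.0f}: weighted MSE = {weighted:.2f}, gain = {parent_mse - weighted:.2f}")
# gains: 0, 100, 150, 150, 100 -- the splits at X < 5 and X < 7 tie, as above
```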
Random Forest Regressor:
• Create N bootstrap samples (with replacement)
• For each bootstrap sample (in parallel):
  • Original MSE from Y values in this sample
  • For each node split:
    • Randomly select subset of features (√n_features)
    • For each selected feature:
      • For each possible split (based on X values):
        • Split Y values into two groups
        • Calculate mean and MSE for each group
        • Combine MSEs with weighted average
        • Information Gain = Original MSE - Weighted MSE
    • Track split with highest gain among the selected features
    • Select best split and continue building the tree
• Average predictions from all N trees (see the code sketch below)
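A minimal sketch of this procedure, using scikit-learn's DecisionTreeRegressor with max_features="sqrt" as the base learner so each split considers a random feature subset; the function names and parameter values here are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_random_forest(X, y, n_trees=100, random_state=0):
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))         # bootstrap sample, with replacement
        tree = DecisionTreeRegressor(max_features="sqrt")  # random feature subset per split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_random_forest(trees, X):
    # average the predictions from all N trees
    return np.mean([tree.predict(X) for tree in trees], axis=0)
```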
Gradient Boosting Regressor:
• Initialize predictions (mean of Y values)
• For each tree (sequentially):
  • Calculate residuals = Y - current_predictions
  • Original MSE from residuals (not original Y)
  • For each feature:
    • For each possible split (based on X values):
      • Split residuals into two groups
      • Calculate mean and MSE for each group
      • Combine MSEs with weighted average
      • Information Gain = Residual MSE - Weighted MSE
    • Track split with highest gain for this feature
  • Select best split across all features
  • Update predictions += learning_rate × tree_prediction
• Final prediction = initial mean + sum of all scaled tree contributions (see the code sketch below)
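A minimal sketch of the sequential procedure, again with scikit-learn decision trees as base learners; n_trees, learning_rate, and max_depth are illustrative values, and the function names are not from any library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    base = float(np.mean(y))                  # initialize predictions with the mean of Y
    predictions = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - predictions           # residuals = Y - current predictions
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                # each tree is fit to the residuals, not to Y
        predictions = predictions + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict_gradient_boosting(base, trees, X, learning_rate=0.1):
    # final prediction = initial mean + sum of scaled tree contributions
    return base + learning_rate * np.sum([tree.predict(X) for tree in trees], axis=0)
```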