
Decision Tree vs. Random Forest Regression: A Complete Guide with Python Examples

When working with regression problems in machine learning, choosing the right algorithm is critical for accuracy and performance. Two of the most popular approaches are Decision Tree Regression and Random Forest Regression. This tech concept explains how these models work, how they differ, and when to use each, with practical Python examples to help you implement them effectively. With ~20 years of experience in tech leadership roles, I’ve helped businesses leverage such innovations to drive scalability and success.

What is Decision Tree Regression?

Understanding Decision Tree Regression

Decision Tree Regression is a supervised learning algorithm that predicts continuous values by recursively splitting the dataset into regions. Each split chooses the feature and threshold that most reduce the variance of the target variable within the resulting regions.

How Decision Tree Regression Works

  1. The dataset is split into different branches using if-else conditions based on feature values.
  2. Each branch leads to a leaf node that represents a predicted value.
  3. The final output is the average of values in the leaf node.

Example: Predicting House Prices Using Decision Tree Regression

Let’s consider a dataset where house prices depend on:

  • Size of the house (sq ft)
  • Number of bedrooms

We’ll use Decision Tree Regression to predict house prices based on these features.

Python Code for Decision Tree Regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Sample dataset: House Size (sq ft), Number of Bedrooms, Price ($)
data = np.array([
    [1000, 2, 200000],
    [1500, 3, 250000],
    [1800, 3, 280000],
    [2000, 4, 320000],
    [2300, 4, 350000],
    [2500, 4, 400000],
    [2700, 5, 450000],
    [3000, 5, 500000],
    [3500, 6, 600000],
    [4000, 6, 700000]
])

# Split features (X) and target variable (y)
X = data[:, :2]  # First two columns (Size, Bedrooms)
y = data[:, 2]   # Last column (Price)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
dt_regressor = DecisionTreeRegressor(max_depth=3)
dt_regressor.fit(X_train, y_train)

# Predict on test data
y_pred = dt_regressor.predict(X_test)

# Evaluate performance
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Visualization
plt.scatter(X[:, 0], y, color="blue", label="Actual Prices")
plt.scatter(X_test[:, 0], y_pred, color="red", label="Predicted Prices")
plt.xlabel("House Size (sq ft)")
plt.ylabel("Price ($)")
plt.legend()
plt.title("Decision Tree Regression - House Price Prediction")
plt.show()
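
Because a decision tree is just a set of learned if-else rules, you can print the splits directly and see exactly how predictions are made. Here is a minimal continuation of the script above using scikit-learn's export_text; the feature names are the ones assumed in our toy dataset:

from sklearn.tree import export_text

# Print the learned if-else splits of the trained tree
rules = export_text(dt_regressor, feature_names=["Size (sq ft)", "Bedrooms"])
print(rules)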

Advantages of Decision Tree Regression

  • Easy to interpret and visualize
  • Works well with small datasets
  • Can model non-linear relationships

Disadvantages

  • Prone to overfitting when trees grow deep (see the sketch below)
  • Sensitive to small changes in the data
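
To make the overfitting disadvantage concrete, here is a small, self-contained sketch on synthetic noisy data (not the house dataset above) comparing an unrestricted tree with a depth-limited one. Exact numbers will vary, but the unrestricted tree typically fits the training set almost perfectly while doing worse on the test set:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic noisy data: y = sin(x) + noise
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for depth in [None, 3]:  # None = grow the tree fully
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    print(f"max_depth={depth}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")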

What is Random Forest Regression?

Understanding Random Forest Regression

Random Forest Regression is an ensemble learning technique that improves accuracy by combining multiple decision trees. Instead of relying on one tree, it trains many trees on random bootstrap samples of the data, considering a random subset of features at each split, and averages their outputs.

How Random Forest Regression Works

  1. Multiple bootstrap samples are drawn from the dataset (rows sampled with replacement).
  2. A Decision Tree is trained on each bootstrap sample.
  3. The final prediction is the average of all tree predictions (see the hand-rolled sketch below).
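
The core idea, bagging (bootstrap aggregating), is easy to sketch by hand. The self-contained toy example below, with made-up one-feature data, trains several decision trees on bootstrap samples and averages their predictions; scikit-learn's RandomForestRegressor does this for you, plus random feature selection at each split:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (50, 1))           # toy feature
y = 3 * X.ravel() + rng.normal(0, 2, 50)  # toy target with noise

n_trees = 10
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw rows with replacement
    idx = rng.randint(0, len(X), len(X))
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Final prediction = average of all tree predictions
X_new = np.array([[4.0], [7.5]])
preds = np.mean([t.predict(X_new) for t in trees], axis=0)
print(preds)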

Example: Predicting Car Prices Using Random Forest Regression

We will predict car prices based on:

  • Year of manufacture
  • Mileage (in km)
  • Engine capacity (in liters)

Python Code for Random Forest Regression

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt

# Sample dataset: Year, Mileage (km), Engine Capacity (L), Price ($)
data = np.array([
    [2015, 60000, 1.5, 12000],
    [2016, 50000, 1.6, 14000],
    [2017, 40000, 1.8, 16000],
    [2018, 30000, 2.0, 18000],
    [2019, 20000, 2.2, 22000],
    [2020, 15000, 2.5, 25000],
    [2021, 10000, 3.0, 30000],
    [2022, 5000, 3.5, 35000],
    [2023, 2000, 4.0, 40000],
    [2024, 1000, 4.5, 45000]
])

# Split features (X) and target variable (y)
X = data[:, :3]  # First three columns (Year, Mileage, Engine Capacity)
y = data[:, 3]   # Last column (Price)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

# Predict on test data
y_pred = rf_regressor.predict(X_test)

# Evaluate performance
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Visualization
plt.scatter(X[:, 0], y, color="blue", label="Actual Prices")
plt.scatter(X_test[:, 0], y_pred, color="red", label="Predicted Prices")
plt.xlabel("Year of Manufacture")
plt.ylabel("Price ($)")
plt.legend()
plt.title("Random Forest Regression - Car Price Prediction")
plt.show()
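
One way to recover some interpretability from a forest is to inspect feature importances. Continuing from the script above (the feature names are the ones assumed in our toy dataset):

# Average impurity-based importance of each feature across all trees
for name, importance in zip(["Year", "Mileage", "Engine Capacity"],
                            rf_regressor.feature_importances_):
    print(f"{name}: {importance:.2f}")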

Advantages of Random Forest Regression

  • Typically more accurate than a single Decision Tree
  • Less prone to overfitting, since averaging smooths out individual trees’ errors
  • Works well with large datasets

Disadvantages

  • Slower to train and predict than a single Decision Tree
  • Harder to interpret, since predictions come from many trees (feature importances help here)

Decision Tree vs. Random Forest Regression: Key Differences

Feature          | Decision Tree Regression   | Random Forest Regression
Algorithm        | Single decision tree       | Multiple decision trees (ensemble)
Overfitting      | High (if the tree is deep) | Low (averaging reduces overfitting)
Accuracy         | Moderate                   | Higher
Interpretability | Easy to interpret          | Harder (many trees)
Performance      | Faster but less accurate   | Slower but more accurate
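
To check the accuracy claims above on your own data, a quick head-to-head with cross-validation is a good habit. Here is a minimal sketch on synthetic data from make_regression; swap in your own X and y:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

# Synthetic regression problem (replace with your own dataset)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)

for name, model in [("Decision Tree", DecisionTreeRegressor(random_state=42)),
                    ("Random Forest", RandomForestRegressor(n_estimators=100, random_state=42))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name}: mean MSE = {-scores.mean():.1f}")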

My Tech Advice: Use Decision Tree Regression when interpretability and speed are important. Use Random Forest Regression when accuracy and robustness are priorities. Both models are powerful tools for regression tasks. Try them on your dataset and choose the best fit for your problem!

#AskDushyant
Note: The examples and pseudo code are for illustration only. You must modify and experiment with the concepts to meet your specific needs.
#TechConcept #TechAdvice #AI #ML #Python #ModelTuning
