Overfitting vs Underfitting in Machine Learning: Common Mistakes and Python Solutions (2026)

Figure: Two graphs contrasting overfitting (an overly complex curve) and underfitting (a straight line), with a balanced “good fit” model in the center.

Introduction

Machine learning models are designed to learn patterns from data and make accurate predictions on new, unseen information. However, one of the most common challenges beginners encounter is balancing model complexity and performance. This is where the concepts of overfitting vs underfitting in machine learning become critically important.

Think about the Goldilocks principle: not too simple, not too complex — but just right. Machine learning models follow a similar idea. If a model is too simple, it fails to learn important patterns from the dataset. If it is too complex, it memorizes the training data instead of learning the underlying patterns.

These two problems are known as:

  • Underfitting – when a model is too simple to capture the structure of the data.
  • Overfitting – when a model learns the noise in the data instead of the real patterns.

Understanding the difference between overfitting vs underfitting in machine learning is essential if you want to build models that perform well on real-world data.

Many beginner machine learning projects fail not because the algorithms are wrong, but because these two issues are ignored.

In this beginner-friendly Python guide, you will learn:

  • What overfitting and underfitting mean in machine learning
  • Why model generalization matters
  • How to detect these problems using training and testing performance
  • How to demonstrate overfitting and underfitting using Python and scikit-learn
  • Practical solutions to prevent these problems

By the end of this guide, you will clearly understand how to recognize and fix overfitting and underfitting in machine learning models using Python.

If you are new to machine learning pipelines, you may want to first understand the Machine Learning Workflow in Python before diving deeper into model performance issues.


Prerequisites

Before continuing, you should have basic knowledge of:

  • Python programming
  • NumPy and Pandas
  • Basic machine learning concepts
  • The scikit-learn library



Understanding the Basics

Before we dive into Python implementations, it is important to understand the fundamental concepts behind overfitting vs underfitting in machine learning.


What is Model Generalization?

In machine learning, the ultimate goal is not simply to perform well on training data. The goal is to build models that generalize: models that make accurate predictions on data they have never seen.

To measure generalization, the dataset is typically divided into two parts:

  • Training dataset – used to train the model
  • Testing dataset – used to evaluate the model’s performance on unseen data

This process is known as train-test splitting. For a detailed explanation, see our guide on Train Test Split in Python for Machine Learning.

A well-performing model should have:

  • Low training error
  • Low testing error

However, when models become too simple or too complex, problems appear.

These problems are known as:

  • Underfitting
  • Overfitting

The Bias-Variance Tradeoff

To fully understand overfitting and underfitting, we must understand the bias-variance tradeoff: the relationship between model complexity, bias, and variance.

This concept explains why machine learning models sometimes fail to generalize.

Bias

Bias represents the error introduced by overly simple assumptions in a model.

Models with high bias:

  • Are too simple
  • Fail to capture important patterns
  • Lead to underfitting

Variance

Variance represents the model’s sensitivity to small fluctuations in the training data.

Models with high variance:

  • Learn the training data too closely
  • Capture noise instead of patterns
  • Lead to overfitting

The Tradeoff

The challenge in machine learning is finding the right balance:

Model Type   | Bias     | Variance
Underfitting | High     | Low
Good Fit     | Balanced | Balanced
Overfitting  | Low      | High

The best machine learning models achieve the right balance between bias and variance.
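The table above can be checked empirically. The following is a minimal sketch, not a definitive procedure: it refits a polynomial model on many freshly sampled training sets and measures how far the average prediction is from the truth (squared bias) and how much predictions move between training sets (variance). The function name `estimate_bias_variance`, the sine target, the sample sizes, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

rng = np.random.default_rng(0)

# Fixed evaluation points and the noise-free target the models try to learn
x_grid = np.linspace(0, 10, 50).reshape(-1, 1)
true_y = np.sin(x_grid).ravel()


def estimate_bias_variance(degree, n_repeats=200, n_samples=20, noise=0.3):
    """Refit a polynomial model on many fresh training sets (hypothetical helper)."""
    preds = np.empty((n_repeats, len(x_grid)))
    for i in range(n_repeats):
        # Draw a new noisy training set each time
        X = rng.uniform(0, 10, size=(n_samples, 1))
        y = np.sin(X).ravel() + rng.normal(0, noise, n_samples)
        model = make_pipeline(
            MinMaxScaler(feature_range=(-1, 1)),  # keeps high powers numerically tame
            PolynomialFeatures(degree, include_bias=False),
            LinearRegression(),
        )
        preds[i] = model.fit(X, y).predict(x_grid)

    # Squared bias: distance between the average prediction and the truth
    bias_sq = np.mean((preds.mean(axis=0) - true_y) ** 2)
    # Variance: how much predictions change from one training set to the next
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance


for degree in (1, 12):
    bias_sq, variance = estimate_bias_variance(degree)
    print(f"degree={degree:2d}  bias^2={bias_sq:.3f}  variance={variance:.3f}")
```

The straight line (degree 1) should show a large squared bias and a small variance, while the degree-12 model should show a much larger variance, which matches the underfitting and overfitting rows of the table.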


Underfitting in Machine Learning

Let’s first look at underfitting.


What is Underfitting?

Underfitting occurs when a machine learning model is too simple to learn the underlying structure of the data.

This means the model fails to capture meaningful patterns.

As a result, the model performs poorly on both training data and testing data.

In simple terms:

The model has not learned enough from the data.


Characteristics of Underfitting

Underfitting models usually have the following characteristics:

  • High training error
  • High testing error
  • Poor predictive performance
  • High bias
  • Low variance

This happens when the model is too simple for the complexity of the dataset.


Example of Underfitting

Imagine you are trying to predict house prices using only one feature, such as the number of rooms.

But the real factors affecting house prices include:

  • Location
  • Size
  • Neighborhood quality
  • Distance from city center
  • Market trends

Using a very simple model in this case would result in underfitting because the model ignores important information.


Visual Analogy

A common way to understand underfitting is by imagining a straight line trying to fit curved data.

If the dataset follows a curved pattern but the model only fits a straight line, the model will miss important trends.

This leads to high prediction errors.


Overfitting in Machine Learning

Now let’s examine the opposite problem: overfitting.


What is Overfitting?

Overfitting occurs when a machine learning model learns the training data too well, including noise and random fluctuations.

Instead of learning general patterns, the model memorizes the training dataset.

As a result:

  • Training performance becomes extremely good
  • Testing performance becomes poor

This means the model cannot generalize to new data.


Characteristics of Overfitting

Overfitting models usually show the following behavior:

  • Very low training error
  • High testing error
  • High variance
  • Poor generalization
  • Model complexity is too high

Example of Overfitting

Imagine a student preparing for an exam by memorizing every practice question.

If the real exam questions are different, the student may struggle because they memorized answers instead of understanding the concepts.

Machine learning models behave in the same way.

When models memorize training data, they fail on new data.


Visual Analogy

In a graph, overfitting appears as a very complex curve passing through almost every training point.

While the curve perfectly matches the training dataset, it becomes extremely sensitive to noise.

This results in poor predictions on unseen data.


Comparing Overfitting vs Underfitting in Machine Learning

Understanding the differences between these two problems helps you diagnose machine learning models quickly.

Problem      | Training Accuracy | Testing Accuracy | Cause
Underfitting | Low               | Low              | Model too simple
Good Fit     | High              | High             | Balanced complexity
Overfitting  | Very High         | Low              | Model too complex

The goal of machine learning is always to achieve the good fit scenario.

Figure: Underfitting, good fit, and overfitting model examples shown as curves over the same data.

Python Setup and Dataset Preparation

Now that we understand the theory behind overfitting vs underfitting in machine learning, let’s demonstrate these problems using Python.

The examples in this tutorial use scikit-learn to implement machine learning models in Python; the scikit-learn official documentation describes each class in detail.

We will create a simple dataset and train different models to observe how underfitting and overfitting occur.


Required Libraries

First, import the necessary Python libraries.

# Numerical arrays and plotting
import numpy as np
import matplotlib.pyplot as plt

# Data splitting, polynomial features, the regression model, and the error metric
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

These libraries allow us to:

  • Generate synthetic datasets
  • Train machine learning models
  • Visualize results
  • Evaluate model performance

Creating a Synthetic Dataset

To clearly demonstrate overfitting vs underfitting in machine learning, we will create a nonlinear dataset.

# Generate nonlinear data
np.random.seed(42)

X = np.sort(np.random.rand(100,1) * 10, axis=0)
y = np.sin(X).ravel() + np.random.normal(0,0.1,X.shape[0])

This dataset follows a sine curve pattern, which is useful for demonstrating model complexity.


Train-Test Split

Next, we divide the dataset into training and testing sets.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

Here:

  • 70% of the data is used for training
  • 30% is used for testing

This allows us to evaluate how well the model generalizes to unseen data.


Demonstrating Underfitting in Python

Let’s begin by creating a very simple model.

This will demonstrate underfitting.


Linear Model (Degree 1)

model_under = LinearRegression()

model_under.fit(X_train, y_train)

y_train_pred = model_under.predict(X_train)
y_test_pred = model_under.predict(X_test)

train_error = mean_squared_error(y_train, y_train_pred)
test_error = mean_squared_error(y_test, y_test_pred)

print("Training Error:", train_error)
print("Testing Error:", test_error)

Because the data follows a nonlinear sine pattern, a simple linear regression model will struggle to capture the relationship.

As a result:

  • Training error will be high
  • Testing error will also be high

This is a classic example of underfitting in machine learning.


Visualizing the Underfitting Model

You can visualize the results using Matplotlib.

plt.scatter(X, y)
plt.plot(X, model_under.predict(X), color='red')
plt.title("Underfitting Example")
plt.show()

The straight red line fails to follow the curve of the dataset.

This confirms that the model is too simple to learn the pattern.


Demonstrating a Good Fit Model

After seeing an underfitting model, the next step is to create a model that properly captures the structure of the dataset.

The dataset we generated earlier follows a sine curve pattern. A linear model cannot capture that curve, but a polynomial regression model can.

Polynomial regression increases model complexity by adding polynomial features (x², x³, and so on), which lets a linear algorithm capture nonlinear relationships. This is explained in detail in polynomial regression in machine learning.


Polynomial Model (Degree 4)

Let’s build a polynomial regression model that is complex enough to capture the curve but not too complex to cause overfitting.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Create polynomial features
poly_features = PolynomialFeatures(degree=4, include_bias=False)

X_poly_train = poly_features.fit_transform(X_train)
X_poly_test = poly_features.transform(X_test)

# Train model
model_good = LinearRegression()
model_good.fit(X_poly_train, y_train)

# Predictions
y_train_pred = model_good.predict(X_poly_train)
y_test_pred = model_good.predict(X_poly_test)

# Calculate errors
train_error = mean_squared_error(y_train, y_train_pred)
test_error = mean_squared_error(y_test, y_test_pred)

print("Training Error:", train_error)
print("Testing Error:", test_error)

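In this case, the model is flexible enough to follow the overall shape of the sine curve without chasing individual noisy points.

You should see both errors drop well below those of the linear model:

  • Low training error
  • Low testing error that is close to the training error

This indicates a well-balanced model. If the plotted curve still misses some of the peaks, a slightly higher degree (for example 5 or 6) is worth trying.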


Visualization of a Good Fit

To understand this better, let’s visualize the model.

plt.scatter(X, y)

X_range = np.linspace(0,10,100).reshape(-1,1)
X_range_poly = poly_features.transform(X_range)

plt.plot(X_range, model_good.predict(X_range_poly), color='green')

plt.title("Good Fit Model")
plt.show()

The curve now follows the structure of the dataset smoothly.

This is the ideal scenario in machine learning: the model captures the underlying pattern but ignores random noise.


Demonstrating Overfitting in Python

Now let’s move to the opposite extreme.

Instead of making the model slightly more complex, we will make it extremely complex.

This will produce overfitting.


High-Degree Polynomial Model

We will create a polynomial model with degree 15.

poly_features = PolynomialFeatures(degree=15, include_bias=False)

X_poly_train = poly_features.fit_transform(X_train)
X_poly_test = poly_features.transform(X_test)

model_over = LinearRegression()
model_over.fit(X_poly_train, y_train)

y_train_pred = model_over.predict(X_poly_train)
y_test_pred = model_over.predict(X_poly_test)

train_error = mean_squared_error(y_train, y_train_pred)
test_error = mean_squared_error(y_test, y_test_pred)

print("Training Error:", train_error)
print("Testing Error:", test_error)

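The pattern to look for is:

  • Extremely low training error
  • Testing error that is clearly higher than the training error

When that gap appears, the model has started to memorize the training dataset. How large the gap is depends on the data: with 70 densely spaced training points and little noise it may be modest, and it tends to grow if you train on fewer points or add more noise.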


Visualization of Overfitting

plt.scatter(X, y)

X_range = np.linspace(0,10,100).reshape(-1,1)
X_range_poly = poly_features.transform(X_range)

plt.plot(X_range, model_over.predict(X_range_poly), color='purple')

plt.title("Overfitting Example")
plt.show()

Instead of a smooth curve, a heavily overfit model produces a wiggly line, often with sharp swings near the edges of the data range.

The line bends to follow individual training points, noise included.

While this looks impressive on training data, it performs poorly on new data.

This is the core problem with overfitting.
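To see the whole range from underfitting to overfitting in one run, the three experiments above can be folded into a single loop over polynomial degrees. This is a rough sketch under a few assumptions: it rebuilds the same synthetic sine data, standardizes the input before expanding it (a change from the examples above, made only to keep high powers numerically stable), and the list of degrees is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same synthetic sine data as in the tutorial
np.random.seed(42)
X = np.sort(np.random.rand(100, 1) * 10, axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

results = {}
for degree in (1, 2, 4, 6, 10, 15):
    # Scaling first is an assumption of this sketch, not part of the original examples
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)

    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

Training error can only go down as the degree grows, so the column to watch is the testing error: the degree where it stops improving, or starts rising, marks the point where extra complexity no longer helps.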


Understanding Learning Curves

Learning curves are one of the most powerful tools for diagnosing machine learning models.

They show how model performance changes as the training dataset size increases.

Learning curves help identify:

  • Underfitting
  • Overfitting
  • Balanced models

Creating Learning Curves in Python

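We can use scikit-learn’s learning_curve function. Because learning_curve refits a fresh copy of the estimator on the raw X, the polynomial step has to be part of the estimator itself, so we wrap both steps in a pipeline. The data is sorted by X, so we also shuffle before splitting.

from sklearn.model_selection import KFold, learning_curve
from sklearn.pipeline import make_pipeline

# Degree-4 polynomial regression as a single estimator
poly_model = make_pipeline(
    PolynomialFeatures(degree=4, include_bias=False),
    LinearRegression()
)

train_sizes, train_scores, validation_scores = learning_curve(
    poly_model,
    X,
    y,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring='neg_mean_squared_error',
    train_sizes=np.linspace(0.1,1.0,10),
    shuffle=True,       # draw each training subset at random
    random_state=42
)

This function evaluates model performance on different dataset sizes.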


Plotting Learning Curves

train_scores_mean = -train_scores.mean(axis=1)
validation_scores_mean = -validation_scores.mean(axis=1)

plt.plot(train_sizes, train_scores_mean, label="Training Error")
plt.plot(train_sizes, validation_scores_mean, label="Validation Error")

plt.xlabel("Training Size")
plt.ylabel("Error")
plt.title("Learning Curve")
plt.legend()

plt.show()

This plot shows at a glance whether a model is underfitting, overfitting, or well balanced.


Interpreting Learning Curves

Let’s look at how learning curves help diagnose problems.


Learning Curve for Underfitting

In an underfitting model:

  • Training error is high
  • Validation error is also high
  • Both curves converge at a high error level

This means the model is too simple to learn the data patterns.


Learning Curve for Overfitting

In an overfitting model:

  • Training error is extremely low
  • Validation error is much higher
  • Large gap between curves

This indicates the model is memorizing training data instead of generalizing.


Learning Curve for Good Fit

In a well-trained model:

  • Training error is low
  • Validation error is also low
  • Curves converge at a low error level

This indicates a balanced model.
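The three patterns can be compared side by side by computing a learning curve for a simple, a moderate, and a very flexible model. The sketch below is illustrative only: the helper name `final_errors`, the chosen degrees, and the unsorted copy of the sine data are assumptions, and it reports just the last point of each curve instead of plotting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Sine data as in the tutorial, left unsorted so plain 5-fold splits are random
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])


def final_errors(degree):
    """Return (training MSE, validation MSE) at the largest training size."""
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    _, train_scores, val_scores = learning_curve(
        model,
        X,
        y,
        cv=5,
        scoring="neg_mean_squared_error",
        train_sizes=np.linspace(0.3, 1.0, 5),
        shuffle=True,
        random_state=42,
    )
    # Scores are negative MSE, so flip the sign to get errors
    return -train_scores.mean(axis=1)[-1], -val_scores.mean(axis=1)[-1]


for degree in (1, 4, 15):
    train_mse, val_mse = final_errors(degree)
    gap = val_mse - train_mse
    print(f"degree={degree:2d}  train={train_mse:.3f}  val={val_mse:.3f}  gap={gap:.3f}")
```

Read the output the same way as the curves: two high numbers suggest underfitting, two low numbers a good fit, and a low training error with a much higher validation error suggests overfitting.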


Detecting Overfitting and Underfitting

Understanding theory is useful, but in real machine learning projects we must diagnose these problems quickly.

Here are the most common signs.


Key Signs in Practice

Symptom   | Training Accuracy | Validation Accuracy | Diagnosis
Both low  | Low               | Low                 | Underfitting
Both high | High              | High                | Good Fit
Large gap | High              | Low                 | Overfitting

This table is a handy rule of thumb for quickly telling underfitting and overfitting apart.

For regression models, read “accuracy” as any score where higher is better, or reverse the logic when you work with errors.
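The table translates directly into a small helper. This is only a sketch: the function name `diagnose_fit` and both thresholds are hypothetical and would need tuning for a real project and metric.

```python
def diagnose_fit(train_score, val_score, good_enough=0.8, max_gap=0.1):
    """Classify a model from its training and validation scores.

    Scores are assumed to be 'higher is better' (for example accuracy or R^2).
    The thresholds are illustrative assumptions, not standard values.
    """
    if train_score < good_enough:
        # The model cannot even fit the data it was trained on
        return "underfitting"
    if train_score - val_score > max_gap:
        # Strong on training data, clearly weaker on unseen data
        return "overfitting"
    return "good fit"


print(diagnose_fit(0.55, 0.52))
print(diagnose_fit(0.99, 0.70))
print(diagnose_fit(0.92, 0.90))
```

Checking the training score first mirrors the table: a model that is weak on its own training data is underfitting regardless of the gap.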


Using Cross-Validation

Cross-validation is another powerful technique for detecting model issues.

Instead of using a single train-test split, the dataset is divided into multiple folds.

The model is trained and evaluated several times.


Cross-Validation Example in Python

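from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Wrap the polynomial step and the regression in one estimator,
# because cross_val_score refits the model on raw X in every fold
poly_model = make_pipeline(
    PolynomialFeatures(degree=4, include_bias=False),
    LinearRegression()
)

scores = cross_val_score(
    poly_model,
    X,
    y,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),  # X is sorted, so shuffle
    scoring='neg_mean_squared_error'
)

# Scores are negative MSE: values closer to zero are better
print("Cross Validation Scores:", scores)
print("Average Score:", scores.mean())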

Cross-validation provides:

  • More reliable model evaluation
  • Reduced risk of misleading results
  • Better detection of overfitting

Because of these benefits, cross-validation is a standard practice in machine learning.
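To use cross-validation specifically for spotting overfitting, it helps to look at training and validation scores together. One way to do that, sketched below, is scikit-learn's `cross_validate` with `return_train_score=True`. The helper name `cv_errors`, the degrees compared, and the standardization step are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same synthetic sine data as in the tutorial
np.random.seed(42)
X = np.sort(np.random.rand(100, 1) * 10, axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Shuffle because the rows are sorted by X
cv = KFold(n_splits=5, shuffle=True, random_state=42)


def cv_errors(degree):
    """Return mean (training MSE, validation MSE) across the folds."""
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    result = cross_validate(
        model,
        X,
        y,
        cv=cv,
        scoring="neg_mean_squared_error",
        return_train_score=True,  # also report the score on each training fold
    )
    return -result["train_score"].mean(), -result["test_score"].mean()


for degree in (1, 4, 15):
    train_mse, val_mse = cv_errors(degree)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  validation MSE={val_mse:.4f}")
```

A validation error that sits far above the training error, fold after fold, is a more trustworthy sign of overfitting than the same gap seen on a single split.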


Why Beginners Often Face Overfitting

Many beginner machine learning projects suffer from overfitting.

Common reasons include:

  • Using very complex models
  • Training on small datasets
  • Ignoring regularization
  • Not using cross-validation
  • Testing the model on the same data used for training

Understanding overfitting vs underfitting in machine learning helps prevent these mistakes.


Why Underfitting Happens

Underfitting usually happens when models are too simple.

Common causes include:

  • Using linear models for nonlinear data
  • Removing too many features
  • Excessive regularization
  • Insufficient training time

While underfitting is easier to fix than overfitting, it still leads to poor predictions.


Solutions for Underfitting

Underfitting happens when the model is too simple to capture the structure of the data, so most fixes give the model more capacity or more information.

Below are the most effective solutions.


1. Increase Model Complexity

The most common reason for underfitting is that the model is too simple.

For example:

  • Linear regression may not capture nonlinear relationships.
  • Decision trees with very small depth may miss important patterns.

Increasing model complexity can solve this problem.

Examples include:

  • Using polynomial regression
  • Increasing tree depth
  • Using more advanced algorithms

2. Add More Features

Sometimes the model fails because the dataset lacks important information.

For example, predicting house prices using only one feature may cause underfitting.

Adding relevant features improves the model’s ability to learn patterns.

Examples of useful features:

  • Location data
  • Historical trends
  • Derived features (feature engineering)

Feature engineering is a powerful tool in machine learning.


3. Reduce Regularization

Regularization techniques prevent overfitting by penalizing large model weights.

However, excessive regularization may make the model too simple.

If regularization strength is too high, the model may suffer from underfitting.

Reducing regularization strength can improve model performance.


4. Train the Model Longer

Some algorithms, especially neural networks, require longer training time.

Stopping the training process too early may prevent the model from learning useful patterns.

Increasing the number of training iterations can sometimes fix underfitting.

For more on solution 2, see Feature Engineering in Python for Machine Learning, which covers techniques that help models capture more meaningful patterns.


Solutions for Overfitting

Overfitting is harder to spot because the model looks excellent on training data; the problem only shows up on new data.

Below are the most effective solutions used in machine learning.

Figure: Overfitting prevention techniques: cross-validation, regularization, more data, and feature selection.

1. Use Cross-Validation

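Cross-validation evaluates the model on multiple dataset splits.

This reduces the risk of misleading results caused by a single train-test split.

Cross-validation does not remove overfitting by itself, but it reveals whether the model performs consistently across different subsets of data, which makes it the standard way to compare candidate models and settings.

Example in Python:

from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# The polynomial step must live inside the estimator,
# because cross_val_score refits it on raw X in every fold
poly_model = make_pipeline(
    PolynomialFeatures(degree=4, include_bias=False),
    LinearRegression()
)

scores = cross_val_score(
    poly_model,
    X,
    y,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),  # X is sorted, so shuffle
    scoring='neg_mean_squared_error'
)

# Negative MSE: values closer to zero are better
print("Average CV Score:", scores.mean())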


2. Regularization Techniques

Regularization reduces model complexity by penalizing large coefficients.

Two common regularization techniques are:

  • Ridge Regression
  • Lasso Regression

These methods are widely used to control model complexity and reduce overfitting. For a deeper look, see Ridge and Lasso regression explained.


Ridge Regression Example

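Ridge regression adds an L2 penalty on large model weights.

This helps prevent models from becoming overly complex. The snippet below shows the basic API on the raw single-feature data; the penalty matters most when the model has many features, such as high-degree polynomial terms.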

from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=1.0)

ridge_model.fit(X_train, y_train)

predictions = ridge_model.predict(X_test)

print("Ridge Model Predictions:", predictions[:5])

The parameter alpha controls regularization strength.

Higher alpha values increase regularization.
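The effect of alpha is easiest to see on a model that has room to overfit. The sketch below applies Ridge to degree-15 polynomial features of the sine data and tracks the size of the learned coefficients. The alpha values, the standardization step, and the use of the coefficient norm as a complexity measure are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same synthetic sine data as in the tutorial
np.random.seed(42)
X = np.sort(np.random.rand(100, 1) * 10, axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

coef_norms = {}
for alpha in (0.01, 1.0, 100.0):
    model = make_pipeline(
        PolynomialFeatures(degree=15, include_bias=False),
        StandardScaler(),  # put all polynomial terms on a comparable scale
        Ridge(alpha=alpha),
    )
    model.fit(X_train, y_train)

    # Size of the weight vector: smaller means a more constrained model
    coef_norms[alpha] = float(np.linalg.norm(model.named_steps["ridge"].coef_))
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(
        f"alpha={alpha:7.2f}  |coef|={coef_norms[alpha]:10.3f}  "
        f"train MSE={train_mse:.4f}  test MSE={test_mse:.4f}"
    )
```

The coefficient norm shrinks as alpha grows. If alpha is pushed too far, both errors rise again, which is the underfitting caused by excessive regularization described earlier.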


Lasso Regression Example

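Lasso regression uses an L1 penalty, which can shrink some coefficients to exactly zero. It therefore performs regularization and feature selection simultaneously. On the single-feature data used here there is nothing to select, so the snippet below only demonstrates the API.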

from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=0.1)

lasso_model.fit(X_train, y_train)

predictions = lasso_model.predict(X_test)

print("Lasso Predictions:", predictions[:5])

Lasso is particularly useful when dealing with high-dimensional datasets.


3. Collect More Training Data

One of the most effective ways to reduce overfitting is simply using more data.

With larger datasets:

  • Models are less likely to memorize noise
  • Patterns become easier to identify
  • Variance decreases

In real-world machine learning projects, collecting more data is often the best solution.


4. Reduce Model Complexity

Another approach is to simplify the model.

Examples include:

  • Reducing polynomial degree
  • Limiting decision tree depth
  • Using fewer features

Simpler models are less likely to memorize training data.
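Limiting decision tree depth is a convenient way to watch this effect. The following sketch fits trees of different depths to the sine data; the depth values are arbitrary choices, and `max_depth=None` lets the tree grow until every training point is fit exactly.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Same synthetic sine data as in the tutorial
np.random.seed(42)
X = np.sort(np.random.rand(100, 1) * 10, axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

errors = {}
for depth in (1, 4, None):  # None means no depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)

    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    errors[depth] = (train_mse, test_mse)
    print(f"max_depth={depth}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

The unlimited tree reaches zero training error because it memorizes every training point, so its training score says nothing about generalization; only the testing error does.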


5. Data Augmentation

In some fields like computer vision, collecting data may be difficult.

In such cases, data augmentation can help.

Examples include:

  • Rotating images
  • Flipping images
  • Adding noise
  • Scaling images

These techniques increase dataset diversity and reduce overfitting.
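A few of these transformations can be sketched with plain NumPy. This is a toy illustration, not a production augmentation pipeline: the random 8x8 array stands in for a real grayscale image, and the helper name `augment` and the noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a real grayscale image: an 8x8 array with values in [0, 1]
image = rng.random((8, 8))


def augment(image, rng, noise_std=0.05):
    """Return simple variants of one image (hypothetical helper)."""
    flipped = np.fliplr(image)  # horizontal flip
    rotated = np.rot90(image)   # 90-degree rotation
    # Add Gaussian noise, then clip back into the valid pixel range
    noisy = np.clip(image + rng.normal(0, noise_std, image.shape), 0.0, 1.0)
    return [flipped, rotated, noisy]


augmented = augment(image, rng)
print("Variants created:", len(augmented))
print("Shapes:", [a.shape for a in augmented])
```

Each variant keeps the label of the original image, so one labeled example becomes several training examples.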


6. Early Stopping

Early stopping is commonly used in deep learning.

During training, model performance on validation data is monitored.

Training stops when validation performance begins to degrade.

This prevents the model from learning noise in the dataset.
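The same idea is available outside deep learning. As a stand-in that needs only scikit-learn, the sketch below uses the built-in early stopping of GradientBoostingRegressor: it holds out part of the training data and stops adding trees once the validation score has not improved for a set number of rounds. The dataset and all parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Noisy sine data, similar to the tutorial dataset but larger
np.random.seed(42)
X = np.random.rand(300, 1) * 10
y = np.sin(X).ravel() + np.random.normal(0, 0.3, X.shape[0])

model = GradientBoostingRegressor(
    n_estimators=500,         # upper limit on boosting rounds
    validation_fraction=0.2,  # share of the training data held out for monitoring
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
model.fit(X, y)

# n_estimators_ is the number of rounds actually used
print("Boosting rounds used:", model.n_estimators_, "of a possible 500")
```

If training stops well short of the limit, the extra rounds would mostly have fit noise in the training data.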


Real-World Example: California Housing Dataset

Now let’s apply these concepts to a real dataset.

We will use the California Housing dataset, which is available in scikit-learn.


Loading the Dataset

from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()

X = data.data
y = data.target

This dataset contains features such as:

  • Median income
  • House age
  • Average rooms
  • Population
  • Location coordinates

The target variable is the median house value.


Train-Test Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Training a Linear Regression Model

from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X_train, y_train)

predictions = model.predict(X_test)

Evaluating Model Performance

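from sklearn.metrics import mean_squared_error

# Compare errors on both splits to check for under- or overfitting
train_mse = mean_squared_error(y_train, model.predict(X_train))
mse = mean_squared_error(y_test, predictions)

print("Training Mean Squared Error:", train_mse)
print("Testing Mean Squared Error:", mse)

This gives us a baseline model. If both errors are high, the model is underfitting; if the testing error is much higher than the training error, it is overfitting.

From here, we can experiment with:

  • Regularization
  • Feature engineering
  • Model complexity

After each change, compare the two errors again to see whether the model has moved toward a better fit.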


Practical Tips for Machine Learning Projects

Below are some best practices used by professional data scientists.


Start with Simple Models

Always begin with simple algorithms like:

  • Linear Regression
  • Logistic Regression
  • Decision Trees

These models help establish a baseline.

Once the baseline is understood, you can move to more complex algorithms.


Monitor Training and Validation Performance

Never rely only on training accuracy.

Always evaluate models on validation or testing datasets.

Large differences between training and validation performance often indicate overfitting.


Use Cross-Validation Regularly

Cross-validation makes performance estimates less dependent on one particular train-test split.

Most professional machine learning pipelines include cross-validation as a standard step.


Keep a Separate Test Dataset

The test dataset should be used only once, at the final stage of evaluation.

Using the test dataset during training or model selection makes the final score look better than the model really is.
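One common way to follow this rule is a three-way split: a training set for fitting, a validation set for tuning, and a test set that stays untouched until the end. The sketch below calls train_test_split twice; the placeholder data and the 60/20/20 proportions are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 200 samples with one feature each
X = np.arange(200).reshape(-1, 1)
y = np.arange(200)

# Step 1: set aside the final test set (20% of all data)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remainder into training and validation sets
# (25% of the remaining 80% equals 20% of the original data)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 120 40 40
```

Use the validation set for decisions such as polynomial degree or alpha, and evaluate on the test set only after those decisions are final.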


Quick Reference Guide

Here is a simple summary of overfitting vs underfitting in machine learning.

Problem      | Cause             | Solution
Underfitting | Model too simple  | Increase complexity
Overfitting  | Model too complex | Add regularization
Overfitting  | Small dataset     | Collect more data
Underfitting | Missing features  | Feature engineering
This table is useful when diagnosing machine learning models.


Key Takeaways

Let’s summarize the most important lessons from this guide.

  • Overfitting and underfitting are two common model performance problems.
  • Underfitting occurs when the model is too simple to capture data patterns.
  • Overfitting occurs when the model memorizes training data instead of learning patterns.
  • The goal is to find the right balance between bias and variance.
  • Techniques such as cross-validation, regularization, and feature engineering help improve model performance.

Mastering these concepts is essential for building reliable machine learning systems.


Practice Exercises

To strengthen your understanding, try the following exercises.

Exercise 1

Modify the polynomial degree in the earlier example.

Observe how the training and testing errors change.

Try values like:

  • Degree 2
  • Degree 5
  • Degree 10

Exercise 2

Experiment with Ridge Regression using different alpha values.

Try:

  • alpha = 0.1
  • alpha = 1
  • alpha = 10

Observe how model performance changes.


Exercise 3

Create your own synthetic dataset using NumPy.

Try to identify:

  • Underfitting
  • Good fit
  • Overfitting

Additional Resources

To deepen your understanding of overfitting vs underfitting in machine learning, consider the following resources.

Recommended books:

  • An Introduction to Statistical Learning
  • The Elements of Statistical Learning

Useful documentation:

  • scikit-learn official documentation
  • Python data science tutorials

These resources provide deeper theoretical explanations and practical examples.


Final Thoughts

Understanding overfitting vs underfitting in machine learning is one of the most important steps for anyone starting a machine learning journey.

Most beginners focus heavily on algorithms, but the real success of machine learning projects often depends on how well models generalize to unseen data.

By learning how to detect and fix these issues using Python, you are already developing the mindset of a professional machine learning engineer.

As you build more projects, always remember the key principle:

The goal of machine learning is not perfect training performance — it is reliable real-world predictions.

Once you understand model evaluation concepts like overfitting and underfitting, you can apply them to real projects such as Build an AI Text Analyzer with Python.


FAQ – Overfitting vs Underfitting in Machine Learning

What is overfitting vs underfitting in machine learning?

Underfitting occurs when a model is too simple and cannot learn the patterns in the data.
Overfitting occurs when a model is too complex and memorizes the training data instead of learning general patterns.

How can you detect overfitting and underfitting?

You can detect these problems by comparing training and testing performance.
  • High training error + high testing error → Underfitting
  • Low training error + high testing error → Overfitting

What causes overfitting in machine learning?

Overfitting usually happens due to:
  • Very complex models
  • Small datasets
  • Too many features
  • Lack of regularization

How can overfitting be prevented?

Common ways to prevent overfitting include:
  • Using cross-validation
  • Applying regularization (Ridge or Lasso)
  • Reducing model complexity
  • Collecting more data

What causes underfitting in machine learning?

Underfitting happens when the model is too simple or does not have enough features to learn the data patterns.

Why is train-test split important?

Train-test split allows you to evaluate model performance on unseen data, helping detect overfitting and underfitting.
