Train Test Split in Python (Scikit-Learn): Beginner Guide with Model Training Example (2026)

Train Test Split in Python illustration showing dataset split into training data and testing data for machine learning model evaluation

Machine learning models can sometimes behave like students who memorize answers instead of truly understanding a subject.

Imagine a student preparing for an exam. Instead of learning concepts, the student memorizes the exact answers from practice questions. When the actual exam arrives with slightly different questions, the student fails.

Machine learning models can fall into the same trap.

If a model is trained and tested on the same dataset, it may memorize patterns instead of learning how to generalize. The model might show perfect performance during training but fail when it encounters new real-world data.

This is exactly why Train Test Split in Python is one of the most important steps in the machine learning workflow.

By splitting a dataset into two parts:

  • Training data
  • Testing data

we ensure that the model is evaluated on data it has never seen before.

In this beginner-friendly guide, you will learn:

  • What Train Test Split in Python means
  • Why it is important in machine learning
  • How to implement it using Scikit-Learn
  • A complete example including model training and evaluation

By the end of this tutorial, you will understand how train_test_split in Python works and how it fits into a real machine learning workflow.

Before building models, it’s important to understand the complete machine learning workflow in Python, including data preparation, training, and evaluation.


What is Train Test Split in Python

Train Test Split in Python is a technique used in machine learning to divide a dataset into two separate parts.

These parts are called:

  1. Training dataset
  2. Testing dataset

The training dataset is used to train the machine learning model, while the testing dataset is used to evaluate how well the model performs.

Instead of evaluating the model on the same data used for training, we test the model on unseen data.

This approach helps us measure how well the model can generalize to new information.

For example, imagine you have a dataset containing house prices.

Your dataset might include:

  • house size
  • number of rooms
  • location
  • price

If we train a model on the entire dataset and evaluate it on the same data, the model might appear perfect.

However, when we try to predict prices for new houses, the model may perform poorly.

Using Train Test Split in Python solves this problem by ensuring that some data remains unseen during training.

The workflow usually looks like this:

Dataset → Split → Train Model → Test Model

This simple step dramatically improves the reliability of machine learning models.

Train Test Split in Python diagram showing dataset divided into 80 percent training data and 20 percent testing data

Training Data vs Testing Data Explained

To fully understand Train Test Split in Python, we must clearly understand the difference between training data and testing data.

Training Data

Training data is the portion of the dataset used to teach the machine learning model.

During this stage, the model learns patterns, relationships, and structures from the data.

For example, in a salary prediction model:

Training data might include:

ExperienceSalary
1 year20000
3 years30000
5 years40000

The model learns how experience affects salary.

The model builds a mathematical relationship from this training data.


Testing Data

Testing data is used to evaluate the performance of the trained model.

The testing dataset is never shown to the model during training.

After the model is trained, we give it the testing data and see how accurate the predictions are.

For example:

ExperienceActual Salary
4 years35000

The model predicts a value.

Then we compare the predicted value with the real value.

This tells us how well the model performs on unseen data.

Using Train Test Split in Python ensures that the model is evaluated fairly.


Train / Validation / Test Split (Important Concept)

In many machine learning tutorials, you will see datasets divided into three parts instead of two.

These parts are:

  • Training dataset
  • Validation dataset
  • Testing dataset

Each part serves a different purpose.

DatasetPurpose
TrainingModel learning
ValidationModel tuning
TestingFinal evaluation

The validation dataset helps adjust model parameters such as:

  • learning rate
  • model complexity
  • hyperparameters

Typical dataset splits include:

70% training
15% validation
15% testing

or

60% training
20% validation
20% testing

However, beginners usually start with a simple Train Test Split in Python, which divides the dataset into only two parts.

For most beginner projects and tutorials, a two-way split is perfectly sufficient.

Later, when working on advanced machine learning models, you will encounter validation datasets more frequently.


Train Test Split in Python Using Scikit-Learn

In Python, the easiest way to perform Train Test Split in Python is by using the Scikit-Learn library.

Scikit-Learn is one of the most popular machine learning libraries in Python.

It provides powerful tools for:

  • machine learning models
  • preprocessing
  • evaluation
  • dataset splitting

First, install the library if it is not already installed.

pip install scikit-learn

Next, import the function used for dataset splitting.

from sklearn.model_selection import train_test_split

The train_test_split() function automatically divides the dataset into training and testing sets.

The basic syntax looks like this:

train_test_split(X, y, test_size, train_size, random_state)

Where:

  • X represents input features
  • y represents target values
  • test_size controls the percentage of test data
  • train_size controls training data size
  • random_state ensures reproducibility

Using Train Test Split in Python with Scikit-Learn takes only a single line of code, which makes it extremely convenient for machine learning projects.

The easiest way to perform train test split in Python is by using the powerful Scikit-Learn library in Python, which provides many tools for machine learning.

You can explore more details in the official Scikit-Learn train_test_split documentation.


What is the Ideal Train Test Split Ratio

One common question beginners ask is:

What is the best dataset split ratio?

There is no universal rule, but some ratios are widely used.

80 / 20 Split

This is the most common ratio.

80% training data
20% testing data

This works well for most datasets.


70 / 30 Split

Used when datasets are smaller.

More data is reserved for testing.


90 / 10 Split

Used for very large datasets.

Large datasets already contain enough training examples.


The goal of Train Test Split in Python is to ensure:

  • enough data for training
  • enough data for reliable testing

Choosing the right ratio depends on the dataset size and project complexity.


Complete Train Test Split Python Example (Mini Machine Learning Workflow)

So far, we have learned the theory behind Train Test Split in Python. Now it’s time to see how it works in a real example.

Many tutorials stop right after splitting the dataset. However, a beginner needs to see the complete workflow, including:

  1. Splitting the dataset
  2. Training a model
  3. Making predictions
  4. Evaluating the model

In this section, we will build a simple machine learning example using Train Test Split in Python and a basic regression model.


Step 1: Import Required Libraries

First, we import the required libraries.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Here is what each library does:

LibraryPurpose
pandasUsed to handle datasets
train_test_splitSplits the dataset
LinearRegressionMachine learning model
mean_squared_errorEvaluates prediction error

These tools together allow us to build a simple machine learning pipeline.


Step 2: Create a Simple Dataset

For this tutorial, we will create a small dataset manually.

The dataset represents a relationship between work experience and salary.

data = {
'experience':[1,2,3,4,5,6,7,8],
'salary':[20000,25000,30000,35000,40000,45000,50000,55000]
}df = pd.DataFrame(data)

The dataset now looks like this:

ExperienceSalary
120000
225000
330000
435000
540000
645000
750000
855000

In a real machine learning project, the dataset would usually come from:

  • CSV files
  • databases
  • APIs

But the process of Train Test Split in Python remains the same.

In real projects, datasets are usually loaded using Pandas for data analysis in Python before applying machine learning techniques.


Step 3: Define Features and Target Variables

In machine learning, we divide data into:

  • Features (X)
  • Target variable (y)

Features are the input values used to make predictions.

The target variable is the value we want to predict.

In our dataset:

Experience → Feature
Salary → Target

X = df[['experience']]
y = df['salary']

Here:

  • X contains the input variable
  • y contains the output variable

This step prepares the dataset for Train Test Split in Python.


Step 4: Apply Train Test Split in Python

Now we split the dataset into training and testing sets.

X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42
)

This line performs Train Test Split in Python.

Let’s understand what is happening.

test_size = 0.2

This means:

20% of the dataset will be used for testing.

80% of the dataset will be used for training.

random_state = 42

This ensures that the split remains consistent every time the code runs.

Without random_state, the dataset split could change each time you execute the code.

Using a fixed random state makes experiments reproducible.


Step 5: Train the Machine Learning Model

After performing Train Test Split in Python, the next step is to train the machine learning model.

In this example we use Linear Regression in Scikit-Learn, one of the simplest machine learning models..

model = LinearRegression()
model.fit(X_train, y_train)

The fit() function trains the model using the training dataset.

During training, the model learns the relationship between:

Experience → Salary

The model now creates a mathematical formula that can predict salary based on experience.


Step 6: Make Predictions

Once the model is trained, we use it to make predictions on the testing dataset.

predictions = model.predict(X_test)

These predictions represent the model’s estimated salary values based on experience.

The predictions come from data the model has never seen before, which is exactly why Train Test Split in Python is important.


Step 7: Evaluate the Model

Now we evaluate how accurate the model is.

Model performance can be measured using metrics such as Mean Squared Error in machine learning.

mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)

Mean Squared Error measures how far the predictions are from the real values.

Lower MSE indicates better performance.

For example:

PredictionActualError
36000350001000

If the model consistently produces small errors, it means the model is performing well.

This simple workflow demonstrates the real power of Train Test Split in Python.


Understanding Train Test Split Parameters

The train_test_split() function contains several parameters that control how the dataset is divided.

Understanding these parameters helps you use Train Test Split in Python effectively.


test_size Parameter

The test_size parameter determines the percentage of the dataset used for testing.

Example:

test_size = 0.2

This means:

20% testing data
80% training data

Example with different values:

test_size = 0.3

Now the dataset becomes:

70% training
30% testing

The test_size parameter is one of the most commonly used settings in Train Test Split in Python.


train_size Parameter

The train_size parameter controls how much data is used for training.

Example:

train_size = 0.75

This means:

75% training data
25% testing data

Although test_size is used more frequently, train_size provides additional flexibility when controlling dataset proportions.


Using Both train_size and test_size

You can also specify both parameters.

Example:

train_test_split(X, y, train_size=0.7, test_size=0.3)

This explicitly defines:

70% training
30% testing

When both parameters are defined, they must add up to less than or equal to 1.

Using both parameters helps beginners clearly control the dataset structure when using Train Test Split in Python.


random_state Parameter

The random_state parameter ensures that the dataset split is reproducible.

Example:

random_state = 42

Without random_state:

Each time you run the code, the dataset may split differently.

With random_state:

The same training and testing sets are produced every time.

This makes debugging and experimentation easier.

That is why most tutorials include random_state when using Train Test Split in Python.


Visual Explanation of Train Test Split

To better understand Train Test Split in Python, imagine a dataset with 100 rows.

Before splitting:

Dataset
100 samples

After applying an 80 / 20 split:

Training Data → 80 samples
Testing Data → 20 samples

Visual representation:

Dataset

████████████████████████████████████████████████████

After split

Training Data
████████████████████████████████████████

Testing Data
████████

The model learns patterns from the training section and is evaluated using the testing section.

This ensures that the evaluation reflects how the model will perform on new unseen data.


Common Mistakes Beginners Make When Using Train Test Split in Python

Even though Train Test Split in Python is easy to implement, beginners often make several mistakes that can lead to misleading results.

Understanding these mistakes will help you build more reliable machine learning models.


1. Training and Testing on the Same Dataset

One of the most common beginner mistakes is evaluating a model on the same dataset used for training.

When this happens, the model may appear extremely accurate because it has already seen the data.

However, this does not reflect real-world performance.

Example problem:

If a model memorizes training data patterns instead of learning general rules, it will fail when encountering new data.

Using Train Test Split in Python prevents this issue by ensuring the model is tested on unseen data.


2. Using a Very Small Test Dataset

Another mistake is allocating too little data for testing.

For example:

95% training data
5% testing data

This can produce unreliable evaluation results because the test dataset may not represent the overall dataset.

A better approach is using common ratios like:

  • 80 / 20 split
  • 70 / 30 split

These ratios provide enough data for both training and testing when using Train Test Split in Python.


3. Forgetting random_state

Beginners sometimes forget to include the random_state parameter.

Without random_state, the dataset split changes every time the code runs.

This makes it difficult to reproduce results or debug models.

Example:

train_test_split(X, y, test_size=0.2, random_state=42)

Setting random_state ensures consistent results whenever you run the experiment.


4. Data Leakage

Data leakage occurs when information from the test dataset accidentally influences the training process.

Proper data preprocessing in machine learning is essential to avoid issues such as data leakage and incorrect dataset splitting.

This often happens when preprocessing steps are performed before splitting the dataset.

Incorrect workflow:

  1. Normalize dataset
  2. Split dataset

Correct workflow:

  1. Split dataset
  2. Apply preprocessing to training data

Preventing leakage is critical when using Train Test Split in Python.


5. Splitting Data Incorrectly for Time-Based Datasets

Some datasets contain time-based information.

Examples include:

  • stock prices
  • weather forecasts
  • sales predictions

In these cases, random splitting may break the time sequence.

Instead of random splitting, you should use time-based splitting, where earlier data is used for training and later data is used for testing.

Understanding the dataset type is important when applying Train Test Split in Python.


Train Test Split vs Cross Validation

While Train Test Split in Python is widely used, it is not the only evaluation technique.

Another powerful method is Cross Validation.

Both methods aim to evaluate model performance but work differently.


Train Test Split

In Train Test Split, the dataset is divided once into two parts:

  • training dataset
  • testing dataset

The model is trained once and evaluated once.

Example structure:

DatasetPurpose
Training DataModel training
Testing DataModel evaluation

Advantages:

  • simple to implement
  • fast evaluation
  • ideal for beginner projects

However, it may sometimes produce unreliable results if the dataset is small.


Cross Validation

Cross Validation improves model evaluation by using multiple dataset splits.

The most common type is K-Fold Cross Validation.

In K-Fold Cross Validation:

  1. The dataset is divided into K equal parts
  2. The model trains on K-1 parts
  3. The remaining part is used for testing
  4. The process repeats K times

Example with 5-fold cross validation:

FoldTraining DataTesting Data
14 folds1 fold
24 folds1 fold
34 folds1 fold
44 folds1 fold
54 folds1 fold

The final performance is calculated as the average of all evaluations.


When to Use Each Method

MethodBest For
Train Test SplitBeginner projects
Cross ValidationResearch and production models
Train / Validation / Test SplitHyperparameter tuning

For most beginner tutorials, Train Test Split in Python remains the simplest and most practical starting point.

Train test split vs K fold cross validation infographic showing dataset evaluation methods in machine learning

Where Train Test Split Fits in the Machine Learning Workflow

Understanding where Train Test Split in Python fits in the machine learning pipeline helps beginners see the bigger picture.

A typical machine learning workflow looks like this:

  1. Data collection
  2. Data cleaning
  3. Feature engineering
  4. Train Test Split
  5. Model training
  6. Model evaluation
  7. Model improvement

The dataset is usually split before training the model.

This ensures the model learns patterns only from the training dataset while the testing dataset remains unseen.

This separation is what allows us to measure real-world performance.

Machine learning workflow diagram showing dataset splitting, model training, predictions, and evaluation

When to Use Train Test Split in Machine Learning Projects

You will encounter Train Test Split in Python in many machine learning applications.

Some common examples include:

Regression Models

Predicting continuous values such as:

  • house prices
  • salaries
  • sales revenue

Train Test Split helps measure prediction accuracy.


Classification Models

Used for predicting categories such as:

  • spam vs non-spam emails
  • disease detection
  • sentiment analysis

Testing datasets ensure that the classifier works on unseen data.


Natural Language Processing (NLP)

In NLP projects, Train Test Split in Python is used for tasks like:

  • text classification
  • sentiment analysis
  • spam detection

Models must be evaluated on unseen text data to ensure reliability.


AI Prediction Systems

Many AI systems rely on proper dataset splitting.

Examples include:

  • recommendation systems
  • fraud detection
  • customer behavior prediction

Without proper dataset splitting, model evaluation would be unreliable.


FAQs: Train Test Split in Python

What does train_test_split do in Python?

The train_test_split() function divides a dataset into training and testing subsets.
This allows machine learning models to learn from one part of the data and be evaluated on unseen data.

What is the best train test split ratio?

The most commonly used ratios are:
80 / 20
70 / 30
Large datasets sometimes use 90 / 10 splits.
The best ratio depends on dataset size and model complexity.

Why do we use random_state in train_test_split?

The random_state parameter ensures that the dataset split remains the same every time the code runs.
This makes experiments reproducible and easier to debug.

What is the difference between training data and testing data?

Training data is used to teach the machine learning model patterns.
Testing data is used to evaluate how well the model performs on unseen data.
Both datasets are essential when using Train Test Split in Python.

Is train test split enough for machine learning models?

For beginner machine learning projects, Train Test Split in Python is usually enough.
However, advanced projects often use Cross Validation or Train / Validation / Test splits for more reliable evaluation.

Conclusion

Understanding Train Test Split in Python is essential for anyone learning machine learning.

Without splitting the dataset, it becomes impossible to accurately measure how well a model performs on new data.

In this guide, we explored:

  • what Train Test Split in Python means
  • why dataset splitting is important
  • how to implement it using Scikit-Learn
  • a complete machine learning example including model training and evaluation

We also discussed common mistakes beginners make and how techniques like Cross Validation can further improve model evaluation.

As you continue learning machine learning, Train Test Split in Python will appear in almost every project you build.

Mastering this concept early will help you create models that are not only accurate during training but also reliable when deployed in real-world applications.

If you want to understand the full machine learning pipeline, read our complete guide on machine learning workflow in Python.

Leave a Comment

Your email address will not be published. Required fields are marked *