
Data preprocessing is a crucial step in the data analysis and machine learning pipeline

26 JANUARY 2025
Mark Sikaundi - Data Scientist and AI Researcher.



Data preprocessing is a crucial step in the data analysis and machine learning pipeline. It involves transforming raw data into a format that is more suitable for analysis and modeling.

Common preprocessing techniques include scaling, normalization, and encoding. Below, we walk through each technique and provide example code.

Scaling

Scaling is a technique used to standardize the range of independent variables or features of data. It is an important step in data preprocessing because it brings the features into a consistent range. This is particularly important for algorithms that are sensitive to the scale of the input data, such as support vector machines and k-nearest neighbors.

Standard Scaling: Standard scaling (z-score standardization) transforms each feature to have zero mean and unit variance.


import numpy as np
from sklearn.preprocessing import StandardScaler

# Example data
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data: each column ends up with zero mean and unit variance
scaled_data = scaler.fit_transform(data)

print("Standardized Data:\n", scaled_data)
          

Min-Max Scaling: Min-Max scaling rescales the data to a fixed range, usually [0, 1], by computing x' = (x - min) / (max - min) for each feature.


from sklearn.preprocessing import MinMaxScaler

# Initialize the scaler
min_max_scaler = MinMaxScaler()

# Fit and transform the data (reuses `data` from the previous example)
min_max_scaled_data = min_max_scaler.fit_transform(data)

print("Min-Max Scaled Data:\n", min_max_scaled_data)
          

Normalization

Normalization is a technique used to rescale each sample (row) of the data so that it has unit norm. It is useful when the features have different units or scales, and it can help improve the convergence and training speed of some machine learning algorithms.

Unlike scaling, which operates on each feature (column), normalization adjusts the values of each row to a common scale without distorting the relative proportions within that row. scikit-learn's Normalizer uses the L2 (Euclidean) norm by default.


from sklearn.preprocessing import Normalizer

# Initialize the normalizer (L2 norm by default)
normalizer = Normalizer()

# Fit and transform the data: each row is scaled to unit L2 norm
normalized_data = normalizer.fit_transform(data)

print("Normalized Data:\n", normalized_data)
          

Encoding

Encoding is a technique used to convert categorical data into a numerical format that can be used for machine learning algorithms. There are different encoding techniques, such as one-hot encoding and label encoding. One-hot encoding is used when the categories are not ordinal, while label encoding is used when the categories have an ordinal relationship.

Label Encoding: Label encoding assigns a unique integer to each category.


from sklearn.preprocessing import LabelEncoder

# Example categorical data
categories = ['cat', 'dog', 'fish', 'cat', 'dog']

# Initialize the encoder
label_encoder = LabelEncoder()

# Fit and transform the data: classes are assigned integers in sorted order
encoded_labels = label_encoder.fit_transform(categories)

print("Label Encoded Data:\n", encoded_labels)
          

One-Hot Encoding: One-hot encoding creates a binary column for each category. scikit-learn's OneHotEncoder returns a sparse matrix by default, or a dense array when sparse_output=False.


            
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Example categorical data (reshaped to a 2D column)
categories = np.array(['cat', 'dog', 'fish', 'cat', 'dog']).reshape(-1, 1)

# Initialize the encoder; sparse_output=False returns a dense array
# (older scikit-learn versions used sparse=False instead)
one_hot_encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the data
one_hot_encoded_data = one_hot_encoder.fit_transform(categories)

print("One-Hot Encoded Data:\n", one_hot_encoded_data)
          

Applying Preprocessing to a Dataset

Let's apply these preprocessing techniques to a sample dataset.


import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Example dataset
data = {
    'age': [25, 45, 35, 50, 23],
    'salary': [50000, 100000, 75000, 120000, 45000],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}

df = pd.DataFrame(data)

# Splitting the dataset into features and target
X = df[['age', 'salary', 'city']]
y = [1, 0, 1, 0, 1]  # Example target variable

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Copy the splits so the assignments below do not trigger pandas copy warnings
X_train = X_train.copy()
X_test = X_test.copy()

# Applying scaling to numerical features (fit on train only, then transform test)
scaler = StandardScaler()
X_train[['age', 'salary']] = scaler.fit_transform(X_train[['age', 'salary']])
X_test[['age', 'salary']] = scaler.transform(X_test[['age', 'salary']])

# Applying one-hot encoding to categorical features; handle_unknown='ignore' is
# needed here because every city is unique, so the test city is never seen
# during fitting
one_hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_train_city_encoded = one_hot_encoder.fit_transform(X_train[['city']])
X_test_city_encoded = one_hot_encoder.transform(X_test[['city']])

# Concatenating the encoded categorical features with the scaled numerical features
X_train_preprocessed = np.hstack((X_train[['age', 'salary']], X_train_city_encoded))
X_test_preprocessed = np.hstack((X_test[['age', 'salary']], X_test_city_encoded))

print("Preprocessed Training Data:\n", X_train_preprocessed)
print("Preprocessed Testing Data:\n", X_test_preprocessed)
          

In this example, we first split the dataset into training and testing sets. We then applied standard scaling to the numerical features (age and salary) and one-hot encoding to the categorical feature (city). Note that the scaler and encoder are fit on the training set only and then used to transform the test set, which prevents information from the test data leaking into the preprocessing. Finally, we concatenated the preprocessed numerical and categorical features to form the final preprocessed dataset.
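
In practice, these per-column steps are often bundled with scikit-learn's ColumnTransformer, which applies each transformer to its own columns in a single fit/transform call. Below is a minimal sketch of the same preprocessing, assuming fresh X_train and X_test splits from train_test_split (before any in-place transformations):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Bundle the per-column steps into a single transformer
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'salary']),
    ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), ['city']),
])

# Fit on the training split only, then transform both splits
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

print("Preprocessed Training Data:\n", X_train_preprocessed)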

These preprocessing steps are essential for preparing data for machine learning models, ensuring that the data is in a suitable format and scale for the algorithms to learn effectively.

Explore more on Lupleg Community