Handling missing data is an essential step in data cleaning and preprocessing. Real-world datasets often contain gaps due to errors, incomplete entries, or system issues. In Python Pandas, missing values are usually represented as NaN. We can manage them using different strategies: dropping rows or columns with missing values (dropna()
), filling them with constants, mean, median, or previous values (fillna()
), or estimating them through interpolation (interpolate()
). The choice depends on the dataset and analysis goals. Proper handling ensures that the model or analysis is accurate, reliable, and free from biases caused by incomplete information.
Real-world datasets often contain missing values. Pandas provides powerful functions to handle them effectively.
1. Creating Sample Data
import pandas as pd
import numpy as np
# Sample dataset with missing values
data = {
"Employee": ["Alice", "Bob", "Charlie", "David", "Eva"],
"Department": ["HR", "IT", "Finance", "Sales", "IT"],
"Salary": [50000, np.nan, 60000, np.nan, 55000],
"Bonus": [5000, 4000, np.nan, 3000, np.nan],
"JoiningYear": [2015, 2016, np.nan, 2018, 2019]
}
df = pd.DataFrame(data)
print("Original DataFrame with Missing Values:")
print(df)
Output:
Employee | Department | Salary | Bonus | JoiningYear |
---|---|---|---|---|
Alice | HR | 50000.0 | 5000.0 | 2015.0 |
Bob | IT | NaN | 4000.0 | 2016.0 |
Charlie | Finance | 60000.0 | NaN | NaN |
David | Sales | NaN | 3000.0 | 2018.0 |
Eva | IT | 55000.0 | NaN | 2019.0 |
2. Detecting Missing Values
print(df.isnull()) # Boolean mask of missing values
print(df.isnull().sum()) # Count missing values per column
3. Removing Missing Data with dropna()
# Drop rows with any missing value
df_drop_rows = df.dropna()
print("After dropping rows with NaN:")
print(df_drop_rows)
# Drop columns with any missing value
df_drop_cols = df.dropna(axis=1)
print("After dropping columns with NaN:")
print(df_drop_cols)
4. Filling Missing Data with fillna()
# Fill NaN with a constant
df_fill_const = df.fillna(0)
print("Fill NaN with 0:")
print(df_fill_const)
# Fill NaN with column mean
df_fill_mean = df.fillna(df.mean(numeric_only=True))
print("Fill NaN with column mean:")
print(df_fill_mean)
# Forward fill (use previous value)
df_ffill = df.fillna(method="ffill")
print("Forward Fill:")
print(df_ffill)
# Backward fill (use next value)
df_bfill = df.fillna(method="bfill")
print("Backward Fill:")
print(df_bfill)
5. Interpolating Missing Data
Interpolation fills missing values by estimating based on other values.
df_interpolated = df.interpolate(method="linear")
print("After Interpolation:")
print(df_interpolated)
6. Example Use Cases
- dropna() → Useful when you can safely ignore missing rows/columns.
- fillna() → Use when you have a default value, mean, median, or want forward/backward filling.
- interpolate() → Best for numerical data like time series.
✅ Now you can practice handling missing data directly with this dataset.
Perfect example 👌 — you’ve shown a time series dataset where some dates are missing.
To demonstrate time-based interpolation, let’s recreate this dataset with gaps in the dates and then use Pandas interpolation.
📘 Example 1: Interpolating Missing Dates in Time Series
import pandas as pd
import numpy as np
# Sample data with missing dates
data = {
"day": [
"1/1/2017", "1/4/2017", "1/5/2017",
"1/6/2017", "1/7/2017", "1/8/2017",
"1/9/2017", "1/10/2017", "1/11/2017"
],
"temperature": [32, np.nan, 28, np.nan, 32, np.nan, np.nan, 34, 40],
"windspeed": [6, 9, np.nan, 7, np.nan, np.nan, np.nan, 8, 12],
"event": ["Rain", "Sunny", "Snow", np.nan, "Rain", "Sunny", np.nan, "Cloudy", "Sunny"]
}
df = pd.DataFrame(data)
df["day"] = pd.to_datetime(df["day"]) # convert to datetime
df.set_index("day", inplace=True)
print("Original DataFrame with Missing Dates:")
print(df)
# Reindex to include all dates
date_range = pd.date_range(start="2017-01-01", end="2017-01-11")
df = df.reindex(date_range)
print("\nDataFrame with Full Date Range (NaN for missing dates):")
print(df)
# Interpolate missing numeric values
df_interpolated = df.interpolate(method="time")
print("\nDataFrame after Time Interpolation:")
print(df_interpolated)
🔑 Explanation
- Missing Dates are filled using
pd.date_range()
andreindex()
. interpolate(method="time")
uses actual time gaps to estimate missing values.- Non-numeric columns like
"event"
remain unchanged (stillNaN
).
✅ With this, you can clearly see how Pandas fills gaps in time series data smoothly.
Got it 👍 — let’s take a different real-world style example to show how interpolate(method="time")
works.
📘 Example 2: Stock Prices with Missing Dates
Imagine you have stock price data, but trading was suspended on some days, leaving gaps.
import pandas as pd
import numpy as np
# Stock price dataset with missing days
data = {
"date": ["2023-01-01", "2023-01-03", "2023-01-06", "2023-01-07"],
"stock_price": [100, np.nan, 110, 115],
"volume": [1000, 1200, np.nan, 1500]
}
df = pd.DataFrame(data)
df["date"] = pd.to_datetime(df["date"])
df.set_index("date", inplace=True)
print("Original DataFrame:")
print(df)
# Reindex to fill missing dates
date_range = pd.date_range(start="2023-01-01", end="2023-01-07")
df = df.reindex(date_range)
print("\nDataFrame with Missing Dates:")
print(df)
# Interpolate numeric values using time
df_interpolated = df.interpolate(method="time")
print("\nAfter Time Interpolation:")
print(df_interpolated)
🔎 What happens here?
- Missing 2023-01-02, 2023-01-04, 2023-01-05 are added.
stock_price
andvolume
are estimated based on actual time gaps, not just row order.- Great for financial, weather, or sensor data where time is critical.
👉 This way, you can practice interpolation on a dataset other than weather.