Pandas - Handling Missing Data (fillna, dropna, interpolate)

Chapters

Python Pandas Tutorial

Pandas – Handling Missing Data (fillna, dropna, interpolate)

Handling missing data is an essential step in data cleaning and preprocessing. Real-world datasets often contain gaps due to errors, incomplete entries, or system issues. In Python Pandas, missing values are usually represented as NaN. We can manage them using different strategies: dropping rows or columns with missing values (dropna()), filling them with constants, mean, median, or previous values (fillna()), or estimating them through interpolation (interpolate()). The choice depends on the dataset and analysis goals. Proper handling ensures that the model or analysis is accurate, reliable, and free from biases caused by incomplete information.

Real-world datasets often contain missing values. Pandas provides powerful functions to handle them effectively.

1. Creating Sample Data

import pandas as pd
import numpy as np

# Sample dataset with missing values
data = {
    "Employee": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Department": ["HR", "IT", "Finance", "Sales", "IT"],
    "Salary": [50000, np.nan, 60000, np.nan, 55000],
    "Bonus": [5000, 4000, np.nan, 3000, np.nan],
    "JoiningYear": [2015, 2016, np.nan, 2018, 2019]
}

df = pd.DataFrame(data)
print("Original DataFrame with Missing Values:")
print(df)

Output:

Employee	Department	Salary	Bonus	JoiningYear
Alice	HR	50000.0	5000.0	2015.0
Bob	IT	NaN	4000.0	2016.0
Charlie	Finance	60000.0	NaN	NaN
David	Sales	NaN	3000.0	2018.0
Eva	IT	55000.0	NaN	2019.0

2. Detecting Missing Values

print(df.isnull())      # Boolean mask of missing values
print(df.isnull().sum())  # Count missing values per column

3. Removing Missing Data with `dropna()`

# Drop rows with any missing value
df_drop_rows = df.dropna()
print("After dropping rows with NaN:")
print(df_drop_rows)

# Drop columns with any missing value
df_drop_cols = df.dropna(axis=1)
print("After dropping columns with NaN:")
print(df_drop_cols)

4. Filling Missing Data with `fillna()`

# Fill NaN with a constant
df_fill_const = df.fillna(0)
print("Fill NaN with 0:")
print(df_fill_const)

# Fill NaN with column mean
df_fill_mean = df.fillna(df.mean(numeric_only=True))
print("Fill NaN with column mean:")
print(df_fill_mean)

# Forward fill (use previous value)
df_ffill = df.fillna(method="ffill")
print("Forward Fill:")
print(df_ffill)

# Backward fill (use next value)
df_bfill = df.fillna(method="bfill")
print("Backward Fill:")
print(df_bfill)

5. Interpolating Missing Data

Interpolation fills missing values by estimating based on other values.

df_interpolated = df.interpolate(method="linear")
print("After Interpolation:")
print(df_interpolated)

6. Example Use Cases

dropna() → Useful when you can safely ignore missing rows/columns.
fillna() → Use when you have a default value, mean, median, or want forward/backward filling.
interpolate() → Best for numerical data like time series.

✅ Now you can practice handling missing data directly with this dataset.

Perfect example 👌 — you’ve shown a time series dataset where some dates are missing.
To demonstrate time-based interpolation, let’s recreate this dataset with gaps in the dates and then use Pandas interpolation.

📘 Example 1: Interpolating Missing Dates in Time Series

import pandas as pd
import numpy as np

# Sample data with missing dates
data = {
    "day": [
        "1/1/2017", "1/4/2017", "1/5/2017", 
        "1/6/2017", "1/7/2017", "1/8/2017", 
        "1/9/2017", "1/10/2017", "1/11/2017"
    ],
    "temperature": [32, np.nan, 28, np.nan, 32, np.nan, np.nan, 34, 40],
    "windspeed": [6, 9, np.nan, 7, np.nan, np.nan, np.nan, 8, 12],
    "event": ["Rain", "Sunny", "Snow", np.nan, "Rain", "Sunny", np.nan, "Cloudy", "Sunny"]
}

df = pd.DataFrame(data)
df["day"] = pd.to_datetime(df["day"])  # convert to datetime
df.set_index("day", inplace=True)

print("Original DataFrame with Missing Dates:")
print(df)

# Reindex to include all dates
date_range = pd.date_range(start="2017-01-01", end="2017-01-11")
df = df.reindex(date_range)

print("\nDataFrame with Full Date Range (NaN for missing dates):")
print(df)

# Interpolate missing numeric values
df_interpolated = df.interpolate(method="time")

print("\nDataFrame after Time Interpolation:")
print(df_interpolated)

🔑 Explanation

Missing Dates are filled using pd.date_range() and reindex().
interpolate(method="time") uses actual time gaps to estimate missing values.
Non-numeric columns like "event" remain unchanged (still NaN).

✅ With this, you can clearly see how Pandas fills gaps in time series data smoothly.

Got it 👍 — let’s take a different real-world style example to show how interpolate(method="time") works.

📘 Example 2: Stock Prices with Missing Dates

Imagine you have stock price data, but trading was suspended on some days, leaving gaps.

import pandas as pd
import numpy as np

# Stock price dataset with missing days
data = {
    "date": ["2023-01-01", "2023-01-03", "2023-01-06", "2023-01-07"],
    "stock_price": [100, np.nan, 110, 115],
    "volume": [1000, 1200, np.nan, 1500]
}

df = pd.DataFrame(data)
df["date"] = pd.to_datetime(df["date"])
df.set_index("date", inplace=True)

print("Original DataFrame:")
print(df)

# Reindex to fill missing dates
date_range = pd.date_range(start="2023-01-01", end="2023-01-07")
df = df.reindex(date_range)

print("\nDataFrame with Missing Dates:")
print(df)

# Interpolate numeric values using time
df_interpolated = df.interpolate(method="time")

print("\nAfter Time Interpolation:")
print(df_interpolated)

🔎 What happens here?

Missing 2023-01-02, 2023-01-04, 2023-01-05 are added.
stock_price and volume are estimated based on actual time gaps, not just row order.
Great for financial, weather, or sensor data where time is critical.

👉 This way, you can practice interpolation on a dataset other than weather.

Pandas – Handling Missing Data (fillna, dropna, interpolate)

Real-world datasets often contain missing values. Pandas provides powerful functions to handle them effectively.

1. Creating Sample Data

import pandas as pd
import numpy as np

# Sample dataset with missing values
data = {
    "Employee": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Department": ["HR", "IT", "Finance", "Sales", "IT"],
    "Salary": [50000, np.nan, 60000, np.nan, 55000],
    "Bonus": [5000, 4000, np.nan, 3000, np.nan],
    "JoiningYear": [2015, 2016, np.nan, 2018, 2019]
}

df = pd.DataFrame(data)
print("Original DataFrame with Missing Values:")
print(df)

Output:

Employee	Department	Salary	Bonus	JoiningYear
Alice	HR	50000.0	5000.0	2015.0
Bob	IT	NaN	4000.0	2016.0
Charlie	Finance	60000.0	NaN	NaN
David	Sales	NaN	3000.0	2018.0
Eva	IT	55000.0	NaN	2019.0

2. Detecting Missing Values

print(df.isnull())      # Boolean mask of missing values
print(df.isnull().sum())  # Count missing values per column

3. Removing Missing Data with `dropna()`

# Drop rows with any missing value
df_drop_rows = df.dropna()
print("After dropping rows with NaN:")
print(df_drop_rows)

# Drop columns with any missing value
df_drop_cols = df.dropna(axis=1)
print("After dropping columns with NaN:")
print(df_drop_cols)

4. Filling Missing Data with `fillna()`

# Fill NaN with a constant
df_fill_const = df.fillna(0)
print("Fill NaN with 0:")
print(df_fill_const)

# Fill NaN with column mean
df_fill_mean = df.fillna(df.mean(numeric_only=True))
print("Fill NaN with column mean:")
print(df_fill_mean)

# Forward fill (use previous value)
df_ffill = df.fillna(method="ffill")
print("Forward Fill:")
print(df_ffill)

# Backward fill (use next value)
df_bfill = df.fillna(method="bfill")
print("Backward Fill:")
print(df_bfill)

5. Interpolating Missing Data

Interpolation fills missing values by estimating based on other values.

df_interpolated = df.interpolate(method="linear")
print("After Interpolation:")
print(df_interpolated)

6. Example Use Cases

dropna() → Useful when you can safely ignore missing rows/columns.
fillna() → Use when you have a default value, mean, median, or want forward/backward filling.
interpolate() → Best for numerical data like time series.

✅ Now you can practice handling missing data directly with this dataset.

📘 Example 1: Interpolating Missing Dates in Time Series

import pandas as pd
import numpy as np

# Sample data with missing dates
data = {
    "day": [
        "1/1/2017", "1/4/2017", "1/5/2017", 
        "1/6/2017", "1/7/2017", "1/8/2017", 
        "1/9/2017", "1/10/2017", "1/11/2017"
    ],
    "temperature": [32, np.nan, 28, np.nan, 32, np.nan, np.nan, 34, 40],
    "windspeed": [6, 9, np.nan, 7, np.nan, np.nan, np.nan, 8, 12],
    "event": ["Rain", "Sunny", "Snow", np.nan, "Rain", "Sunny", np.nan, "Cloudy", "Sunny"]
}

df = pd.DataFrame(data)
df["day"] = pd.to_datetime(df["day"])  # convert to datetime
df.set_index("day", inplace=True)

print("Original DataFrame with Missing Dates:")
print(df)

# Reindex to include all dates
date_range = pd.date_range(start="2017-01-01", end="2017-01-11")
df = df.reindex(date_range)

print("\nDataFrame with Full Date Range (NaN for missing dates):")
print(df)

# Interpolate missing numeric values
df_interpolated = df.interpolate(method="time")

print("\nDataFrame after Time Interpolation:")
print(df_interpolated)

🔑 Explanation

Missing Dates are filled using pd.date_range() and reindex().
interpolate(method="time") uses actual time gaps to estimate missing values.
Non-numeric columns like "event" remain unchanged (still NaN).

✅ With this, you can clearly see how Pandas fills gaps in time series data smoothly.

Got it 👍 — let’s take a different real-world style example to show how interpolate(method="time") works.

📘 Example 2: Stock Prices with Missing Dates

Imagine you have stock price data, but trading was suspended on some days, leaving gaps.

import pandas as pd
import numpy as np

# Stock price dataset with missing days
data = {
    "date": ["2023-01-01", "2023-01-03", "2023-01-06", "2023-01-07"],
    "stock_price": [100, np.nan, 110, 115],
    "volume": [1000, 1200, np.nan, 1500]
}

df = pd.DataFrame(data)
df["date"] = pd.to_datetime(df["date"])
df.set_index("date", inplace=True)

print("Original DataFrame:")
print(df)

# Reindex to fill missing dates
date_range = pd.date_range(start="2023-01-01", end="2023-01-07")
df = df.reindex(date_range)

print("\nDataFrame with Missing Dates:")
print(df)

# Interpolate numeric values using time
df_interpolated = df.interpolate(method="time")

print("\nAfter Time Interpolation:")
print(df_interpolated)

🔎 What happens here?

Missing 2023-01-02, 2023-01-04, 2023-01-05 are added.
stock_price and volume are estimated based on actual time gaps, not just row order.
Great for financial, weather, or sensor data where time is critical.

👉 This way, you can practice interpolation on a dataset other than weather.

Chapters

Pandas – Handling Missing Data (fillna, dropna, interpolate)

1. Creating Sample Data

2. Detecting Missing Values

3. Removing Missing Data with dropna()

4. Filling Missing Data with fillna()

5. Interpolating Missing Data

6. Example Use Cases

📘 Example 1: Interpolating Missing Dates in Time Series

🔑 Explanation

📘 Example 2: Stock Prices with Missing Dates

🔎 What happens here?

Pandas – Handling Missing Data (fillna, dropna, interpolate)

1. Creating Sample Data

2. Detecting Missing Values

3. Removing Missing Data with dropna()

4. Filling Missing Data with fillna()

5. Interpolating Missing Data

6. Example Use Cases

📘 Example 1: Interpolating Missing Dates in Time Series

🔑 Explanation

📘 Example 2: Stock Prices with Missing Dates

🔎 What happens here?

3. Removing Missing Data with `dropna()`

4. Filling Missing Data with `fillna()`

3. Removing Missing Data with `dropna()`

4. Filling Missing Data with `fillna()`