Pandas crosstab() Method

🧩 What is Crosstab?

pandas.crosstab() is a frequency table tool that summarizes the relationship between two or more categorical variables.
It counts occurrences of combinations of categories — similar to a pivot table, but specifically for categorical comparison.

Think of it as a quick way to analyze relationships between variables like gender vs. survival, class vs. embarkation point, or smoker vs. day in a restaurant dataset.

🔹 Syntax

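pd.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)

Parameters:

index: array, Series, or list of them (values to group by on the rows)
columns: array, Series, or list of them (values to group by on the columns)
values: array or Series (optional values to aggregate; requires aggfunc)
aggfunc: function or function name such as 'sum' or 'mean' (requires values)
rownames, colnames: optional names for the row and column levels
margins: True/False (adds row and column totals)
margins_name: label used for the totals row and column (default 'All')
dropna: True/False (drops columns whose entries are all NaN)
normalize: False, True or 'all', 'index', or 'columns' (converts counts to proportions between 0 and 1)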
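Before moving to a real dataset, here is a minimal sketch of the two required arguments plus margins. The DataFrame and its smoker and day columns are invented for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "day":    ["Sat", "Sat", "Sun", "Sun", "Sun", "Sat"],
})

# Rows come from `index`, columns from `columns`; each cell is a count
counts = pd.crosstab(df["smoker"], df["day"])
print(counts)

# margins=True appends an "All" row and an "All" column holding the totals
with_totals = pd.crosstab(df["smoker"], df["day"], margins=True)
print(with_totals)
```

Each cell counts how many rows share that pair of labels, so the bottom-right "All" cell equals the number of rows.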

🧩 When Should You Use Crosstab?

Use crosstab() when you want to:

✅ Summarize categorical data
✅ Compare distributions between variables
✅ Analyze relationships between categories (e.g., gender and survival rate)
✅ Quickly compute two-way or multi-way frequency tables

Common use cases:

Gender vs. Survival — Were women more likely to survive?
Class vs. Embarkation Port — Which class boarded from which port?
Age Group vs. Survival — Which age group survived more?
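A numeric column such as age has to be turned into categories before it is useful in a crosstab. One possible approach, sketched here on invented data, is to bin it with pd.cut first. The bin edges and labels are arbitrary choices made for this example.

```python
import pandas as pd

# Hypothetical passengers, invented for illustration only
df = pd.DataFrame({
    "age":      [4, 15, 22, 35, 47, 63, 71, 29],
    "survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# pd.cut turns a numeric column into labeled categories.
# Intervals are right-closed by default: (0, 17], (17, 59], (59, 120]
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 17, 59, 120],
    labels=["child", "adult", "senior"],
)

age_table = pd.crosstab(df["age_group"], df["survived"])
print(age_table)
```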

🧩 Example: Titanic Dataset

Let’s load the Titanic dataset from Seaborn and explore. Note that sns.load_dataset() downloads the data on first use, so it needs an internet connection.

import pandas as pd
import seaborn as sns

# Load dataset
titanic = sns.load_dataset('titanic')

# Display first few rows
print(titanic.head())

🧠 Dataset Overview

Column	Description
survived	0 = No, 1 = Yes
pclass	Passenger class (1st, 2nd, 3rd)
sex	Gender
age	Age in years
sibsp	# of siblings/spouses aboard
parch	# of parents/children aboard
fare	Ticket fare
embarked	Port of embarkation (C, Q, S)
class	Class label (First, Second, Third)
who	Man, woman, or child
deck	Deck letter
embark_town	Town of embarkation
alone	True/False — if traveling alone

🧩 Example 1 — Count of Survivors by Gender

pd.crosstab(titanic['sex'], titanic['survived'])

Output:

survived	0	1
female	81	233
male	468	109

✅ Interpretation:
Of the 314 females, 233 survived and 81 did not.
Of the 577 males, only 109 survived and 468 did not.
→ Females had a much higher survival rate, even though there were fewer of them aboard.
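To see what a plain two-way crosstab computes, it can help to rebuild the same counts with groupby. The sketch below uses a small invented DataFrame rather than the Titanic data, so it runs without a download.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "sex":      ["female", "male", "male", "female", "male", "female"],
    "survived": [1, 0, 0, 1, 1, 0],
})

ct = pd.crosstab(df["sex"], df["survived"])

# Same counts built by hand: group by both keys, count rows,
# then pivot the second key into columns
by_hand = df.groupby(["sex", "survived"]).size().unstack(fill_value=0)

print(ct)
print(by_hand)
```

Both tables hold the same numbers; crosstab is simply the shorter spelling.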

🧩 Example 2 — Add Margins (Totals)

pd.crosstab(titanic['sex'], titanic['survived'], margins=True)

Output:

survived	0	1	All
female	81	233	314
male	468	109	577
All	549	342	891

✅ Adds totals for both rows and columns — similar to “Grand Totals” in Excel Pivot Tables.
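The totals label defaults to "All" and can be changed with the margins_name parameter. A short sketch on invented data:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "sex":      ["female", "male", "male", "female", "male"],
    "survived": [1, 0, 0, 1, 1],
})

# margins_name replaces the default "All" label on the totals row and column
totals = pd.crosstab(
    df["sex"], df["survived"], margins=True, margins_name="Total"
)
print(totals)
```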

🧩 Example 3 — Crosstab Between Class and Survival

pd.crosstab(titanic['class'], titanic['survived'])

survived	0	1
First	80	136
Second	97	87
Third	372	119

✅ First-class passengers had the highest survival rate: 136 of 216 (about 63%), compared with about 47% in second class and 24% in third class.

🧩 Example 4 — Normalized Crosstab (Proportions)

You can normalize results to get proportions instead of counts.

pd.crosstab(titanic['sex'], titanic['survived'], normalize='index')

survived	0	1
female	0.2580	0.7420
male	0.8111	0.1889

✅ Interpretation:

About 74% of females survived
Only about 19% of males survived
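normalize returns proportions between 0 and 1, not percentages. One way to report percentages is to multiply by 100 and round, as in this sketch on invented data:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "sex": ["female", "female", "female", "female",
            "male", "male", "male", "male"],
    "survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Each row of row_share sums to 1
row_share = pd.crosstab(df["sex"], df["survived"], normalize="index")

# Convert to percentages for reporting
row_pct = (row_share * 100).round(1)
print(row_pct)
```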

🧩 Example 5 — Multi-variable Crosstab

Let’s analyze survival based on both gender and class.

pd.crosstab([titanic['sex'], titanic['class']], titanic['survived'])

sex	class	0	1
female	First	3	91
female	Second	6	70
female	Third	72	72
male	First	77	45
male	Second	91	17
male	Third	300	47

✅ Shows a detailed breakdown — e.g., 91 first-class females survived, only 47 third-class males survived.
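A list can also be passed as the columns argument, which produces hierarchical column labels instead of hierarchical row labels. The sketch below uses invented data and shows how a single cell can be selected with a tuple.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "sex":      ["female", "male", "male", "female", "male", "female"],
    "class":    ["First", "Third", "First", "Third", "Third", "First"],
    "survived": [1, 0, 1, 0, 0, 1],
})

# Two keys on the column side give a two-level column MultiIndex
wide = pd.crosstab(df["sex"], [df["class"], df["survived"]])
print(wide)

# Select one cell with a (class, survived) tuple
first_class_female_survivors = wide.loc["female", ("First", 1)]
print(first_class_female_survivors)
```

Only combinations that actually occur in the data appear as columns.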

🧩 Example 6 — Crosstab with an Aggregation Function

You can include a numeric column (values) and an aggregation function (aggfunc) to compute statistics.

Example: Average fare by class and survival.

pd.crosstab(
    titanic['class'],
    titanic['survived'],
    values=titanic['fare'],
    aggfunc='mean'
)

survived	0	1
First	64.68	95.61
Second	19.41	22.06
Third	13.67	13.69

✅ Interpretation:
Survivors paid higher fares on average in first and second class, most visibly in first class. In third class the difference is negligible.
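With values and aggfunc, crosstab behaves much like DataFrame.pivot_table. The main practical difference is that crosstab takes arrays or Series, while pivot_table takes column names. A sketch comparing the two on invented fares:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "class":    ["First", "First", "Third", "Third", "First", "Third"],
    "survived": [1, 0, 0, 1, 1, 0],
    "fare":     [80.0, 60.0, 8.0, 10.0, 100.0, 7.0],
})

# Mean fare per (class, survived) cell via crosstab
ct = pd.crosstab(
    df["class"], df["survived"], values=df["fare"], aggfunc="mean"
)

# The same table via pivot_table, using column names instead of Series
pt = df.pivot_table(
    index="class", columns="survived", values="fare", aggfunc="mean"
)

print(ct)
print(pt)
```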

🧩 Example 7 — Crosstab with `normalize='columns'`

Normalize column-wise to see proportions per survival outcome.

pd.crosstab(titanic['sex'], titanic['survived'], normalize='columns')

survived	0	1
female	0.1475	0.6813
male	0.8525	0.3187

✅ 68% of all survivors were female.
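The choice of normalize decides what sums to 1. This sketch, on invented data, checks that 'columns' makes every column sum to 1, while 'all' makes the whole table sum to 1.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "sex": ["female", "female", "female", "female",
            "male", "male", "male", "male"],
    "survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Share of each sex within each survival outcome
col_share = pd.crosstab(df["sex"], df["survived"], normalize="columns")

# Share of each cell in the whole table
all_share = pd.crosstab(df["sex"], df["survived"], normalize="all")

print(col_share.sum(axis=0))   # one total per column
print(all_share.values.sum())  # a single grand total
```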

🧩 Example 8 — Multiple Index Variables with Margins

Let’s explore how class and embark_town relate to survival.

pd.crosstab(
    [titanic['class'], titanic['embark_town']],
    titanic['survived'],
    margins=True
)

class	embark_town	0	1	All
First	Cherbourg	26	59	85
First	Queenstown	1	1	2
First	Southampton	53	74	127
Second	Cherbourg	8	9	17
Second	Queenstown	1	2	3
Second	Southampton	88	76	164
Third	Cherbourg	41	25	66
Third	Queenstown	45	27	72
Third	Southampton	286	67	353
All		549	340	889

✅ This gives a detailed breakdown of survival per class and boarding port. Two passengers have no recorded embark_town, so they are left out and the grand total is 889 rather than 891.
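Rows with a missing grouping key are left out of a crosstab, which can make the counts add up to less than the number of rows. If those rows should stay visible, one option is to fill the gaps with a placeholder label first. The sketch below uses invented data, and the "Unknown" label is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing towns, invented for illustration only
df = pd.DataFrame({
    "embark_town": ["Cherbourg", "Southampton", np.nan, "Southampton", np.nan],
    "survived":    [1, 0, 1, 0, 1],
})

# Default behavior: rows with a missing key are not counted
default = pd.crosstab(df["embark_town"], df["survived"])

# Filling the gaps first keeps those rows in their own category
filled = pd.crosstab(df["embark_town"].fillna("Unknown"), df["survived"])

print(default)
print(filled)
```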

🧩 Example 9 — Crosstab Visualization

Crosstabs can be visualized as heatmaps for better understanding.

import seaborn as sns
import matplotlib.pyplot as plt

ct = pd.crosstab(titanic['class'], titanic['survived'], normalize='index')

sns.heatmap(ct, annot=True, cmap='coolwarm')
plt.title("Survival Rate by Passenger Class")
plt.ylabel("Passenger Class")
plt.xlabel("Survived")
plt.show()

✅ The heatmap visually shows which classes had higher survival rates.

🧩 Summary

Concept	Description	Example
`pd.crosstab()`	Create frequency tables	`pd.crosstab(df['A'], df['B'])`
`margins=True`	Adds totals	`pd.crosstab(A,B,margins=True)`
`normalize='index'`	Row-wise proportions (each row sums to 1)	`normalize='index'`
`normalize='columns'`	Column-wise proportions (each column sums to 1)	`normalize='columns'`
`values` + `aggfunc`	Aggregate numeric data	`values=df['fare'], aggfunc='mean'`
Multiple levels	Multi-variable grouping	`pd.crosstab([A,B],C)`

✅ Key Takeaways

pd.crosstab() is perfect for categorical analysis and summarization.
Use it for comparing variables like gender, class, or survival.
Combine it with margins and normalize for deeper insights.
You can even aggregate numeric values (e.g., mean fare).
Great for exploratory data analysis (EDA) and reporting.