What is Crosstab?
pandas.crosstab() is a frequency table tool that summarizes the relationship between two or more categorical variables.
It counts how often each combination of categories occurs, similar to a pivot table, but geared specifically toward comparing categorical variables.
Think of it as a quick way to analyze relationships between variables like gender vs. survival, class vs. embarkation point, or smoker vs. day in a restaurant dataset.
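To make the "frequency table" idea concrete, here is a minimal sketch on a tiny made-up DataFrame (the `smoker` and `day` values are invented, not taken from a real dataset). It builds the table with `pd.crosstab()` and then rebuilds the same counts by hand with `groupby`, which is roughly what a crosstab saves you from writing.

```python
import pandas as pd

# Tiny made-up dataset, just to show the mechanics.
df = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "day":    ["Sat", "Sat", "Sun", "Sun", "Sun", "Sat"],
})

# Frequency table: rows are smoker values, columns are day values.
ct = pd.crosstab(df["smoker"], df["day"])
print(ct)

# The same counts built manually with groupby + unstack.
manual = df.groupby(["smoker", "day"]).size().unstack(fill_value=0)
print(manual)
```

Both tables hold the same counts; the crosstab version is simply shorter and fills missing combinations with 0 on its own.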
Syntax
pd.crosstab(index, columns, values=None, aggfunc=None, margins=False, normalize=False)
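Parameters (the most commonly used ones; the full signature also accepts rownames, colnames, margins_name, and dropna):
- index: array or Series, values to group by on the rows
- columns: array or Series, values to group by on the columns
- values: array or Series, optional values to aggregate (requires aggfunc)
- aggfunc: aggregation function (like np.sum, np.mean, or a name such as 'mean')
- margins: True/False, adds row and column totals
- normalize: converts counts to proportions; accepts 'index', 'columns', or 'all' (True behaves like 'all')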
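Because `index` and `columns` accept plain arrays as well as Series, the result has no axis names unless you supply them. The sketch below uses invented data and assumes NumPy arrays as input; `rownames` and `colnames` are existing `pd.crosstab()` parameters, while the names `group` and `outcome` are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
group = np.array(["a", "b", "a", "b", "a"])
outcome = np.array(["win", "win", "loss", "win", "win"])

# NumPy arrays carry no names, so label the two axes explicitly.
ct = pd.crosstab(group, outcome, rownames=["group"], colnames=["outcome"])
print(ct)
```

When you pass Series instead (as in the Titanic examples), the Series names are used automatically and these two parameters are usually unnecessary.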
When Should You Use Crosstab?
Use crosstab() when you want to:
- Summarize categorical data
- Compare distributions between variables
- Analyze relationships between categories (e.g., gender and survival rate)
- Quickly compute two-way or multi-way frequency tables
Common use cases:
- Gender vs. Survival β Were women more likely to survive?
- Class vs. Embarkation Port β Which class boarded from which port?
- Age Group vs. Survival β Which age group survived more?
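The "Age Group vs. Survival" case needs one extra step, because age is numeric and has to be binned into categories first. Here is a minimal sketch using `pd.cut()`; the sample rows, the bin edges, and the labels are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical mini-sample; ages and outcomes are invented.
df = pd.DataFrame({
    "age":      [4, 15, 22, 35, 47, 61, 8, 29],
    "survived": [1, 0, 1, 0, 0, 0, 1, 1],
})

# Bin edges and labels are assumptions; intervals are right-inclusive,
# e.g. (0, 12] is "child" and (12, 18] is "teen".
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)

# Count survivors and non-survivors within each age group.
ct = pd.crosstab(df["age_group"], df["survived"])
print(ct)
```

On the real Titanic data you would apply the same two steps to `titanic['age']`, keeping in mind that rows with a missing age fall outside every bin and are left out of the table.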
Example: Titanic Dataset
Let's load the Titanic dataset from Seaborn and explore.
import pandas as pd
import seaborn as sns
# Load dataset
titanic = sns.load_dataset('titanic')
# Display first few rows
print(titanic.head())
Dataset Overview
| Column | Description |
|---|---|
| survived | 0 = No, 1 = Yes |
| pclass | Passenger class (1st, 2nd, 3rd) |
| sex | Gender |
| age | Age in years |
| sibsp | # of siblings/spouses aboard |
| parch | # of parents/children aboard |
| fare | Ticket fare |
| embarked | Port of embarkation (C, Q, S) |
| class | Class label (First, Second, Third) |
| who | Man, woman, or child |
| deck | Deck letter |
| embark_town | Town of embarkation |
| alone | True if traveling alone, otherwise False |
Example 1: Count of Survivors by Gender
pd.crosstab(titanic['sex'], titanic['survived'])
Output:
| survived | 0 | 1 |
|---|---|---|
| female | 81 | 233 |
| male | 468 | 109 |
Interpretation:
Of the 314 women aboard, 233 survived and 81 did not.
Of the 577 men, only 109 survived.
That works out to a survival rate of roughly 74% for women versus 19% for men.
Example 2: Add Margins (Totals)
pd.crosstab(titanic['sex'], titanic['survived'], margins=True)
Output:
| survived | 0 | 1 | All |
|---|---|---|---|
| female | 81 | 233 | 314 |
| male | 468 | 109 | 577 |
| All | 549 | 342 | 891 |
This adds totals for both rows and columns, similar to "Grand Totals" in Excel Pivot Tables.
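The totals row and column are labeled "All" by default. If you would rather use another label, `pd.crosstab()` also takes a `margins_name` argument. A small sketch on invented data (the five rows below are not from the Titanic dataset):

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "sex":      ["female", "male", "male", "female", "male"],
    "survived": [1, 0, 0, 1, 1],
})

# margins=True adds the totals; margins_name renames them from "All".
ct = pd.crosstab(df["sex"], df["survived"], margins=True, margins_name="Total")
print(ct)
```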
Example 3: Crosstab Between Class and Survival
pd.crosstab(titanic['class'], titanic['survived'])
| survived | 0 | 1 |
|---|---|---|
| First | 80 | 136 |
| Second | 97 | 87 |
| Third | 372 | 119 |
First-class passengers had the highest survival rate: 136 of 216 (about 63%), compared with 87 of 184 (47%) in second class and 119 of 491 (24%) in third.
Example 4: Normalized Crosstab (Proportions)
You can normalize results to get proportions instead of counts.
pd.crosstab(titanic['sex'], titanic['survived'], normalize='index')
| survived | 0 | 1 |
|---|---|---|
| female | 0.2580 | 0.7420 |
| male | 0.8111 | 0.1889 |
Interpretation:
- About 74% of females survived
- Only about 19% of males survived
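Two small variations are often handy here: turning the proportions into rounded percentages, and normalizing over the whole table with `normalize='all'`. The sketch below uses an invented eight-row sample rather than the Titanic data, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up sample: four women (three survived) and four men (one survived).
df = pd.DataFrame({
    "sex":      ["female"] * 4 + ["male"] * 4,
    "survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Row proportions converted to percentages, rounded to one decimal place.
row_pct = (
    pd.crosstab(df["sex"], df["survived"], normalize="index")
    .mul(100)
    .round(1)
)
print(row_pct)

# normalize="all" divides every cell by the grand total, so cells sum to 1.
overall = pd.crosstab(df["sex"], df["survived"], normalize="all")
print(overall)
```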
Example 5: Multi-variable Crosstab
Let's analyze survival based on both gender and class.
pd.crosstab([titanic['sex'], titanic['class']], titanic['survived'])
| sex | class | 0 | 1 |
|---|---|---|---|
| female | First | 3 | 91 |
| female | Second | 6 | 70 |
| female | Third | 72 | 72 |
| male | First | 77 | 45 |
| male | Second | 91 | 17 |
| male | Third | 300 | 47 |
This shows a detailed breakdown: for example, 91 first-class females survived, while only 47 third-class males did.
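The same list syntax works on the column side, which gives hierarchical (MultiIndex) columns instead of a hierarchical index. A minimal sketch on invented rows; the column name `cls` is a placeholder standing in for a class-like variable.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "sex":      ["female", "male", "male", "female", "male", "female"],
    "cls":      ["First", "Third", "First", "Third", "Third", "First"],
    "survived": [1, 0, 1, 0, 0, 1],
})

# A list in the `columns` position produces two column levels.
ct = pd.crosstab(df["sex"], [df["cls"], df["survived"]])
print(ct)

# Select one cell with a (cls, survived) tuple as the column key.
print(ct.loc["female", ("First", 1)])
```

Which side gets the extra level is mostly a matter of readability: long tables tend to be easier to scan than wide ones.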
Example 6: Crosstab with an Aggregation Function
You can include a numeric column (values) and an aggregation function (aggfunc) to compute statistics.
Example: Average fare by class and survival.
pd.crosstab(
    titanic['class'],
    titanic['survived'],
    values=titanic['fare'],
    aggfunc='mean'
)
| survived | 0 | 1 |
|---|---|---|
| First | 64.68 | 95.61 |
| Second | 19.41 | 22.06 |
| Third | 13.67 | 13.69 |
Interpretation:
Survivors paid higher average fares in first class (about 95.61 vs. 64.68) and, more modestly, in second class. In third class the two averages are nearly identical.
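One way to sanity-check an aggregated crosstab is to compute the same statistic with `groupby` and compare. The sketch below does that on a small invented table (the column name `cls` and all fares are made up). It also shows that a combination with no rows comes out as NaN rather than 0 when `aggfunc` is used.

```python
import pandas as pd

# Made-up fares for illustration only.
df = pd.DataFrame({
    "cls":      ["First", "First", "Second", "Third", "Third", "Third"],
    "survived": [1, 0, 1, 0, 0, 1],
    "fare":     [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Mean fare for each (cls, survived) combination.
mean_fare = pd.crosstab(
    df["cls"], df["survived"], values=df["fare"], aggfunc="mean"
)
print(mean_fare)

# Cross-check: the same means via groupby + unstack.
check = df.groupby(["cls", "survived"])["fare"].mean().unstack()
print(check)
```

Here no second-class passenger in the sample died, so the (Second, 0) cell is NaN in both tables.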
Example 7: Crosstab with normalize='columns'
Normalize column-wise to see proportions per survival outcome.
pd.crosstab(titanic['sex'], titanic['survived'], normalize='columns')
| survived | 0 | 1 |
|---|---|---|
| female | 0.1475 | 0.6813 |
| male | 0.8525 | 0.3187 |
About 68% of all survivors were female.
Example 8: Using Multiple Columns for the Index
Let's explore how class and embark_town relate to survival.
pd.crosstab(
    [titanic['class'], titanic['embark_town']],
    titanic['survived'],
    margins=True
)
| class | embark_town | 0 | 1 | All |
|---|---|---|---|---|
| First | Cherbourg | 26 | 59 | 85 |
| First | Queenstown | 1 | 1 | 2 |
| First | Southampton | 53 | 74 | 127 |
| Second | Cherbourg | 8 | 9 | 17 |
| Second | Queenstown | 1 | 2 | 3 |
| Second | Southampton | 88 | 76 | 164 |
| Third | Cherbourg | 41 | 25 | 66 |
| Third | Queenstown | 45 | 27 | 72 |
| Third | Southampton | 286 | 67 | 353 |
| All | | 549 | 340 | 889 |
This gives a detailed breakdown of survival per class and boarding port. The grand total is 889 rather than 891 because two passengers have no recorded embark_town, and crosstab() leaves rows with missing keys out of the table.
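Rows with a missing value in one of the grouping variables are dropped from a crosstab, which can quietly shrink the totals. If you want those rows counted, one option is to fill the gaps with an explicit label first. A minimal sketch on invented data; the label "Unknown" and the column name `town` are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Made-up rows; two of the five have no town recorded.
df = pd.DataFrame({
    "town":     ["Cherbourg", "Southampton", np.nan, "Southampton", np.nan],
    "survived": [1, 0, 1, 0, 1],
})

# Default behavior: rows with a missing town are left out of the table.
default_ct = pd.crosstab(df["town"], df["survived"], margins=True)
print(default_ct)

# Filling the gaps first keeps every row, under an explicit "Unknown" label.
labeled_ct = pd.crosstab(df["town"].fillna("Unknown"), df["survived"], margins=True)
print(labeled_ct)
```

Comparing the two grand totals is a quick way to see how many rows were excluded.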
Example 9: Crosstab Visualization
Crosstabs can be visualized as heatmaps for better understanding.
import seaborn as sns
import matplotlib.pyplot as plt
ct = pd.crosstab(titanic['class'], titanic['survived'], normalize='index')
sns.heatmap(ct, annot=True, cmap='coolwarm')
plt.title("Survival Rate by Passenger Class")
plt.ylabel("Passenger Class")
plt.xlabel("Survived")
plt.show()
The heatmap shows at a glance which classes had higher survival rates.
Summary
| Concept | Description | Example |
|---|---|---|
| pd.crosstab() | Create frequency tables | pd.crosstab(df['A'], df['B']) |
| margins=True | Adds row and column totals | pd.crosstab(A, B, margins=True) |
| normalize='index' | Row-wise proportions | normalize='index' |
| normalize='columns' | Column-wise proportions | normalize='columns' |
| values + aggfunc | Aggregate numeric data | values=df['fare'], aggfunc='mean' |
| Multiple levels | Multi-variable grouping | pd.crosstab([A, B], C) |
Key Takeaways
- pd.crosstab() is well suited to categorical analysis and summarization.
- Use it for comparing variables like gender, class, or survival.
- Combine it with margins and normalize for deeper insights.
- You can even aggregate numeric values (e.g., mean fare).
- Great for exploratory data analysis (EDA) and reporting.
