By Ankit Srivastava — Data Analytics Trainer
Chapter Objective
In this chapter, you will learn how to visualize data distribution using Histograms and Box Plots in Plotly.
These charts are extremely valuable for analysts, marketers, and data scientists when understanding:
✅ How data is spread
✅ What ranges values fall into
✅ Where data clusters occur
✅ Presence of outliers
✅ Variation in performance
While bar charts compare categories, histograms and box plots allow us to study numerical data behavior — essential for data profiling, customer analysis, marketing funnel evaluation, and predictive modeling.
Why Study Data Distribution?
Understanding how your data behaves helps answer critical questions like:
- Are most customers purchasing within a certain price range?
- What is the average time spent on a website?
- Are most video views clustered around a particular duration?
- What’s the variation in monthly ad spend?
- Are there outliers like extremely high cost per lead values?
For modern analytics-driven business roles — digital marketing, product analysis, finance, HR analytics — distribution insight is foundational.
Histograms in Plotly
What is a Histogram?
A histogram shows the frequency distribution of a numeric dataset by splitting values into bins (ranges).
Example:
If user session duration values are grouped as:
- 0–30 seconds
- 30–60 seconds
- 60–90 seconds
…it allows us to see how many users fall into each time bracket.
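The bracket counting above can be sketched with NumPy alone. The ten duration values below are made up for illustration; `np.histogram` performs the same grouping a histogram chart does before it draws the bars.

```python
import numpy as np

# Hypothetical session durations in seconds (illustrative values only)
durations = np.array([12, 25, 31, 45, 58, 61, 75, 88, 29, 47])

# Bin edges matching the brackets above: 0-30, 30-60, 60-90
bin_edges = [0, 30, 60, 90]

# counts[i] is the number of values in [bin_edges[i], bin_edges[i+1]);
# the last bin also includes its right edge
counts, _ = np.histogram(durations, bins=bin_edges)

for low, high, n in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{low}-{high} seconds: {n} users")
```

Each printed count corresponds to the height of one histogram bar.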
Use-Cases in Business & Marketing
| Use-case | Example |
|---|---|
| Audience behavior | Time on site distribution |
| Paid ads | CPC / CPA distribution |
| Sales | Order value distribution |
| Customer analytics | Age or income distribution |
| E-commerce | Basket value variation |
Histograms support data-driven decisions by revealing patterns such as high-bounce users, premium buyers, or unprofitable ad segments.
Simple Histogram Example
Let’s simulate website session durations:
import plotly.express as px
import numpy as np

# Generate sample session duration data
session_time = np.random.normal(loc=120, scale=40, size=400)  # mean=120 seconds, std=40

fig = px.histogram(session_time,
                   nbins=20,
                   title="Distribution of Website Session Duration (in seconds)",
                   labels={'value': 'Session Duration (seconds)'},
                   opacity=0.75)
fig.show()
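✅ np.random.normal draws 400 values from a normal distribution (mean 120, standard deviation 40) to stand in for real session data
✅ nbins=20 sets the maximum number of bins; Plotly picks a rounded bin size, so the chart may show fewer than 20 bars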
📌 Insight Example
You may observe that most users spend 90–150 seconds on the site, which suggests an engaged audience.
🧩 Histogram with Color Grouping
Let’s visualize session duration by traffic channel:
import pandas as pd
import plotly.express as px
import numpy as np

np.random.seed(42)

data = pd.DataFrame({
    'Session Duration': np.random.normal(120, 30, 500).tolist() + np.random.normal(150, 20, 500).tolist(),
    'Channel': ['Organic']*500 + ['Paid']*500
})

fig = px.histogram(data,
                   x="Session Duration",
                   color="Channel",
                   barmode="overlay",
                   nbins=25,
                   title="Session Duration Distribution by Traffic Source")
fig.show()
✅ Overlaying the two histograms shows which source drives longer sessions
✅ Paid traffic has a higher average session time? → The ads are reaching a relevant audience
✅ Organic sessions are shorter? → Improve SEO content quality
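A visual comparison is easier to trust when the numbers behind it are checked as well. The sketch below rebuilds a similar two-channel dataset (the seed and the `summary` name are illustrative assumptions) and compares the channels with a pandas groupby.

```python
import numpy as np
import pandas as pd

# Simulated data with the same shape as the example above
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "Session Duration": np.concatenate([
        rng.normal(120, 30, 500),   # Organic: lower mean, wider spread
        rng.normal(150, 20, 500),   # Paid: higher mean, tighter spread
    ]),
    "Channel": ["Organic"] * 500 + ["Paid"] * 500,
})

# Mean, median, and standard deviation of session duration per channel
summary = (
    data.groupby("Channel")["Session Duration"]
        .agg(["mean", "median", "std"])
        .round(1)
)
print(summary)
```

If the Paid row shows a higher mean and median, the shift seen in the overlay reflects a real difference between the channels rather than an artifact of the bin choice.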
🎨 Customize Histogram Style
fig.update_layout(
    xaxis_title="Session Duration (sec)",
    yaxis_title="User Count",
    template="plotly_white"
)
Add a marginal plot (a small box or violin plot drawn above the histogram):
fig = px.histogram(data,
                   x="Session Duration",
                   color="Channel",
                   marginal="box",  # or "violin"
                   nbins=30)
fig.show()
📌 Extremely effective in dashboards for deeper distribution insight.
📦 Box Plots in Plotly
🔎 What is a Box Plot?
A box plot displays:
- Median (middle value)
- Quartiles (25th and 75th percentiles)
- Spread / Variability
- Outliers
It’s like a compact statistical summary.
Key Advantage: Easily spot unusual values.
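The numbers a box plot draws can be computed by hand, which makes the chart easier to read. The sketch below uses ten made-up CPL values and the common 1.5 × IQR rule for flagging outliers, which is also how Plotly places its whiskers by default. Plotly's quartile calculation can differ slightly from NumPy's default interpolation, so treat the values as approximate.

```python
import numpy as np

# Hypothetical cost-per-lead values in dollars, with one extreme value
cpl = np.array([18, 20, 21, 19, 22, 20, 23, 21, 19, 60], dtype=float)

# Quartiles and median (NumPy's default linear interpolation)
q1, median, q3 = np.percentile(cpl, [25, 50, 75])

# Interquartile range: the height of the box
iqr = q3 - q1

# Fences: values beyond these are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = cpl[(cpl < lower_fence) | (cpl > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Outliers: {outliers}")
```

Here the single $60 lead falls above the upper fence, so a box plot of this data would draw it as an isolated point.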
Why Box Plots Matter
| Benefit | Reason |
|---|---|
| Detect outliers | E.g., extreme CPL spikes |
| Understand variation | See stable vs volatile metrics |
| Benchmark performance | Compare multiple segments |
| Funnel analytics | Identify drop-off variance |
Example: If LinkedIn Ads shows extreme cost variance, those campaigns need more optimization.
Basic Box Plot Example
import plotly.express as px
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'CPL ($)': np.random.normal(20, 5, 300).tolist() + np.random.normal(35, 8, 300).tolist(),
    'Campaign': ['Google']*300 + ['LinkedIn']*300
})

fig = px.box(data,
             x='Campaign',
             y='CPL ($)',
             title='Cost Per Lead Distribution by Campaign')
fig.show()
✅ Shows LinkedIn has higher median and more variance
📌 Useful conclusion: LinkedIn leads cost more & fluctuate → reconsider budget strategy
Box Plot with Points (Swarm Style)
To see individual data points:
fig = px.box(data,
             x='Campaign',
             y='CPL ($)',
             points='all',
             title='CPL Distribution with Data Points')
fig.show()
✅ Shows actual spread + outliers
✅ Crucial for granular audit & anomaly detection
Multiple Category Comparison
Example: CPC across 3 channels:
data = pd.DataFrame({
    'CPC ($)': np.random.normal(1.5, .3, 300).tolist()
               + np.random.normal(2.2, .4, 300).tolist()
               + np.random.normal(1.1, .2, 300).tolist(),
    'Platform': ['Google']*300 + ['LinkedIn']*300 + ['Meta']*300
})

fig = px.box(data, x='Platform', y='CPC ($)', title="CPC Variation by Platform")
fig.show()
Interpretation:
- Meta has the lowest CPC
- LinkedIn is the costliest and shows the most variance
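The same reading can be confirmed numerically. The sketch below regenerates comparable CPC data (the seed and the `summary` name are illustrative assumptions) and computes the median and interquartile range per platform, the two quantities each box encodes.

```python
import numpy as np
import pandas as pd

# Simulated CPC data with the same parameters as the example above
rng = np.random.default_rng(7)
data = pd.DataFrame({
    "CPC ($)": np.concatenate([
        rng.normal(1.5, 0.3, 300),   # Google
        rng.normal(2.2, 0.4, 300),   # LinkedIn
        rng.normal(1.1, 0.2, 300),   # Meta
    ]),
    "Platform": ["Google"] * 300 + ["LinkedIn"] * 300 + ["Meta"] * 300,
})

grouped = data.groupby("Platform")["CPC ($)"]

# Median = line inside the box; IQR = height of the box
summary = pd.DataFrame({
    "median": grouped.median(),
    "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
})
print(summary.round(2))

print("Lowest median CPC:", summary["median"].idxmin())
print("Widest spread:", summary["iqr"].idxmax())
```

A table like this is handy alongside the chart in a report, since it states the comparison in exact figures.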
Real Marketing/Business Interpretation Guide
| Signal | Interpretation |
|---|---|
| High median | Campaign generally costly |
| Low median | Efficient targeting |
| Long whiskers | High inconsistency = need optimization |
| Many outliers | Data issues / audience mismatch / bidding error |
Combine Histogram + Box Plot
Plotly can combine both for deeper insight:
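# `data` now holds the CPC table, so rebuild the CPL data first
data = pd.DataFrame({
    'CPL ($)': np.random.normal(20, 5, 300).tolist() + np.random.normal(35, 8, 300).tolist(),
    'Campaign': ['Google']*300 + ['LinkedIn']*300
})
fig = px.histogram(data,
                   x="CPL ($)",
                   marginal="box",
                   title="CPL Distribution + Box Insights")
fig.show()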
✅ Top-tier visualization technique for dashboards
✅ Helps evaluate ROI volatility
Real-World Case Study
Scenario
You want to analyze lead cost distribution across ad platforms.
| Platform | Median CPL | Variation | Insight |
|---|---|---|---|
| Google | Low | Stable | Best ROI, scale budget |
| Meta | Medium | Moderate | Balanced, optimize campaign types |
| LinkedIn | High | High | Expensive, limit to high-value leads |
Outcome:
Allocate 60% of next month's budget to Google, 30% to Meta, and 10% to LinkedIn, a split backed by the distribution data.
Summary Table
| Chart | Best For | Insight |
|---|---|---|
| Histogram | Data spread & frequency | Where most values lie |
| Grouped Histogram | Segment comparison | Audience quality by source |
| Box Plot | Quartiles, median, outliers | Stability vs volatility |
| Box + Points | Detailed insight | Actual data distribution |
| Histogram + Box | Best practice for analysts | Complete distribution story |
Final Key Learning
| Concept | Value |
|---|---|
| Histograms | See how many values fall in each range |
| Box Plots | Visualize statistical summary + outliers |
| Business Advantage | Identify profitable segments & anomalies |
What’s Next?
In Chapter 5 – Bubble Charts & Heatmaps in Plotly, you will learn:
- Multi-variable bubble charts
- Heatmaps for correlations & user behavior
- Marketing analytics use cases (CPC vs CTR vs Conversions)
- Website tracking heatmaps & BI dashboards
