Welcome! This chapter gives a practical, hands-on introduction to pandas — the essential Python library for data manipulation and analysis. By the end you’ll understand pandas’ core concepts, be able to load and explore datasets, perform common transformations, and know where to look next.
## What is pandas and why use it?

pandas is a high-level library built on top of NumPy that provides fast, flexible, and expressive data structures designed to make working with structured/tabular data both easy and intuitive.
Use pandas when you need to:
- Load and save data (CSV, Excel, SQL, JSON, Parquet).
- Clean, reshape, and transform datasets.
- Compute summary statistics and group-based aggregations.
- Work with time-series data.
- Prepare data for visualization or machine learning.
## Install & import

Install with pip (or conda):

```bash
pip install pandas
# or
conda install pandas
```

Import in Python:

```python
import pandas as pd
```
## Core data structures
- Series — 1D labeled array (like a column).
- DataFrame — 2D labeled table (rows × columns). Primary workhorse.
- Index — labels for rows or columns (can be numbers, strings, datetimes).
Example:

```python
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Cara'],
    'age': [25, 30, 22],
    'score': [88.5, 92.0, 79.0]
})
```
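To see how labels behave in practice, here is a minimal sketch using the same toy data (the values are made up for illustration):

```python
import pandas as pd

# A Series pairs values with an index of labels.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])        # label-based access -> 20
print(s.iloc[0])     # position-based access -> 10

# A DataFrame is a dict-like collection of Series sharing one row index.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Cara'],
    'age': [25, 30, 22],
})
print(list(df.columns))   # the column Index -> ['name', 'age']
print(df.loc[1, 'name'])  # row label 1, column 'name' -> 'Bob'
```

Note that `s['b']` looks values up by label while `s.iloc[0]` looks them up by position; keeping the two access styles separate avoids a common class of indexing bugs.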
## Creating & inspecting DataFrames

From CSV:

```python
df = pd.read_csv('data.csv')
```

Quick inspection:

```python
df.head()      # first 5 rows
df.tail()      # last 5 rows
df.info()      # dtypes, non-null counts
df.describe()  # numeric summary stats
df.shape       # (rows, cols)
df.columns     # column labels
```
## Indexing & selection (essentials)

- Column access: `df['col']` or `df.col` (when the column name is a valid Python identifier).
- Row selection by integer position: `df.iloc[2]` (3rd row).
- Row/column by label: `df.loc[<row_label>, <col_label>]`.
- Slicing: `df[10:20]`, or `df.loc['2020-01-01':'2020-01-31']` for time series.
Examples:

```python
# select a column
ages = df['age']

# boolean selection
high_score = df[df['score'] > 90]

# select specific rows & columns
# (note: .loc slices are inclusive of both endpoints)
subset = df.loc[1:3, ['name', 'score']]
```
## Common operations

- Add / transform columns: `df['passed'] = df['score'] >= 75`, `df['age_plus_ten'] = df['age'] + 10`
- Sorting: `df.sort_values('score', ascending=False)`
- Renaming: `df.rename(columns={'score': 'exam_score'}, inplace=True)`
- Dropping: `df.drop(['col1', 'col2'], axis=1, inplace=True)`
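Putting these operations together on the toy DataFrame from earlier (the reassignment style shown here is an alternative to `inplace=True`, which returns `None` and is easy to misuse in chains):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Cara'],
    'age': [25, 30, 22],
    'score': [88.5, 92.0, 79.0],
})

# add a derived boolean column
df['passed'] = df['score'] >= 75

# sort descending by score; sort_values returns a new DataFrame
ranked = df.sort_values('score', ascending=False)
print(ranked.iloc[0]['name'])  # 'Bob' has the top score

# rename and drop by reassigning instead of inplace=True
df = df.rename(columns={'score': 'exam_score'})
df = df.drop(columns=['age'])
print(list(df.columns))  # ['name', 'exam_score', 'passed']
```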
## Handling missing data

- Detect: `df.isna()`, `df.isnull().sum()`
- Drop rows/cols: `df.dropna()` or `df.dropna(axis=1)`
- Fill missing: `df.fillna(0)`, or forward-fill with `df.ffill()` (the older `df.fillna(method='ffill')` is deprecated)

Example:

```python
df['age'] = df['age'].fillna(df['age'].median())
```
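A small self-contained demonstration on made-up data with deliberate gaps:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 22, np.nan],
    'score': [88.5, 92.0, np.nan, 79.0],
})

# count missing values per column
print(df.isna().sum())  # age: 2, score: 1

# fill ages with the median of the observed values (25 and 22 -> 23.5)
df['age'] = df['age'].fillna(df['age'].median())
print(df['age'].tolist())  # [25.0, 23.5, 22.0, 23.5]

# forward-fill the remaining score gap from the previous row
df = df.ffill()
print(int(df.isna().sum().sum()))  # 0: no missing values left
```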
## GroupBy and aggregation

Powerful for summarizing data:

```python
grouped = df.groupby('department')['salary'].agg(['mean', 'median', 'count'])
```

Chaining example:

```python
(df[df['score'] >= 50]
 .groupby('class')['score']
 .agg(['mean', 'max', 'min'])
 .reset_index())
```
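The chain above can be run end to end on a small invented dataset; each step narrows or reshapes the data, which is why this filter-group-aggregate pattern reads top to bottom:

```python
import pandas as pd

df = pd.DataFrame({
    'class': ['A', 'A', 'B', 'B', 'B'],
    'score': [40, 80, 60, 90, 70],
})

# keep passing scores, then summarize per class
summary = (df[df['score'] >= 50]
           .groupby('class')['score']
           .agg(['mean', 'max', 'min'])
           .reset_index())
print(summary)
# class A keeps only the 80; class B keeps 60, 90, 70
```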
## Merging, joining, concatenation

- Concatenate rows: `pd.concat([df1, df2])`
- Merge like SQL: `pd.merge(left, right, on='id', how='inner')`
- Join on index: `df1.join(df2, how='left')`
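A sketch of how `how='inner'` and `how='left'` differ, using two invented frames with partially overlapping ids:

```python
import pandas as pd

employees = pd.DataFrame({'id': [1, 2, 3],
                          'name': ['Alice', 'Bob', 'Cara']})
salaries = pd.DataFrame({'id': [1, 2, 4],
                         'salary': [70000, 80000, 60000]})

# inner merge keeps only ids present in both frames (1 and 2)
inner = pd.merge(employees, salaries, on='id', how='inner')
print(inner['name'].tolist())  # ['Alice', 'Bob']

# left merge keeps every employee; unmatched salaries become NaN
left = pd.merge(employees, salaries, on='id', how='left')
print(int(left['salary'].isna().sum()))  # 1 (Cara has no salary row)
```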
## Time series basics

- Convert to datetime: `df['date'] = pd.to_datetime(df['date'])`
- Set index: `df.set_index('date', inplace=True)`
- Resample: `df.resample('M').sum()` (monthly; pandas 2.2+ prefers the `'ME'` alias)
- Rolling windows: `df['value'].rolling(window=7).mean()`
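A runnable sketch with a synthetic daily series of constant 1.0 values, chosen so the expected results are easy to check by hand:

```python
import pandas as pd

# 60 days of a constant value, one row per day, starting 2024-01-01
idx = pd.date_range('2024-01-01', periods=60, freq='D')
df = pd.DataFrame({'value': [1.0] * 60}, index=idx)

# monthly totals: January has 31 days, so its sum of 1.0s is 31.0
monthly = df.resample('M').sum()  # 'ME' on pandas >= 2.2
print(monthly['value'].iloc[0])   # 31.0

# a 7-day rolling mean of a constant series is the constant itself;
# the first 6 rows are NaN because the window is not yet full
rolled = df['value'].rolling(window=7).mean()
print(rolled.iloc[-1])            # 1.0
```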
## IO: save results

```python
df.to_csv('cleaned.csv', index=False)
df.to_excel('report.xlsx', index=False)
df.to_parquet('data.parquet')
```
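`index=False` matters for CSV round-trips: without it, the row labels are written as an extra unnamed column and reappear on re-read. A quick sketch using a temporary directory so no files are left behind:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [88.5, 92.0]})

# round-trip through CSV; index=False skips writing the row labels
path = os.path.join(tempfile.mkdtemp(), 'cleaned.csv')
df.to_csv(path, index=False)
restored = pd.read_csv(path)
print(restored.equals(df))  # True: values and dtypes survive the trip
```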
## Performance tips

- Use vectorized operations (`df['col'] + 1`) instead of Python loops.
- When memory is tight, specify dtypes: `pd.read_csv(..., dtype={'id': 'int32'})`.
- Use the categorical dtype for repeated strings: `df['cat'] = df['cat'].astype('category')`.
- For very large datasets, consider `dask.dataframe` or working in chunks with `pd.read_csv(..., chunksize=...)`.
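The categorical-dtype tip can be verified directly: a category column stores each distinct string once plus small integer codes, so for repetitive data it uses far less memory than plain object strings. A sketch on synthetic data:

```python
import pandas as pd

# a column with many repeats of only three distinct strings
s = pd.Series(['red', 'green', 'blue'] * 10_000)

as_object = s.memory_usage(deep=True)
as_category = s.astype('category').memory_usage(deep=True)

print(as_category < as_object)  # True: category is much smaller here
```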
## Small end-to-end example

```python
import pandas as pd

# load
df = pd.read_csv('sales.csv')

# clean
df['date'] = pd.to_datetime(df['date'])
df.dropna(subset=['amount'], inplace=True)

# transform
df['month'] = df['date'].dt.to_period('M')
summary = (df.groupby(['region', 'month'])['amount']
           .sum()
           .reset_index()
           .sort_values(['region', 'month']))

# save
summary.to_csv('monthly_sales_by_region.csv', index=False)
```
## Exercises (quick practice)

- Load a CSV, display the first 10 rows, and show the number of missing values per column.
- Create a new column `revenue_per_unit = revenue / units_sold` and show the top 5 rows by this metric.
- Group sales by `product_category` and compute `sum`, `mean`, and `count` of `revenue`.
- Convert a `timestamp` column to datetime, set it as the index, and resample to daily totals.
## Where to go next

- Deep-dive into advanced indexing (`.loc`, `.iloc`, `.at`, `.iat`).
- Detailed time-series handling and window functions.
- `merge`/`join` patterns for complex data models.
- Performance tuning and using pandas with machine learning pipelines.