Welcome! This chapter gives a practical, hands-on introduction to pandas — the essential Python library for data manipulation and analysis. By the end you’ll understand pandas’ core concepts, be able to load and explore datasets, perform common transformations, and know where to look next.


What is pandas and why use it?

pandas is a high-level library built on top of NumPy that provides fast, flexible, and expressive data structures designed to make working with structured/tabular data both easy and intuitive.

Use pandas when you need to:

  • Load and save data (CSV, Excel, SQL, JSON, Parquet).
  • Clean, reshape, and transform datasets.
  • Compute summary statistics and group-based aggregations.
  • Work with time-series data.
  • Prepare data for visualization or machine learning.

Install & import

Install with pip (or conda):

pip install pandas
# or
conda install pandas

Import in Python:

import pandas as pd

Core data structures

  1. Series — 1D labeled array (like a column).
  2. DataFrame — 2D labeled table (rows × columns). Primary workhorse.
  3. Index — labels for rows or columns (can be numbers, strings, datetimes).

Example:

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df = pd.DataFrame({
    'name': ['Alice','Bob','Cara'],
    'age': [25, 30, 22],
    'score': [88.5, 92.0, 79.0]
})
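Once built, values can be pulled out by label or by position (using the same toy data as above):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# label-based lookup on a Series
b_val = s['b']        # the value stored under label 'b'

# position-based lookup
first = s.iloc[0]     # the first element, regardless of its label

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Cara'],
    'age': [25, 30, 22],
    'score': [88.5, 92.0, 79.0]
})

# a DataFrame column is itself a Series, sharing the DataFrame's row index
ages = df['age']
```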

Creating & inspecting DataFrames

From CSV:

df = pd.read_csv('data.csv')

Quick inspection:

df.head()         # first 5 rows
df.tail()         # last 5 rows
df.info()         # dtypes, non-null counts
df.describe()     # numeric summary stats
df.shape          # (rows, cols)
df.columns        # column labels

Indexing & selection (essentials)

  • Column access: df['col'], or df.col when the name is a valid Python identifier and doesn't clash with an existing DataFrame attribute.
  • Row selection by integer position: df.iloc[2] (third row).
  • Row/column by label: df.loc[<row_label>, <col_label>].
  • Slicing: df[10:20] (by position), or df.loc['2020-01-01':'2020-01-31'] for a datetime index (label slices include both endpoints).

Examples:

# select columns
ages = df['age']

# boolean selection
high_score = df[df['score'] > 90]

# select specific rows & columns (loc label slices include both endpoints)
subset = df.loc[1:3, ['name','score']]

Common operations

  • Add / transform columns: df['passed'] = df['score'] >= 75; df['age_plus_ten'] = df['age'] + 10
  • Sorting: df.sort_values('score', ascending=False) (returns a new DataFrame)
  • Renaming: df = df.rename(columns={'score': 'exam_score'})
  • Dropping: df = df.drop(columns=['col1', 'col2'])
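Applied to the small name/age/score DataFrame from earlier, these operations look like this (column names are the same illustrative ones used above):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Cara'],
    'age': [25, 30, 22],
    'score': [88.5, 92.0, 79.0]
})

# add / transform columns
df['passed'] = df['score'] >= 75
df['age_plus_ten'] = df['age'] + 10

# sorting returns a new DataFrame; the original is untouched
ranked = df.sort_values('score', ascending=False)

# renaming via assignment keeps the original intact
renamed = df.rename(columns={'score': 'exam_score'})

# dropping a column also returns a new DataFrame
slim = df.drop(columns=['age_plus_ten'])
```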

Handling missing data

  • Detect: df.isna() (element-wise mask), df.isna().sum() (count per column)
  • Drop rows/cols: df.dropna() or df.dropna(axis=1)
  • Fill missing: df.fillna(0), or forward-fill with df.ffill() (fillna(method='ffill') is deprecated)

Example:

df['age'] = df['age'].fillna(df['age'].median())
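On a small frame with invented gaps, detection and filling fit together like this:

```python
import pandas as pd
import numpy as np

# toy data with one missing value per column
df = pd.DataFrame({'age': [25, np.nan, 22], 'city': ['NY', 'LA', None]})

# count missing values per column
missing = df.isna().sum()

# fill age with the median; plain assignment avoids the chained-inplace pattern
df['age'] = df['age'].fillna(df['age'].median())

# forward-fill the remaining gap (ffill replaces the deprecated fillna(method='ffill'))
df['city'] = df['city'].ffill()
```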

GroupBy and aggregation

Powerful for summarizing data:

grouped = df.groupby('department')['salary'].agg(['mean','median','count'])

Chaining example:

(df[df['score']>=50]
 .groupby('class')['score']
 .agg(['mean','max','min'])
 .reset_index()
)
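The same chain runs end-to-end on a small made-up roster (the class/score columns here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'class': ['A', 'A', 'B', 'B', 'B'],
    'score': [70, 40, 85, 90, 55],
})

# filter out failing scores, then summarize per class
result = (df[df['score'] >= 50]
          .groupby('class')['score']
          .agg(['mean', 'max', 'min'])
          .reset_index())
```

Note that the filter runs before the groupby, so the dropped row (class A, score 40) never reaches the aggregation.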

Merging, joining, concatenation

  • Concatenate rows: pd.concat([df1, df2])
  • Merge like SQL: pd.merge(left, right, on='id', how='inner')
  • Join on index: df1.join(df2, how='left')
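The difference between join types is easiest to see on two tiny invented tables:

```python
import pandas as pd

customers = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Cara']})
orders = pd.DataFrame({'id': [1, 1, 3], 'amount': [50, 20, 70]})

# inner merge keeps only ids present in both frames (id 2 has no orders)
merged = pd.merge(customers, orders, on='id', how='inner')

# left merge keeps every customer; missing order amounts become NaN
left = pd.merge(customers, orders, on='id', how='left')
```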

Time series basics

  • Convert to datetime: df['date'] = pd.to_datetime(df['date'])
  • Set index: df.set_index('date', inplace=True)
  • Resample: df.resample('M').sum() (monthly totals; pandas ≥ 2.2 prefers the 'ME' alias)
  • Rolling windows: df['value'].rolling(window=7).mean()
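Resampling and rolling windows on a short synthetic daily series (ten days of counting data, chosen just for illustration):

```python
import pandas as pd

# ten consecutive days of values 0..9, starting Monday 2024-01-01
idx = pd.date_range('2024-01-01', periods=10, freq='D')
ts = pd.DataFrame({'value': range(10)}, index=idx)

# aggregate daily values into weekly totals (weeks end on Sunday by default)
weekly = ts.resample('W').sum()

# 3-day rolling mean; the first two entries stay NaN until the window fills
ts['rolling'] = ts['value'].rolling(window=3).mean()
```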

IO: save results

df.to_csv('cleaned.csv', index=False)
df.to_excel('report.xlsx', index=False)
df.to_parquet('data.parquet')

Performance tips

  • Use vectorized operations (df['col'] + 1) instead of Python loops.
  • When memory is tight: specify dtypes (pd.read_csv(..., dtype={'id': 'int32'})).
  • Use categorical dtype for repeated strings: df['cat'] = df['cat'].astype('category').
  • For very large datasets consider dask.dataframe or working in chunks with pd.read_csv(..., chunksize=...).
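The categorical-dtype saving is easy to demonstrate on a column of repeated strings (the data below is synthetic):

```python
import pandas as pd

# a column with only two distinct strings, repeated many times
df = pd.DataFrame({'city': ['NY', 'LA', 'NY', 'LA', 'NY'] * 1000})

before = df['city'].memory_usage(deep=True)
df['city'] = df['city'].astype('category')
after = df['city'].memory_usage(deep=True)

# categorical storage keeps one copy of each unique string
# plus a compact array of integer codes, so 'after' is much smaller
```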

Small end-to-end example

import pandas as pd

# load
df = pd.read_csv('sales.csv')

# clean
df['date'] = pd.to_datetime(df['date'])
df.dropna(subset=['amount'], inplace=True)

# transform
df['month'] = df['date'].dt.to_period('M')
summary = (df.groupby(['region', 'month'])
             .amount.sum()
             .reset_index()
             .sort_values(['region','month']))

# save
summary.to_csv('monthly_sales_by_region.csv', index=False)

Exercises (quick practice)

  1. Load a CSV, display first 10 rows, and show the number of missing values per column.
  2. Create a new column revenue_per_unit = revenue / units_sold and show the top 5 rows by this metric.
  3. Group sales by product_category and compute sum, mean, and count of revenue.
  4. Convert a timestamp column to datetime, set it as index, and resample daily totals.

Where to go next

  • Deep-dive into advanced indexing (.loc, .iloc, .at, .iat).
  • Detailed time-series handling and window functions.
  • merge/join patterns for complex data models.
  • Performance tuning and using pandas with machine learning pipelines.
