Working with text data is one of the most common and essential parts of data analysis. In real-world datasets, whether they hold customer names, product reviews, email addresses, or location data, strings play a central role. Python’s Pandas library provides a powerful and flexible way to handle textual data using vectorized string functions through the .str accessor.
These functions are expressive, propagate missing values automatically, and save you from writing explicit loops over Python’s native string methods.
In this chapter, you’ll learn how to clean, manipulate, and extract information from text data using Pandas string functions.
⚙️ Creating a Sample Dataset
Let’s begin by creating a simple Pandas DataFrame for our examples:
import pandas as pd
data = {
    'Name': ['Alice Johnson', 'bob smith', 'CHARLIE ADAMS', 'David Jones', None],
    'Email': ['alice@gmail.com', 'bob_smith@yahoo.com', 'charlie@outlook.com', 'david@gmail.com', 'eve@hotmail.com'],
    'City': ['New York', 'los angeles', 'CHICAGO', 'Houston', 'seattle'],
    'Review': ['Excellent service!', 'average product', 'BAD experience', 'Good support', '']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Email City Review
0 Alice Johnson alice@gmail.com New York Excellent service!
1 bob smith bob_smith@yahoo.com los angeles average product
2 CHARLIE ADAMS charlie@outlook.com CHICAGO BAD experience
3 David Jones david@gmail.com Houston Good support
4 NaN eve@hotmail.com seattle
🔡 1. Changing Case of Strings
The .str accessor supports vectorized case conversions:
👉 Convert to Lowercase
df['Name_lower'] = df['Name'].str.lower()
👉 Convert to Uppercase
df['Name_upper'] = df['Name'].str.upper()
👉 Capitalize First Letter of Each Word
df['Name_title'] = df['Name'].str.title()
Output:
Name Name_lower Name_upper Name_title
0 Alice Johnson alice johnson ALICE JOHNSON Alice Johnson
1 bob smith bob smith BOB SMITH Bob Smith
2 CHARLIE ADAMS charlie adams CHARLIE ADAMS Charlie Adams
3 David Jones david jones DAVID JONES David Jones
4 NaN NaN NaN NaN
✂️ 2. Removing Extra Spaces
Sometimes text fields include unwanted leading or trailing spaces.
👉 Using strip(), lstrip(), and rstrip()
df['City_left'] = df['City'].str.lstrip()   # leading whitespace only
df['City_right'] = df['City'].str.rstrip()  # trailing whitespace only
df['City'] = df['City'].str.strip()         # both ends
strip() removes whitespace from both ends of each string, lstrip() from the start only, and rstrip() from the end only.
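The sample City column happens to contain no stray spaces, so the three methods are easier to tell apart on deliberately messy values. The sketch below uses a hypothetical `cities` Series (not part of the chapter's dataset) and compares string lengths, since the whitespace itself is invisible when printed.

```python
import pandas as pd

# Hypothetical messy values; the chapter's City column has no stray spaces.
cities = pd.Series(["  New York ", "Chicago  ", "  Houston"])

both = cities.str.strip()    # trims both ends
left = cities.str.lstrip()   # trims the start only
right = cities.str.rstrip()  # trims the end only

# Lengths make the otherwise invisible differences easy to see.
print(pd.DataFrame({
    "original": cities.str.len(),
    "strip": both.str.len(),
    "lstrip": left.str.len(),
    "rstrip": right.str.len(),
}))
```

Whitespace inside a value (the space in "New York") is left alone by all three methods.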
🔍 3. Searching Substrings
Pandas makes it easy to check if a substring exists in a string.
👉 Using contains()
df['is_gmail'] = df['Email'].str.contains('gmail')
Output:
0 True
1 False
2 False
3 True
4 False
You can also use regular expressions:
df['has_numbers'] = df['Name'].str.contains(r'\d', na=False)
This checks if a name contains any digits.
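A few optional arguments of contains() are worth knowing when the result is used as a filter. The sketch below uses a hypothetical `names` Series; it shows `na=False` for missing values, `case=False` for case-insensitive matching, and `regex=False` for treating the pattern as plain text.

```python
import pandas as pd

# Hypothetical values, including a missing entry and a regex metacharacter
names = pd.Series(["Alice Johnson", "bob smith", None, "R2-D2 (test)"])

# na=False reports missing entries as False, which is safe for filtering.
has_o = names.str.contains("o", na=False)
# case=False ignores letter case.
any_case = names.str.contains("ALICE", case=False, na=False)
# regex=False treats the pattern literally, so "(" is not read as a regex group.
literal = names.str.contains("(", regex=False, na=False)

# The boolean result can be used directly as a row filter.
print(names[has_o])
```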
🔄 4. Replacing Substrings
👉 Using replace()
df['Email_provider'] = df['Email'].str.replace('@gmail.com', '@googlemail.com', regex=False)
👉 Regex-based replacement
df['Review_clean'] = df['Review'].str.replace(r'[!?.]', '', regex=True)
This removes the characters !, ?, and . from the review text.
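Whether the pattern is read as a regular expression changes the result whenever it contains special characters such as a dot. Here is a minimal sketch with a hypothetical `emails` Series, where the second address is made up specifically to trigger the difference.

```python
import pandas as pd

# The second address is a contrived value that only a regex "." can match.
emails = pd.Series(["alice@gmail.com", "bob@gmailxcom.org"])

# regex=False: "." matches only a literal dot.
literal = emails.str.replace("gmail.com", "googlemail.com", regex=False)
# regex=True: an unescaped "." matches any character, so "gmailxcom" is rewritten too.
pattern = emails.str.replace("gmail.com", "googlemail.com", regex=True)

print(literal.tolist())
print(pattern.tolist())
```

Passing `regex` explicitly also keeps the behavior the same across Pandas versions, since the default has changed over time.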
📏 5. Measuring String Length
👉 Using len()
df['Name_length'] = df['Name'].str.len()
Output:
0 13.0
1 9.0
2 13.0
3 11.0
4 NaN
The lengths are shown as floats because the column contains a missing value.
✂️ 6. Extracting Substrings
👉 Using slice() or index ranges
df['First_5_chars'] = df['Email'].str.slice(0, 5)
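The index-range form is the same slice syntax you would use on a plain Python string, applied through .str. A small sketch, using a hypothetical `emails` Series:

```python
import pandas as pd

emails = pd.Series(["alice@gmail.com", "bob_smith@yahoo.com"])

first5_slice = emails.str.slice(0, 5)  # method form
first5_index = emails.str[:5]          # equivalent index-range form
last3 = emails.str[-3:]                # negative indices count from the end

print(first5_index.tolist(), last3.tolist())
```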
👉 Extracting domain names
df['Domain'] = df['Email'].str.split('@').str[1]
Output:
0 gmail.com
1 yahoo.com
2 outlook.com
3 gmail.com
4 hotmail.com
🎯 7. Extracting Data with Regular Expressions
Pandas .str.extract() pulls the capture groups of a regular expression out of each string, which makes it very useful for structured text.
Example 1: Extracting the email username
df['Username'] = df['Email'].str.extract(r'(^[\w\.-]+)')
Output:
0 alice
1 bob_smith
2 charlie
3 david
4 eve
Example 2: Extract domain without “.com”
df['Provider'] = df['Email'].str.extract(r'@([a-z]+)')
Output:
0 gmail
1 yahoo
2 outlook
3 gmail
4 hotmail
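When a pattern has several capture groups, extract() returns one column per group, and named groups supply the column names. The sketch below assumes a hypothetical `emails` Series and group names (`user`, `provider`, `tld`) chosen just for illustration.

```python
import pandas as pd

emails = pd.Series(["alice@gmail.com", "bob_smith@yahoo.com", "not-an-email"])

# Named groups become column names; rows that do not match are filled with NaN.
parts = emails.str.extract(
    r"^(?P<user>[\w.-]+)@(?P<provider>[a-z]+)\.(?P<tld>[a-z]+)$"
)
print(parts)
```

Anchoring the pattern with ^ and $ means a value must match as a whole, so malformed entries show up as missing rows rather than partial matches.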
🧱 8. Concatenating Strings
Combine multiple columns or add custom text.
df['Full_Info'] = df['Name'].str.title() + ' (' + df['Email'] + ')'
Output:
0 Alice Johnson (alice@gmail.com)
1 Bob Smith (bob_smith@yahoo.com)
2 Charlie Adams (charlie@outlook.com)
3 David Jones (david@gmail.com)
4 NaN
Because the name in row 4 is missing, the concatenated result for that row is missing too.
You can also use .str.cat():
df['Name_City'] = df['Name'].str.cat(df['City'], sep=' - ')
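By default, .str.cat() yields a missing result whenever either side is missing. Its `na_rep` argument substitutes a placeholder instead. A minimal sketch with hypothetical `names` and `cities` Series:

```python
import pandas as pd

names = pd.Series(["Alice", None, "Charlie"])
cities = pd.Series(["New York", "Seattle", None])

# Default: a missing value on either side makes the whole result missing.
default = names.str.cat(cities, sep=" - ")
# na_rep replaces missing values with a placeholder before joining.
filled = names.str.cat(cities, sep=" - ", na_rep="?")

print(filled.tolist())
```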
🔢 9. Counting Substrings
# str.count() has no case parameter, so lowercase the text first
df['count_a'] = df['Name'].str.lower().str.count('a')
Output:
0 1.0
1 0.0
2 3.0
3 1.0
4 NaN
📚 10. Checking String Patterns
Pandas provides many boolean-check string methods:
Method | Description | Example |
---|---|---|
startswith() | Checks if a string starts with a pattern | df['Email'].str.startswith('a') |
endswith() | Checks if a string ends with a pattern | df['Email'].str.endswith('.com') |
isalpha() | True if all characters are alphabetic | df['Name'].str.replace(' ', '').str.isalpha() |
isdigit() | True if all characters are digits | df['Name'].str.isdigit() |
isalnum() | True if all characters are alphanumeric | df['Name'].str.replace(' ', '').str.isalnum() |
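These checks return a boolean Series, so they work well as row filters. The sketch below uses a hypothetical `values` Series and drops the missing entry first so that every mask is purely True or False.

```python
import pandas as pd

# Hypothetical mixed-quality values; the missing entry is dropped up front.
values = pd.Series(["2024", "abc", "abc123", "", None]).dropna()

is_digit = values.str.isdigit()
is_alpha = values.str.isalpha()
is_alnum = values.str.isalnum()

# A boolean result can be used directly as a row filter.
digits_only = values[is_digit]
print(digits_only.tolist())
```

Note that the empty string fails all three checks, just as it does with Python's own string methods.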
🧹 11. Splitting and Expanding Text Columns
👉 Split string into multiple columns
df[['First_Name', 'Last_Name']] = df['Name'].str.split(' ', n=1, expand=True)
Output:
First_Name Last_Name
0 Alice Johnson
1 bob smith
2 CHARLIE ADAMS
3 David Jones
4 NaN NaN
👉 Access parts of the split
df['City_First_Word'] = df['City'].str.split().str[0]
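The `n` argument limits how many splits are made, and rsplit() does the same from the right. That matters for names with more than two words. A sketch with a hypothetical `names` Series:

```python
import pandas as pd

names = pd.Series(["Mary Ann Smith", "David Jones", "Cher"])

# n=1 splits at the first space only; everything after it stays together.
first_rest = names.str.split(" ", n=1, expand=True)
# rsplit works from the right, which isolates the last word instead.
rest_last = names.str.rsplit(" ", n=1, expand=True)

print(first_rest)
print(rest_last)
```

A value with no separator ("Cher") fills the first column and leaves the second one missing.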
🔠 12. String Alignment
👉 Left, right, and center alignment
df['Left_align'] = df['City'].str.ljust(15, '*')
df['Right_align'] = df['City'].str.rjust(15, '*')
df['Center_align'] = df['City'].str.center(15, '*')
This is especially useful for formatting reports.
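To see what the padding looks like, here is a small sketch on a hypothetical two-value `cities` Series with a width of 11, chosen so that one value needs padding and the other already fills the width.

```python
import pandas as pd

cities = pd.Series(["Houston", "los angeles"])

left = cities.str.ljust(11, "*")     # pad on the right
right = cities.str.rjust(11, "*")    # pad on the left
center = cities.str.center(11, "*")  # pad on both sides

print(left.tolist())    # ['Houston****', 'los angeles']
print(right.tolist())   # ['****Houston', 'los angeles']
print(center.tolist())  # ['**Houston**', 'los angeles']
```

Values that are already at least as long as the requested width are returned unchanged, never truncated.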
🧠 13. Handling Missing Values in String Columns
Pandas string functions propagate missing values (a NaN input gives a NaN output) instead of raising an error, but you can fill them explicitly:
df['Name_filled'] = df['Name'].fillna('Unknown')
df['Name_clean'] = df['Name_filled'].str.strip().str.title()
📊 14. Practical Example – Cleaning Customer Data
Let’s combine what we learned:
customers = pd.DataFrame({
    'FullName': [' john doe ', 'MARY SMITH', ' Alice Johnson', None],
    'Email': ['JOHN@GMAIL.COM', 'mary@Yahoo.Com', 'alice@outlook.com', '']
})
customers['FullName'] = customers['FullName'].str.strip().str.title()
customers['Email'] = customers['Email'].str.lower()
customers['Username'] = customers['Email'].str.extract(r'(^[\w\.-]+)')
customers['Provider'] = customers['Email'].str.extract(r'@([a-z]+)')
Result:
FullName Email Username Provider
0 John Doe john@gmail.com john gmail
1 Mary Smith mary@yahoo.com mary yahoo
2 Alice Johnson alice@outlook.com alice outlook
3 None NaN NaN
✅ Clean and ready-to-analyze customer dataset.
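If you run the same cleanup on several files, it can help to wrap the steps in a function. The following is only a sketch: the function name `clean_customers` is hypothetical, the column names are taken from the example above, and the extra step that collapses repeated inner spaces is an addition not shown earlier.

```python
import pandas as pd

def clean_customers(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; expects 'FullName' and 'Email' columns."""
    out = frame.copy()
    out["FullName"] = (
        out["FullName"]
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)  # collapse repeated inner spaces
        .str.title()
    )
    out["Email"] = out["Email"].str.strip().str.lower()
    # Named groups give one column per group, named Username and Provider.
    parts = out["Email"].str.extract(r"^(?P<Username>[\w.-]+)@(?P<Provider>[a-z]+)")
    return out.join(parts)

# Hypothetical input with the same kinds of problems as the example above
customers = pd.DataFrame({
    "FullName": ["  john   doe ", "MARY SMITH", None],
    "Email": ["JOHN@GMAIL.COM", "mary@Yahoo.Com ", ""],
})
cleaned = clean_customers(customers)
print(cleaned)
```

Working on a copy leaves the original DataFrame untouched, which makes it easier to compare before and after.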
🧮 15. Combining with apply() for Custom Operations
You can combine string functions with Python lambdas for flexibility.
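df['Initials'] = df['Name'].dropna().apply(lambda x: ''.join(n[0].upper() for n in x.split()))
Output:
0 AJ
1 BS
2 CA
3 DJ
4 NaN
Calling dropna() first matters: str(None) is the text 'None', which would otherwise produce the bogus initial N for the missing row.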
🪄 16. Using str.extractall() for Multiple Matches
Suppose you have multiple numbers in a string:
data = pd.DataFrame({'Text': ['Order 123 and 456', 'ID 789 only']})
matches = data['Text'].str.extractall(r'(\d+)')
print(matches)
Output:
0
match
0 0 123
1 456
1 0 789
Each match is stored as a separate row with an additional index level (match).
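Because the result has one row per match, a common next step is to fold the matches back to one entry per original row. The sketch below adds a hypothetical third row with no digits to show that rows without matches are absent from the extractall() result.

```python
import pandas as pd

data = pd.DataFrame({"Text": ["Order 123 and 456", "ID 789 only", "no digits"]})
matches = data["Text"].str.extractall(r"(\d+)")

# Collect all matches of each original row into a list (column 0 = first group).
per_row = matches[0].groupby(level=0).agg(list)
# Rows with no match do not appear in `matches`, so reindex to count them as 0.
counts = matches.groupby(level=0).size().reindex(data.index, fill_value=0)

print(per_row)
print(counts)
```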
🧰 17. Combining String Functions with groupby()
You can group text-based data efficiently:
df['Provider'] = df['Email'].str.extract(r'@([a-z]+)')
provider_count = df.groupby('Provider').size()
print(provider_count)
Output:
Provider
gmail 2
hotmail 1
outlook 1
yahoo 1
dtype: int64
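The same tally can be produced without adding a column. As a sketch, passing `expand=False` makes extract() return a Series for a single group, and value_counts() then counts the values. The `emails` Series below repeats the sample addresses.

```python
import pandas as pd

emails = pd.Series([
    "alice@gmail.com", "bob_smith@yahoo.com", "charlie@outlook.com",
    "david@gmail.com", "eve@hotmail.com",
])

# expand=False returns a Series (not a DataFrame) when there is one group.
provider = emails.str.extract(r"@([a-z]+)", expand=False)
counts = provider.value_counts()  # sorted by frequency, highest first

print(counts)
```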
🧩 18. String Aggregation
Use groupby() with a join to combine the strings in each group into a single value:
names_joined = df.groupby('Provider')['Name'].apply(lambda x: ', '.join(x.dropna()))
print(names_joined)
Output:
Provider
gmail Alice Johnson, David Jones
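hotmail
outlook CHARLIE ADAMS
yahoo bob smith
The hotmail group contains only a missing name, so dropna() leaves nothing to join and the result is an empty string.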
📖 19. Summary Table of Common Pandas String Functions
Category | Function | Description |
---|---|---|
Case | .str.lower() , .str.upper() , .str.title() | Change letter casing |
Whitespace | .str.strip() , .str.lstrip() , .str.rstrip() | Remove leading/trailing whitespace |
Search | .str.contains() , .str.startswith() , .str.endswith() | Search patterns |
Replace | .str.replace() | Replace substrings |
Split/Join | .str.split() , .str.cat() | Split or concatenate |
Extract | .str.extract() , .str.extractall() | Extract substrings via regex |
Length | .str.len() | Compute string length |
Count | .str.count() | Count occurrences |
Validation | .str.isalpha() , .str.isdigit() , .str.isalnum() | Check character types |
🧾 20. Conclusion
Text data cleaning is one of the most time-consuming parts of data science projects. Fortunately, Pandas string functions make this process smooth, efficient, and highly expressive.
With the .str accessor, you can apply transformations, extract structured data from unstructured text, and prepare textual data for analytics and machine learning tasks.
Remember:
- Use vectorized methods (not loops) for performance.
- Combine regex and .str.extract() for structured pattern extraction.
- Always handle missing values and inconsistent casing before analysis.
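To make the first point concrete, here is a small sketch comparing a hand-written loop with the .str version on a hypothetical `names` Series. Both give the same cleaned values, but the loop has to guard against missing entries itself.

```python
import pandas as pd

names = pd.Series(["  alice johnson", "BOB SMITH ", None])

# Loop version: needs an explicit check so missing values do not raise an error.
looped = pd.Series(
    [n.strip().title() if isinstance(n, str) else n for n in names]
)

# Vectorized version: missing values pass through without extra code.
vectorized = names.str.strip().str.title()

print(vectorized)
```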