Working with text data is one of the most common and essential parts of data analysis. In real-world datasets—whether it’s customer names, product reviews, email addresses, or location data—strings play a central role. Python’s Pandas library provides a powerful and flexible way to handle textual data using vectorized string functions through the .str accessor.

These functions are concise and expressive, propagate missing values automatically, and are usually far more convenient than calling Python’s native string methods in an explicit loop.

In this chapter, you’ll learn how to clean, manipulate, and extract information from text data using Pandas string functions.


⚙️ Creating a Sample Dataset

Let’s begin by creating a simple Pandas DataFrame for our examples:

import pandas as pd

data = {
    'Name': ['Alice Johnson', 'bob smith', 'CHARLIE ADAMS', 'David Jones', None],
    'Email': ['alice@gmail.com', 'bob_smith@yahoo.com', 'charlie@outlook.com', 'david@gmail.com', 'eve@hotmail.com'],
    'City': ['New York', 'los angeles', 'CHICAGO', 'Houston', 'seattle'],
    'Review': ['Excellent service!', 'average product', 'BAD experience', 'Good support', '']
}

df = pd.DataFrame(data)
print(df)

Output:

            Name                Email         City              Review
0  Alice Johnson      alice@gmail.com     New York  Excellent service!
1      bob smith  bob_smith@yahoo.com  los angeles     average product
2  CHARLIE ADAMS  charlie@outlook.com      CHICAGO      BAD experience
3    David Jones      david@gmail.com      Houston        Good support
4            NaN      eve@hotmail.com      seattle

🔡 1. Changing Case of Strings

The .str accessor supports vectorized case conversions:

👉 Convert to Lowercase

df['Name_lower'] = df['Name'].str.lower()

👉 Convert to Uppercase

df['Name_upper'] = df['Name'].str.upper()

👉 Capitalize First Letter of Each Word

df['Name_title'] = df['Name'].str.title()

Output:

            Name     Name_lower     Name_upper     Name_title
0  Alice Johnson  alice johnson  ALICE JOHNSON  Alice Johnson
1      bob smith      bob smith      BOB SMITH      Bob Smith
2  CHARLIE ADAMS  charlie adams  CHARLIE ADAMS  Charlie Adams
3    David Jones    david jones    DAVID JONES    David Jones
4            NaN            NaN            NaN            NaN
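As a small illustration of why case normalization matters, the sketch below uses a made-up Series (`cities` here is hypothetical, not the chapter's DataFrame). It contrasts capitalize() with title() and shows that lowercasing makes differently cased spellings compare equal.

```python
import pandas as pd

# Hypothetical values: the same city typed three different ways
cities = pd.Series(['los angeles', 'LOS ANGELES', 'Los Angeles'])

capitalized = cities.str.capitalize()  # only the first character is uppercased
titled = cities.str.title()            # the first letter of every word is uppercased

# After lowercasing, all three spellings collapse into one distinct value
n_unique = cities.str.lower().nunique()

print(capitalized.tolist())  # ['Los angeles', 'Los angeles', 'Los angeles']
print(titled.tolist())       # ['Los Angeles', 'Los Angeles', 'Los Angeles']
print(n_unique)              # 1
```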

✂️ 2. Removing Extra Spaces

Sometimes text fields include unwanted leading or trailing spaces.

👉 Using strip(), lstrip(), and rstrip()

df['City'] = df['City'].str.strip()
df['City_left'] = df['City'].str.lstrip()
df['City_right'] = df['City'].str.rstrip()

strip() removes whitespace from both ends, lstrip() only from the left, and rstrip() only from the right.
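The sample cities contain no stray whitespace, so the difference between the three methods is easier to see on a small made-up Series. The values in `padded` below are hypothetical and not part of the chapter's dataset.

```python
import pandas as pd

# Hypothetical values with leading and/or trailing spaces
padded = pd.Series(['  New York ', 'Chicago  ', '  Houston'])

both = padded.str.strip()    # trims both ends
left = padded.str.lstrip()   # trims the left side only
right = padded.str.rstrip()  # trims the right side only

print(both.tolist())   # ['New York', 'Chicago', 'Houston']
print(left.tolist())   # ['New York ', 'Chicago  ', 'Houston']
print(right.tolist())  # ['  New York', 'Chicago', '  Houston']
```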


🔍 3. Searching Substrings

Pandas makes it easy to check if a substring exists in a string.

👉 Using contains()

df['is_gmail'] = df['Email'].str.contains('gmail')

Output:

0     True
1    False
2    False
3     True
4    False

You can also use regular expressions:

df['has_numbers'] = df['Name'].str.contains(r'\d', na=False)

This checks if a name contains any digits.
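contains() also accepts case, regex, and na arguments. The following is a minimal sketch on a hypothetical `emails` Series showing a case-insensitive search, a literal (non-regex) search, and how na=False turns missing values into False.

```python
import pandas as pd

# Hypothetical data: mixed case plus one missing value
emails = pd.Series(['alice@GMAIL.com', 'bob@yahoo.com', None])

# case=False ignores letter case; na=False maps missing values to False
is_gmail = emails.str.contains('gmail', case=False, na=False)

# regex=False treats the pattern as literal text, so '.' matches only a dot
has_dot_com = emails.str.contains('.com', regex=False, na=False)

print(is_gmail.tolist())     # [True, False, False]
print(has_dot_com.tolist())  # [True, True, False]
```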


🔄 4. Replacing Substrings

👉 Using replace()

df['Email_provider'] = df['Email'].str.replace('@gmail.com', '@googlemail.com', regex=False)

👉 Regex-based replacement

df['Review_clean'] = df['Review'].str.replace(r'[!?.]', '', regex=True)

This removes exclamation marks, question marks, and periods from the review text.
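Whether the pattern is read as literal text or as a regular expression changes the result, so it is safest to pass regex explicitly. A minimal sketch on a hypothetical Series `s`:

```python
import pandas as pd

# Hypothetical values chosen so the two modes give different results
s = pd.Series(['a.b', 'axb'])

# Literal mode: 'a.' matches only the letter a followed by a real dot
literal = s.str.replace('a.', '-', regex=False)

# Regex mode: '.' is a wildcard, so 'a.' also matches 'ax'
pattern = s.str.replace('a.', '-', regex=True)

print(literal.tolist())  # ['-b', 'axb']
print(pattern.tolist())  # ['-b', '-b']
```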


📏 5. Measuring String Length

👉 Using len()

df['Name_length'] = df['Name'].str.len()

Output:

0    13.0
1     9.0
2    13.0
3    11.0
4     NaN

✂️ 6. Extracting Substrings

👉 Using slice() or index ranges

df['First_5_chars'] = df['Email'].str.slice(0, 5)

👉 Extracting domain names

df['Domain'] = df['Email'].str.split('@').str[1]

Output:

0      gmail.com
1      yahoo.com
2    outlook.com
3      gmail.com
4    hotmail.com
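The "index ranges" mentioned above refer to Python-style slicing on the .str accessor. A short sketch, using a hypothetical two-row `emails` Series, shows that .str[:5] is equivalent to .str.slice(0, 5) and that negative indices count from the end.

```python
import pandas as pd

# Hypothetical subset of the email column
emails = pd.Series(['alice@gmail.com', 'bob_smith@yahoo.com'])

first_five = emails.str[:5]              # same result as .str.slice(0, 5)
last_four = emails.str[-4:]              # negative indices count from the end
domain = emails.str.split('@').str[-1]   # last piece after splitting on '@'

print(first_five.tolist())  # ['alice', 'bob_s']
print(last_four.tolist())   # ['.com', '.com']
print(domain.tolist())      # ['gmail.com', 'yahoo.com']
```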

🎯 7. Extracting Data with Regular Expressions

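The .str.extract() method returns the first match of each regex capture group, which makes it well suited to structured text. With a single group, pass expand=False to get a Series back instead of a one-column DataFrame.

Example 1: Extracting the email username

df['Username'] = df['Email'].str.extract(r'(^[\w\.-]+)', expand=False)

Output:

0        alice
1    bob_smith
2      charlie
3        david
4          eve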

Example 2: Extract domain without “.com”

df['Provider'] = df['Email'].str.extract(r'@([a-z]+)', expand=False)

Output:

0      gmail
1      yahoo
2    outlook
3      gmail
4    hotmail
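Both pieces can also be pulled out in one pass. The sketch below assumes a hypothetical `emails` Series and uses named capture groups, so extract() returns a DataFrame whose columns take the group names (`user` and `provider` are illustrative names). Rows that do not match get missing values.

```python
import pandas as pd

# Hypothetical data, including one value that is not an email address
emails = pd.Series(['alice@gmail.com', 'bob_smith@yahoo.com', 'not-an-email'])

# Two named groups -> a DataFrame with columns 'user' and 'provider'
parts = emails.str.extract(r'^(?P<user>[\w.-]+)@(?P<provider>[a-z]+)')

print(parts)
```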

🧱 8. Concatenating Strings

Combine multiple columns or add custom text.

df['Full_Info'] = df['Name'].str.title() + ' (' + df['Email'] + ')'

Output:

0        Alice Johnson (alice@gmail.com)
1        Bob Smith (bob_smith@yahoo.com)
2    Charlie Adams (charlie@outlook.com)
3          David Jones (david@gmail.com)
4                                    NaN

Row 4 is NaN because concatenating with a missing name yields a missing value.

You can also use .str.cat():

df['Name_City'] = df['Name'].str.cat(df['City'], sep=' - ')
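By default, .str.cat() returns a missing value whenever either side is missing. The sketch below, on hypothetical `names` and `cities` Series, shows how the na_rep argument substitutes a placeholder instead.

```python
import pandas as pd

# Hypothetical data: the second name is missing
names = pd.Series(['Alice Johnson', None])
cities = pd.Series(['New York', 'seattle'])

# Default: a missing input produces a missing result
default = names.str.cat(cities, sep=' - ')

# na_rep: missing inputs are replaced by the given placeholder text
filled = names.str.cat(cities, sep=' - ', na_rep='?')

print(filled.tolist())  # ['Alice Johnson - New York', '? - seattle']
```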

🔢 9. Counting Substrings

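df['count_a'] = df['Name'].str.lower().str.count('a')

Output:

0    1.0
1    0.0
2    3.0
3    1.0
4    NaN

Lowercasing first makes the count case-insensitive, since str.count() has no case argument. The values are floats because the missing name produces NaN.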
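Note that str.count() treats its pattern as a regular expression and accepts regex flags rather than a case argument. A minimal sketch, assuming a hypothetical `names` Series without missing values:

```python
import re
import pandas as pd

# Hypothetical subset of the name column
names = pd.Series(['Alice Johnson', 'bob smith', 'CHARLIE ADAMS'])

# re.IGNORECASE makes 'a' match 'A' as well
count_a = names.str.count('a', flags=re.IGNORECASE)

# A character class counts several letters at once (here: all vowels)
count_vowels = names.str.count(r'[aeiou]', flags=re.IGNORECASE)

print(count_a.tolist())       # [1, 0, 3]
print(count_vowels.tolist())  # [5, 2, 5]
```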

📚 10. Checking String Patterns

Pandas provides many boolean-check string methods:

Method       | Description                              | Example
startswith() | Checks if a string starts with a pattern | df['Email'].str.startswith('a')
endswith()   | Checks if a string ends with a pattern   | df['Email'].str.endswith('.com')
isalpha()    | True if all characters are alphabetic    | df['Name'].str.replace(' ', '').str.isalpha()
isdigit()    | True if all characters are digits        | df['Name'].str.isdigit()
isalnum()    | True if all characters are alphanumeric  | df['Name'].str.replace(' ', '').str.isalnum()
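These checks return boolean Series, so they can be combined with | and & and used directly as row filters. The sketch below uses hypothetical `emails` and `codes` Series to illustrate.

```python
import pandas as pd

# Hypothetical data for illustration
emails = pd.Series(['alice@gmail.com', 'bob_smith@yahoo.com', 'eve@hotmail.org'])
codes = pd.Series(['12345', '12a45'])

# Combine two checks with | (element-wise OR) and filter with the mask
mask = emails.str.startswith('a') | emails.str.endswith('.org')
selected = emails[mask]

# isdigit() is True only when every character is a digit
all_digits = codes.str.isdigit()

print(selected.tolist())    # ['alice@gmail.com', 'eve@hotmail.org']
print(all_digits.tolist())  # [True, False]
```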

🧹 11. Splitting and Expanding Text Columns

👉 Split string into multiple columns

df[['First_Name', 'Last_Name']] = df['Name'].str.split(' ', n=1, expand=True)

Output:

    First_Name  Last_Name
0        Alice    Johnson
1          bob      smith
2      CHARLIE      ADAMS
3        David      Jones
4         None        NaN

👉 Access parts of the split

df['City_First_Word'] = df['City'].str.split().str[0]
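When a name has more than two words, the direction of the split matters. As a sketch on a hypothetical `names` Series: split() with n=1 cuts at the first space, while rsplit() with n=1 cuts at the last one.

```python
import pandas as pd

# Hypothetical names, one of them with a middle name
names = pd.Series(['Mary Ann Smith', 'David Jones'])

# Split from the left: everything after the first space lands in column 1
left = names.str.split(' ', n=1, expand=True)

# Split from the right: only the final word is separated
right = names.str.rsplit(' ', n=1, expand=True)

print(left.iloc[0].tolist())   # ['Mary', 'Ann Smith']
print(right.iloc[0].tolist())  # ['Mary Ann', 'Smith']
```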

🔠 12. String Alignment

👉 Left, right, and center alignment

df['Left_align'] = df['City'].str.ljust(15, '*')
df['Right_align'] = df['City'].str.rjust(15, '*')
df['Center_align'] = df['City'].str.center(15, '*')

This is especially useful for formatting reports.
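To make the padding visible, here is a short sketch on a hypothetical two-city Series with a width of 12. Strings already at or beyond the width would be returned unchanged.

```python
import pandas as pd

# Hypothetical subset of the city column
cities = pd.Series(['Houston', 'New York'])

left_aligned = cities.str.ljust(12, '*')   # pad on the right
right_aligned = cities.str.rjust(12, '*')  # pad on the left
centered = cities.str.center(12, '*')      # pad on both sides

print(left_aligned.tolist())   # ['Houston*****', 'New York****']
print(right_aligned.tolist())  # ['*****Houston', '****New York']
print(centered.tolist())
```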


🧠 13. Handling Missing Values in String Columns

Pandas string functions propagate missing values (a missing input gives a missing output) instead of raising an error, but you can fill them explicitly:

df['Name_filled'] = df['Name'].fillna('Unknown')
df['Name_clean'] = df['Name_filled'].str.strip().str.title()
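The difference between the two approaches can be seen on a tiny example. In this sketch (the `names` Series is hypothetical), the missing entry stays missing through a chain of .str calls unless it is filled first.

```python
import pandas as pd

# Hypothetical data: one padded name and one missing value
names = pd.Series([' alice ', None])

# Without fillna(): the missing value passes through every .str step
cleaned = names.str.strip().str.title()

# With fillna(): the placeholder is cleaned like any other string
filled = names.fillna('Unknown').str.strip().str.title()

print(cleaned.isna().tolist())  # [False, True]
print(filled.tolist())          # ['Alice', 'Unknown']
```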

📊 14. Practical Example – Cleaning Customer Data

Let’s combine what we learned:

customers = pd.DataFrame({
    'FullName': [' john doe ', 'MARY SMITH', '   Alice Johnson', None],
    'Email': ['JOHN@GMAIL.COM', 'mary@Yahoo.Com', 'alice@outlook.com', '']
})

customers['FullName'] = customers['FullName'].str.strip().str.title()
customers['Email'] = customers['Email'].str.lower()
customers['Username'] = customers['Email'].str.extract(r'(^[\w\.-]+)', expand=False)
customers['Provider'] = customers['Email'].str.extract(r'@([a-z]+)', expand=False)

Result:

        FullName              Email  Username  Provider
0       John Doe     john@gmail.com      john     gmail
1     Mary Smith     mary@yahoo.com      mary     yahoo
2  Alice Johnson  alice@outlook.com     alice   outlook
3           None                          NaN       NaN

✅ Clean and ready-to-analyze customer dataset.
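If the same cleanup is needed in more than one place, the steps can be wrapped in a function. The following is only a sketch: `clean_customers` is a hypothetical helper name, and it assumes the input has 'FullName' and 'Email' columns as in the example above.

```python
import pandas as pd

def clean_customers(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; expects 'FullName' and 'Email' columns."""
    out = raw.copy()  # leave the caller's DataFrame untouched
    out['FullName'] = out['FullName'].str.strip().str.title()
    out['Email'] = out['Email'].str.lower()
    # Named groups become the column names of the extracted DataFrame
    parts = out['Email'].str.extract(r'^(?P<Username>[\w.-]+)@(?P<Provider>[a-z]+)')
    return out.join(parts)

raw = pd.DataFrame({
    'FullName': [' john doe ', 'MARY SMITH', None],
    'Email': ['JOHN@GMAIL.COM', 'mary@Yahoo.Com', ''],
})
cleaned = clean_customers(raw)
print(cleaned)
```

Working on a copy keeps the raw data available for comparison after cleaning.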


🧮 15. Combining with apply() for Custom Operations

You can combine string functions with Python lambdas for flexibility.

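df['Initials'] = df['Name'].apply(lambda x: ''.join(n[0].upper() for n in x.split()) if isinstance(x, str) else '')

Output:

0    AJ
1    BS
2    CA
3    DJ
4

The isinstance() check gives a missing name an empty string. Calling str(x) on a missing value would instead turn it into the text 'None' and produce the bogus initial N.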
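The same initials can also be built without apply(), using only .str methods, in which case a missing name simply stays missing. A sketch on a hypothetical `names` Series:

```python
import pandas as pd

# Hypothetical data including one missing name
names = pd.Series(['Alice Johnson', 'bob smith', None])

initials = (
    names.str.findall(r'\b\w')  # first character of every word, as a list
         .str.join('')          # glue each list back into one string
         .str.upper()
)

print(initials.tolist()[:2])  # ['AJ', 'BS']
```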

🪄 16. Using str.extractall() for Multiple Matches

Suppose you have multiple numbers in a string:

data = pd.DataFrame({'Text': ['Order 123 and 456', 'ID 789 only']})
matches = data['Text'].str.extractall(r'(\d+)')
print(matches)

Output:

          0
match
0 0     123
  1     456
1 0     789

Each match is stored as a separate row with an additional index level (match).
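To get back to one entry per original row, the matches can be grouped on the first index level. The sketch below uses a hypothetical `orders` DataFrame and also shows str.findall(), which returns the lists directly and keeps rows that have no match.

```python
import pandas as pd

# Hypothetical data; the last row contains no digits
orders = pd.DataFrame({'Text': ['Order 123 and 456', 'ID 789 only', 'no digits']})

matches = orders['Text'].str.extractall(r'(\d+)')

# Group on the original row index (level 0) to get one list per row;
# rows without any match do not appear in the result
per_row = matches[0].groupby(level=0).agg(list)

# findall() keeps every row and uses an empty list when nothing matches
found = orders['Text'].str.findall(r'\d+')

print(per_row.tolist())  # [['123', '456'], ['789']]
print(found.tolist())    # [['123', '456'], ['789'], []]
```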


🧰 17. Combining String Functions with groupby()

You can group text-based data efficiently:

df['Provider'] = df['Email'].str.extract(r'@([a-z]+)', expand=False)
provider_count = df.groupby('Provider').size()
print(provider_count)

Output:

Provider
gmail      2
hotmail    1
outlook    1
yahoo      1
dtype: int64

🧩 18. String Aggregation

Use ', '.join() inside groupby().apply() to collect the names in each group into one string:

names_joined = df.groupby('Provider')['Name'].apply(lambda x: ', '.join(x.dropna()))
print(names_joined)

Output:

Provider
gmail      Alice Johnson, David Jones
hotmail
outlook                 CHARLIE ADAMS
yahoo                       bob smith

The hotmail entry is an empty string because its only name is missing and is dropped before joining.
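An alternative sketch drops the rows with missing names before grouping, so groups with no usable names do not appear at all. The small `df` below is a hypothetical stand-in with the same 'Provider' and 'Name' columns.

```python
import pandas as pd

# Hypothetical stand-in for the chapter's DataFrame
df = pd.DataFrame({
    'Provider': ['gmail', 'yahoo', 'gmail', 'hotmail'],
    'Name': ['Alice Johnson', 'bob smith', 'David Jones', None],
})

joined = (
    df.dropna(subset=['Name'])      # drop rows with a missing name first
      .groupby('Provider')['Name']
      .agg(', '.join)               # one comma-separated string per group
)

print(joined.to_dict())
# {'gmail': 'Alice Johnson, David Jones', 'yahoo': 'bob smith'}
```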

📖 19. Summary Table of Common Pandas String Functions

Category   | Function                                            | Description
Case       | .str.lower(), .str.upper(), .str.title()            | Change letter casing
Whitespace | .str.strip(), .str.lstrip(), .str.rstrip()          | Remove spaces
Search     | .str.contains(), .str.startswith(), .str.endswith() | Search patterns
Replace    | .str.replace()                                      | Replace substrings
Split/Join | .str.split(), .str.cat()                            | Split or concatenate
Extract    | .str.extract(), .str.extractall()                   | Extract substrings via regex
Length     | .str.len()                                          | Compute string length
Count      | .str.count()                                        | Count occurrences
Validation | .str.isalpha(), .str.isdigit(), .str.isalnum()      | Check character types

🧾 20. Conclusion

Text data cleaning is one of the most time-consuming parts of data science projects. Fortunately, Pandas string functions make this process smooth, efficient, and highly expressive.
With the .str accessor, you can apply transformations, extract structured data from unstructured text, and prepare textual data for analytics and machine learning tasks.

Remember:

  • Use vectorized methods (not loops) for performance.
  • Combine regex and .str.extract() for structured pattern extraction.
  • Always handle missing values and normalize inconsistent casing before analysis.
