Introduction to Python Regular Expressions
Text processing is one of the most common tasks in programming. Regular expressions are an indispensable tool for many of these problems, including parsing HTML pages, validating data formats, and processing logs.
Python uses the built-in re module for working with regular expressions. In this guide, we'll dive deep into using regular expressions in Python. We'll cover popular methods like re.search() and re.sub(), and show their practical application with real-world examples.
Regular Expression Basics
What Are Regular Expressions?
Regular expressions (RegEx) are a special language for describing search patterns in text. They allow you to search, validate, and replace text fragments based on specific rules.
Common Use Cases for Regular Expressions
Regular expressions in Python solve a wide range of tasks:
- Validating email addresses and phone numbers
- Finding all numeric values in text
- Replacing unwanted characters and cleaning data
- Extracting specific words or phrases from large texts
- Parsing structured data
- Processing logs and system files
Getting Started
Importing the re Module
Before working with regular expressions, you need to import the dedicated module:
import re
Basic Symbols and Constructs
To work effectively with regular expressions, it's important to know the basic symbols and their meanings:
| Symbol | Meaning |
|---|---|
| . | Any character except newline |
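| \d | Any digit (0-9; in str patterns also other Unicode decimal digits unless re.ASCII is set) |
| \D | Any non-digit character |
| \w | Letter, digit, or underscore (Unicode-aware in str patterns unless re.ASCII is set) |
| \W | Any character except \w |
| \s | Any whitespace character (space, tab, newline, and similar) |
| \S | Any non-whitespace character |
| ^ | Start of string (or of each line with re.MULTILINE) |
| $ | End of string, or just before a trailing newline |
| [] | Any one character from the specified set |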
| * | Zero or more repetitions |
| + | One or more repetitions |
| {n,m} | Between n and m repetitions |
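To make the table concrete, here is a minimal sketch that exercises several of these symbols with re.findall() and re.search(). The sample strings and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, used only for this illustration
sample = "Order 42: 3 apples, id_7\n"

digits = re.findall(r"\d+", sample)            # runs of digits
words = re.findall(r"\w+", sample)             # runs of letters, digits, underscores
starts = re.search(r"^Order", sample)          # ^ anchors at the start of the string
ends = re.search(r"id_7$", sample)             # $ also matches just before a trailing newline
dot = re.findall(r"a.c", "abc a\nc a-c")       # . does not match the newline in "a\nc"
star = re.findall(r"ab*", "a ab abb")          # * allows zero repetitions of "b"
plus = re.findall(r"ab+", "a ab abb")          # + requires at least one "b"
ranged = re.findall(r"\d{2,3}", "7 42 123 4567")  # two or three digits at a time
vowels = re.findall(r"[aeiou]", "apples")      # any one character from the set

print(digits)   # ['42', '3', '7']
print(words)    # ['Order', '42', '3', 'apples', 'id_7']
print(dot)      # ['abc', 'a-c']
print(star)     # ['a', 'ab', 'abb']
print(plus)     # ['ab', 'abb']
print(ranged)   # ['42', '123', '456']
print(vowels)   # ['a', 'e']
```

Note that the patterns are written as raw strings (the r prefix), so Python does not interpret the backslashes before the re module sees them.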
Core Methods for Working with Regular Expressions
Using re.search() to Find the First Match
The re.search() function scans a string for the first location where the pattern matches. It returns a Match object if a match is found, or None otherwise.
import re
text = "Email: example@mail.com"
match = re.search(r'\w+@\w+\.\w+', text)
if match:
    print("Found email:", match.group())
Output:
Found email: example@mail.com
Breaking Down the Search Pattern
Let's analyze the pattern used, piece by piece:
- \w+ : one or more letters, digits, or underscores (the part before the @)
- @ : the at symbol (required part of an email)
- \w+ : the domain name
- \. : a literal dot (escaped with a backslash)
- \w+ : the top-level domain
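The same pattern can also be written with the re.VERBOSE flag, which lets each piece carry its own comment. This is a sketch for illustration: the name email_pattern and the second test address are assumptions, not part of the original example.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# In verbose mode, whitespace and "#" comments inside the pattern are ignored.
email_pattern = re.compile(
    r"""
    \w+   # part before the @: letters, digits, or underscores
    @     # literal at sign
    \w+   # domain name
    \.    # literal dot
    \w+   # top-level domain
    """,
    re.VERBOSE,
)

match = email_pattern.search("Email: example@mail.com")
found = match.group() if match else None
print(found)  # example@mail.com

# The pattern is deliberately simple: \w does not match dots or hyphens,
# so only part of a dotted address is captured.
partial = email_pattern.search("first.last@mail.com")
partial_text = partial.group() if partial else None
print(partial_text)  # last@mail.com
```

The second result shows why this pattern is best treated as a teaching example rather than a complete email validator.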
Using re.findall() to Find All Matches
When you need to find every occurrence of a pattern in text, use re.findall(). It returns a list of all non-overlapping matches as strings, or an empty list if nothing matches.
text = "Prices: 100 dollars, 250 dollars, 350 dollars"
numbers = re.findall(r'\d+', text)
print(numbers) # ['100', '250', '350']
Using re.sub() for Pattern-Based Replacement
The re.sub() function replaces every non-overlapping occurrence of a pattern with a replacement string and returns the result as a new string; the original string is left unchanged.
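As a minimal sketch of this behavior, the snippet below strips non-digit characters from a phone number and collapses repeated whitespace. The input strings and variable names are hypothetical.

```python
import re

# Hypothetical inputs, used only for this illustration
phone = "Phone: 555-123-4567"
messy = "too    many   spaces"

# Replace every non-digit character with an empty string
digits_only = re.sub(r"\D", "", phone)

# Replace each run of whitespace with a single space
collapsed = re.sub(r"\s+", " ", messy)

print(digits_only)  # 5551234567
print(collapsed)    # too many spaces
print(phone)        # Phone: 555-123-4567 (the original string is unchanged)
```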