Python Regular Expressions
Regular expressions are a powerful way to match text patterns in Python. Python's built-in "re" module provides regular expression support. re.search() takes a pattern, scans the text, and returns a Match object for the first match. If the pattern is not found, it returns None.
# Searching for Basic patterns :
str1 = "This is some text which is having phone number 908-897-1234. This is the phone number you can find in phone directory"
'phone' in str1
# output : True
# re : Python regular expressions
import re
pattern = 'phone'
re.search(pattern,str1)
# Output : < re.Match object; span=(34, 39), match='phone' >
# re : get the index of start and end
import re
pattern = 'phone'
match = re.search(pattern,str1)
match.span() # Output : (34, 39)
match.start() # Output : 34
match.end() # Output : 39
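Because re.search() returns None when nothing matches, calling .span() or .group() directly on the result can raise an AttributeError. A minimal sketch of a guard follows; the helper name find_span and the shortened sample string are assumptions made for illustration.

```python
import re

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if there is no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span()

sample = "This is some text which is having phone number 908-897-1234."
found = find_span("phone", sample)
missing = find_span("fax", sample)
print(found)    # (34, 39)
print(missing)  # None
```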
re.search(pattern, text) only matches the first occurrence. If we want a list of all the matches, we can use the re.findall() method:
# re.findall()
matches = re.findall('phone', str1)
matches # Output : ['phone', 'phone', 'phone']
len(matches) # Output : 3
# re.finditer() : To get actual match objects, use the iterator
for match in re.finditer('phone',str1):
    print(match)
    print(match.span())
    print(match.group())
# Output :
< re.Match object; span=(34, 39), match='phone' >
(34, 39)
phone
< re.Match object; span=(73, 78), match='phone' >
(73, 78)
phone
< re.Match object; span=(102, 107), match='phone' >
(102, 107)
phone
Identifiers for Characters in Patterns
Character classes such as digits or whitespace are represented by special codes called identifiers. We can combine identifiers to build up a pattern string.
Notice how these make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:
r'mypattern'
Placing the r in front of the string creates a raw string, so Python does not treat the \ in the pattern as the start of a string escape sequence.
Character | Description | Example Pattern code | Example Match |
---|---|---|---|
\d | A digit | sometext_\d\d | sometext_43 |
\D | A non digit | \D\D\D | ABC |
\w | Alphanumeric (letters, digits, and underscore) | \w-\w\w\w | A-b_1 |
\W | Non-alphanumeric | \W\W\W\W\W | *-+=) |
\s | White space | a\sb\sc | a b c |
\S | Non-whitespace | \S\S\S\S | Yolo |
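The identifiers in the table can be tried out directly. The following is a small sketch rather than a complete reference; the sample string is made up for illustration and simply combines the example matches from the table.

```python
import re

# A raw string keeps the backslash, so r'\d' is the same two characters as '\\d'.
same = (r'\d' == '\\d')
print(same)  # True

# Hypothetical sample text built from the table's example matches
sample = "sometext_43 A-b_1 a b c"

digits = re.findall(r'\d', sample)
print(digits)  # ['4', '3', '1']

two_digits = re.search(r'sometext_\d\d', sample).group()
print(two_digits)  # sometext_43

word_chars = re.search(r'\w-\w\w\w', sample).group()
print(word_chars)  # A-b_1

spaced = re.search(r'a\sb\sc', sample).group()
print(spaced)  # a b c
```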
Quantifiers :
Character | Description | Example Pattern code | Example Match |
---|---|---|---|
+ | Occurs one or more times | Version \w-\w+ | Version A-b1_1 |
\* | Occurs zero or more times | A\*B\*C\* | AAACC |
? | Once or none | plurals? | plural |
{3} | Occurs exactly 3 times | \D{3} | abc |
{2,4} | Occurs 2 to 4 times | \d{2,4} | 123 |
{3,} | Occurs 3 or more | \w{3,} | anycharacters |
# Example : Identifiers and Quantifiers
text = "This is file_05072022 having these_!@#$%^&*() special character and phone number 987-765-2345"
match = re.search(r'file_\d\d\d\d\d\d\d\d',text)
match1 = re.search(r'these_\W\W\W\W\W',text)
print(match.group()) # Output : file_05072022
print(match1.group()) # Output : these_!@#$%
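The repeated identifiers above can be written more compactly with quantifiers. This sketch reuses the same text and should produce the same matches; the variable names are chosen here for illustration.

```python
import re

text = "This is file_05072022 having these_!@#$%^&*() special character and phone number 987-765-2345"

# {8} and {5} replace eight copies of \d and five copies of \W
file_match = re.search(r'file_\d{8}', text)
special_match = re.search(r'these_\W{5}', text)
print(file_match.group())     # file_05072022
print(special_match.group())  # these_!@#$%

# + takes the whole run of non-word characters, including the trailing space
run_match = re.search(r'these_\W+', text)
print(repr(run_match.group()))  # 'these_!@#$%^&*() '
```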
Groups :
We can use parentheses to split a pattern into groups, so that each part of a match can be retrieved separately later.
# Groups
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
results = re.search(phone_pattern,text)
results.group() # Output : '987-765-2345'
results.group(1) # Output : '987'
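Beyond group(1), the remaining groups can be read the same way, and groups() returns all of them at once as a tuple. A short sketch, assuming the same text as above; the names area, exchange, and line are illustrative only.

```python
import re

text = "This is file_05072022 having these_!@#$%^&*() special character and phone number 987-765-2345"

phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
results = phone_pattern.search(text)  # same result as re.search(phone_pattern, text)

print(results.group(2))  # 765
print(results.group(3))  # 2345
print(results.groups())  # ('987', '765', '2345')

# Unpack the three groups into separate variables
area, exchange, line = results.groups()
print(area, exchange, line)  # 987 765 2345
```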
Additional Regex Syntax
Or Operator |
Use the pipe operator "|" to match any one of several alternatives.
# or "|"
results = re.search('I|We|He|She|They',"They can do this !!!")
print(results) # Output : < re.Match object; span=(0, 4), match='They' >
The Wildcard Character
We can use a "wildcard" as a placeholder that matches any single character (except a newline) in that position. A simple period . serves this purpose.
# "." Character
re.findall(r"...at","The cat in the hat sat here over the mat.") # Note : One .(dot) will match one character
# Output : ['e cat', 'e hat', 'e mat']
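Notice that "sat" is missing from the output above. The sketch below, using the same sentence, suggests why: matches returned by re.findall() do not overlap. It also shows one possible way to capture whole words instead, using \S+ from the identifier and quantifier tables.

```python
import re

sentence = "The cat in the hat sat here over the mat."

# One dot per character: '.at' needs exactly one character before "at".
one_dot = re.findall(r'.at', sentence)
print(one_dot)  # ['cat', 'hat', 'sat', 'mat']

# '...at' needs three characters before "at". Matches cannot overlap, so once
# 'e hat' is consumed only two unused characters (' s') remain before the
# "at" in "sat", and it is skipped.
three_dots = re.findall(r'...at', sentence)
print(three_dots)  # ['e cat', 'e hat', 'e mat']

# \S+ (one or more non-whitespace characters) keeps each match inside one word.
whole_words = re.findall(r'\S+at', sentence)
print(whole_words)  # ['cat', 'hat', 'sat', 'mat']
```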
Starts with and Ends With
We can use ^ to match at the start of a string, and $ to match at the end.
# Ends with a number
re.findall(r'\d$','This ends with a number 2') # Output : ['2']
# starts with a number
re.findall(r'^\d','1 is the loneliest number.') # Output : ['1']
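Using ^ and $ together requires the pattern to cover the string from start to end, which is handy for simple format checks. A minimal sketch; the helper name looks_like_phone and the test strings are assumptions made for illustration.

```python
import re

def looks_like_phone(candidate):
    """Return True if the string consists of a 3-3-4 digit phone pattern."""
    return re.search(r'^\d{3}-\d{3}-\d{4}$', candidate) is not None

exact = looks_like_phone("987-765-2345")
embedded = looks_like_phone("call 987-765-2345 now")
print(exact)     # True
print(embedded)  # False
```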
Exclusion
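# Exclusion : [^...] matches any character that is NOT listed inside the brackets
str1 = "there are 3 numbers 34 inside 5 this sentence."
re.findall(r'[^\d]',str1) # Output : every non-digit character as a separate list item
re.findall(r'[^\d]+',str1) # Note : + joins consecutive non-digit characters into a single match.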
# Output : ['there are ', ' numbers ', ' inside ', ' this sentence.']
# remove punctuation from a sentence.
str2 = 'This is a string! But it has punctuation. How can we remove it?'
re.findall('[^!.? ]+',str2)
clean = ' '.join(re.findall('[^!.? ]+',str2))
print(clean) # Output : This is a string But it has punctuation How can we remove it
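As an alternative to joining the pieces returned by re.findall(), the punctuation can be replaced with an empty string using re.sub(), which is also part of the re module. This is a different technique from the one above, shown here only as a sketch on the same sentence.

```python
import re

str2 = 'This is a string! But it has punctuation. How can we remove it?'

# Replace every '!', '.' or '?' with an empty string; spaces are left alone
clean = re.sub(r'[!.?]', '', str2)
print(clean)  # This is a string But it has punctuation How can we remove it
```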
Brackets for Grouping
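# Assumed sample text containing hyphenated words
text = 'Only find the hypen-words in this sentence. But you do not know how long-ish they are'
# [\w]+ matches a run of word characters on each side of the hyphen
re.findall(r'[\w]+-[\w]+',text)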
# Output : ['hypen-words', 'long-ish']
Parentheses for Multiple Options
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"
re.search(r'cat(fish|nap|claw)',text) # Output : < re.Match object; span=(27, 34), match='catfish' >
re.search(r'cat(fish|nap|claw)',texttwo) # Output : < re.Match object; span=(32, 38), match='catnap' >
re.search(r'cat(fish|nap|claw)',textthree) # Output : None
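Because the options sit inside parentheses, they also form a group, so group(1) tells us which option matched. A small sketch using the three sentences above; the variable names are chosen for illustration.

```python
import re

text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

pattern = re.compile(r'cat(fish|nap|claw)')

suffixes = []
for sentence in (text, texttwo, textthree):
    match = pattern.search(sentence)
    # group(1) is the option that matched; there is no group when match is None
    suffixes.append(match.group(1) if match else None)

print(suffixes)  # ['fish', 'nap', None]
```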