Learning Spirit

Python Regular Expressions

Regular expressions are a powerful technique for matching text patterns in Python. Python's "re" module provides support for regular expressions. re.search() takes a pattern, scans the text, and returns a Match object for the first match. If the pattern is not found, it returns None.
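
Because re.search() can return None, calling a method such as .span() directly on its result may raise an AttributeError when nothing matches. A minimal sketch of the usual guard follows; the sample string and variable names are made up for illustration and are not part of the examples below.

```python
import re

sample = "Call 908-897-1234 for details"   # illustrative string (assumption)

found = re.search("Call", sample)      # pattern occurs -> Match object
missing = re.search("email", sample)   # pattern absent -> None

for label, result in (("found", found), ("missing", missing)):
    if result is not None:
        print(label, "->", result.span())
    else:
        print(label, "-> no match")
```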

	
#  Searching for Basic patterns :  

	str1 = "This is some text which is having phone number 908-897-1234. This is the phone number you can find in phone directory"
	
	'phone' in str1
	
	# output : True

	
#  re : Python regular expressions 

	import re
	
	pattern = 'phone'
	re.search(pattern,str1)

	# Output : <re.Match object; span=(34, 39), match='phone'>

	
#  re : get the index of start and end 

	import re
	
	pattern = 'phone'
	match = re.search(pattern,str1)
	match.span()	# Output : (34, 39)
	match.start()	# Output : 34
	match.end()	# Output : 39

re.search(pattern, text) only matches the first occurrence. If we want a list of all the matches, we can use the re.findall() method:

	
#  re.findall() 

	matches = re.findall('phone', str1)
	matches		# Output : ['phone', 'phone', 'phone']
	len(matches) 	# Output : 3

	
#  re.finditer() : To get actual match objects, use the iterator 

	for match in re.finditer('phone',str1):
		print(match)
		print(match.span())
		print(match.group())

	# Output :
	<re.Match object; span=(34, 39), match='phone'>
	(34, 39)
	phone
	<re.Match object; span=(73, 78), match='phone'>
	(73, 78)
	phone
	<re.Match object; span=(102, 107), match='phone'>
	(102, 107)
	phone

Identifiers for Characters in Patterns

Character types such as digits or whitespace have special codes that represent them. We can use these identifiers to build up a pattern string. Notice how they make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:
r'mypattern'

Placing the r in front of the string tells Python to treat it as a raw string, so the \ characters in the pattern are not interpreted as string escape sequences.
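
To make the effect of the r prefix concrete, here is a small sketch comparing an ordinary string with a raw string. The variable names and the "room 42" sample are illustrative assumptions.

```python
import re

plain = "\\d\\d"    # ordinary string: each backslash has to be doubled
raw = r"\d\d"       # raw string: backslashes are kept exactly as typed

same_pattern = (plain == raw)
print(same_pattern)

# "\b" in an ordinary string is a single backspace character,
# while r"\b" is two characters that the regex engine reads as a word boundary.
print(len("\b"), len(r"\b"))

two_digits = re.search(raw, "room 42").group()
print(two_digits)
```

Without the prefix, every backslash in a pattern would have to be doubled, which quickly becomes hard to read.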

Character	Description	Example Pattern code	Example Match
\d	A digit	sometext_\d\d	sometext_43
\D	A non digit	\D\D\D	ABC
\w	Alphanumeric or underscore	\w-\w\w\w	A-b_1
\W	Non-alphanumeric	\W\W\W\W\W	*-+=)
\s	White space	a\sb\sc	a b c
\S	Non-whitespace	\S\S\S\S	Yolo
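
One way to check identifier patterns like these is re.fullmatch(), which succeeds only if the whole string fits the pattern. The sketch below uses pattern/example pairs modelled on the table; the rows list and results dictionary are illustrative names.

```python
import re

# (pattern, example) pairs modelled on the identifier table
rows = [
    (r"sometext_\d\d", "sometext_43"),
    (r"\D\D\D", "ABC"),
    (r"\w-\w\w\w", "A-b_1"),      # the underscore counts as a \w character
    (r"\W\W\W\W\W", "*-+=)"),     # five non-alphanumeric characters
    (r"a\sb\sc", "a b c"),
    (r"\S\S\S\S", "Yolo"),
]

results = {}
for pattern, example in rows:
    # fullmatch returns a Match object only if the entire example fits
    results[pattern] = re.fullmatch(pattern, example) is not None
    print(pattern, "matches", repr(example), ":", results[pattern])
```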

Quantifiers :

Character	Description	Example Pattern code	Example Match
+	Occurs one or more times	Version \w-\w+	Version A-b1_1
*	Occurs zero or more times	A*B*C*	AAACC
?	Once or none	plurals?	plural
{3}	Occurs exactly 3 times	\D{3}	abc
{2,4}	Occurs 2 to 4 times	\d{2,4}	123
{3,}	Occurs 3 or more	\w{3,}	anycharacters
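
The following sketch shows how a few of these quantifiers behave in practice. The input strings and variable names are assumptions chosen for illustration.

```python
import re

digits = "123456"

# {2,4} is greedy: it takes as many digits as allowed (here 4)
greedy = re.search(r"\d{2,4}", digits).group()
print(greedy)

# {3} takes exactly three digits per match
chunks = re.findall(r"\d{3}", digits)
print(chunks)

# * allows zero occurrences, so B* can match nothing between the As and Cs
star_ok = re.fullmatch(r"A*B*C*", "AAACC") is not None
print(star_ok)

# ? makes the trailing s optional
plurals = re.findall(r"plurals?", "plural plurals")
print(plurals)
```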

	
#  Example : Identifiers and Quantifiers  

	text = "This is file_05072022 having these_!@#$%^&*() special character and phone number 987-765-2345"

	match = re.search(r'file_\d\d\d\d\d\d\d\d',text)
	match1 = re.search(r'these_\W\W\W\W\W',text)

	print(match.group())		# Output : file_05072022
	print(match1.group())		# Output : these_!@#$%

Groups :

Groups let us wrap parts of a pattern in parentheses so that, after a match, each part can be retrieved separately.

	
#  Groups  

	phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

	results = re.search(phone_pattern,text)

	results.group() 		# Output : '987-765-2345'
	results.group(1) 		# Output : '987'
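
As a short extension of the example above, the remaining groups can be read the same way, and groups() returns all of them at once as a tuple. The shortened sample text and the names area, prefix and line are illustrative assumptions.

```python
import re

text = "phone number 987-765-2345"   # shortened sample text (assumption)
phone_pattern = re.compile(r"(\d{3})-(\d{3})-(\d{4})")

# A compiled pattern has its own search() method
results = phone_pattern.search(text)

second = results.group(2)   # second parenthesised group
third = results.group(3)    # third parenthesised group
area, prefix, line = results.groups()   # all groups as a tuple

print(second, third)
print(results.groups())
```

Group numbering starts at 1; group(0), or group() with no argument, is the whole match.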

Additional Regex Syntax

Or Operator |

Use the pipe operator "|" to match any one of several alternatives.

	
#  or "|"  

	results = re.search('I|We|He|She|They',"They can do this !!!")
	print(results)		# Output : <re.Match object; span=(0, 4), match='They'>

The Wildcard Character

We can use a "wildcard" as a placeholder that will match any character placed there (except a newline, by default). We can use a simple period . for this.

	
#  "." Character 

	re.findall(r"...at","The cat in the hat sat here over the mat.")		# Note : One .(dot) will match one character 
	# Output : ['e cat', 'e hat', 'e mat']
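
Note that 'sat' is missing from the output above: findall() returns non-overlapping matches, and the characters before 'sat' were already consumed by 'e hat'. The sketch below shows two hedged alternatives, a single dot and \S+ (one or more non-whitespace characters); the second sentence is an illustrative assumption.

```python
import re

sentence = "The cat in the hat sat here over the mat."

# One wildcard before "at": every three-letter "_at" word is found
all_at = re.findall(r".at", sentence)
print(all_at)

words = "The bat went splat"   # illustrative sentence (assumption)

# A single dot only reaches one character back, so "splat" is cut short
short = re.findall(r".at", words)
print(short)

# \S+ grabs all non-whitespace characters leading up to "at"
whole_words = re.findall(r"\S+at", words)
print(whole_words)
```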

Starts with and Ends With

We can use ^ to require that a match starts at the beginning of the string, and $ to require that it ends at the end of the string.

	
	#  Ends with a number 

	re.findall(r'\d$','This ends with a number 2')		# Output : ['2']

	
	#  starts with a number 

	re.findall(r'^\d','1 is the loneliest number.')		# Output : ['1']
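
The anchors apply to the string as a whole, not to individual words. A minimal sketch with made-up sample strings:

```python
import re

middle = "Room 1 is free"   # digit is neither first nor last (assumption)

starts = re.findall(r"^\d", middle)   # nothing: string starts with "R"
ends = re.findall(r"\d$", middle)     # nothing: string ends with "e"
print(starts, ends)

# Combining both anchors requires the whole string to be digits
only_digits = re.findall(r"^\d+$", "2024")
not_only_digits = re.findall(r"^\d+$", "2024 AD")
print(only_digits, not_only_digits)
```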

Exclusion

	
	#  Exclusion :  
	
	str1 = "there are 3 numbers 34 inside 5 this sentence."
	re.findall(r'[^\d]',str1)	# Returns every non-digit character as a separate item
	re.findall(r'[^\d]+',str1)	# Note: + joins consecutive non-digit characters into one match. 
	# Output : ['there are ', ' numbers ', ' inside ', ' this sentence.']  

#  remove punctuation from a sentence.  
	str2 = 'This is a string! But it has punctuation. How can we remove it?'
	re.findall('[^!.? ]+',str2)

	clean = ' '.join(re.findall('[^!.? ]+',str2)) 
	print(clean)	# Output :  'This is a string But it has punctuation How can we remove it'
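
An alternative that is not used in this article is re.sub(), which replaces every match with a given string. The sketch below assumes we only want to delete the three punctuation characters and keep the existing spaces; for this sentence it gives the same result as the findall/join approach.

```python
import re

str2 = 'This is a string! But it has punctuation. How can we remove it?'

# Approach from the article: collect the non-punctuation pieces and re-join them
joined = ' '.join(re.findall('[^!.? ]+', str2))

# Alternative sketch: replace each punctuation character with an empty string
cleaned = re.sub(r'[!.?]', '', str2)

print(cleaned)
print(joined == cleaned)
```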

Brackets for Grouping

		
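		#  Find hyphenated words 
	
		# Example sentence (assumed) containing two hyphenated words
		text = 'Only find the hypen-words in this sentence. But you do not know how long-ish they are'
		re.findall(r'[\w]+-[\w]+',text)
		# Output : ['hypen-words', 'long-ish']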

Parentheses for Multiple Options

		
		#  Find words that start with cat and end with one of these options: 'fish','nap', or 'claw' 
	
		text = 'Hello, would you like some catfish?'
		texttwo = "Hello, would you like to take a catnap?"
		textthree = "Hello, have you seen this caterpillar?"

		re.search(r'cat(fish|nap|claw)',text)		# Output : <re.Match object; span=(27, 34), match='catfish'> 
		re.search(r'cat(fish|nap|claw)',texttwo)		# Output : <re.Match object; span=(32, 38), match='catnap'> 
		re.search(r'cat(fish|nap|claw)',textthree)		# Output : None
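
Because the options sit inside parentheses, they also form a group, so group(1) tells us which option matched. The sketch below reuses the three sentences and guards against the None result; the names sentences and endings are illustrative.

```python
import re

pattern = re.compile(r'cat(fish|nap|claw)')

sentences = [
    'Hello, would you like some catfish?',
    'Hello, would you like to take a catnap?',
    'Hello, have you seen this caterpillar?',
]

endings = []
for sentence in sentences:
    match = pattern.search(sentence)
    # group(1) is the option that matched; None means no option fit
    endings.append(match.group(1) if match else None)

print(endings)
```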