In Python, a Regular Expression (RegEx) is a pattern used to match character combinations in strings. For example,
^s...e$
Here, we have defined a RegEx pattern. The pattern is: any five-character string starting with s and ending with e.
The RegEx pattern ^s...e$ can be used to match against strings:
- sense - match
- shade - match
- seize - match
- Sense - no match
- science - no match
- swift - no match
Example: Python RegEx
To work with RegEx in Python, we first need to import a module named re.
Let's see an example,
# import re module
import re
# regex pattern
pattern = '^s...e$'
# test string
string1 = 'shade'
string2 = 'science'
# use re.match() to match pattern
result1 = re.match(pattern, string1)
result2 = re.match(pattern, string2)
# print boolean value
print('shade:', bool(result1)) # True
print('science:', bool(result2)) # False
# Output:
# shade: True
# science: False
In the above example, we first imported a module named re and used the re.match() function to check whether the pattern matches at the beginning of each string.
Here, re.match() takes two parameters:
- pattern - the regular expression to be matched
- string1/string2 - the string in which the pattern is checked
The pattern ^s...e$ means any five-character string starting with s and ending with e. Since,
- 'shade' matches the pattern, bool() returns True
- 'science' does not match the pattern, bool() returns False
MetaCharacters in Python Regular Expression
The characters that are interpreted in a special way by a RegEx engine are metacharacters.
Here's a list of metacharacters with a short description:
Metacharacter | Description |
---|---|
[ ] | specifies a set of characters we wish to match |
. | matches any single character |
^ | checks if a string starts with a certain character |
$ | checks if a string ends with a certain character |
* | matches zero or more occurrences of the pattern left to it |
+ | matches one or more occurrences of the pattern left to it |
? | matches zero or one occurrence of the pattern left to it |
( ) | groups sub-patterns |
\ | used to escape various characters including all metacharacters |
\| | used for alternation (or operator) |
MetaCharacters Examples:
[ ] - Square Brackets
Expression | String | Match? |
---|---|---|
[xyz] | x | 1 match |
[xyz] | hey | 1 match |
[xyz] | hello | No match |
[xyz] | proxy | 2 matches |
Here, [xyz] will match if the string you are trying to match contains any of x, y, or z.
We can also specify a range of characters using - inside square brackets.
For example, [w-z] is the same as [wxyz] and similarly [1-4] is the same as [1234].
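As a quick check of the set and range syntax, the sketch below uses re.findall() (covered later in this tutorial) to list every character that a set matches. The sample strings are only illustrative.

```python
import re

# [xyz] matches each x, y, or z found in the string
set_matches = re.findall(r'[xyz]', 'proxy')
print(set_matches)    # ['x', 'y']

# [w-z] is a range equivalent to [wxyz]
range_matches = re.findall(r'[w-z]', 'proxy')
print(range_matches)  # ['x', 'y']

# [1-4] matches only the digits 1 through 4
digit_matches = re.findall(r'[1-4]', 'a1b5c3')
print(digit_matches)  # ['1', '3']
```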
. - Period
Expression | String | Match? |
---|---|---|
... | hey | 1 match |
... | python | 2 matches (contains 6 characters) |
... | a | No match |
... | sa | No match |
We can see that . matches any single character (except newline '\n').
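The behavior of . can be tried out with a short sketch. It assumes the re.DOTALL flag, which is part of the re module but not otherwise covered here, to show the newline exception being switched off.

```python
import re

# each match of ... consumes three characters
chunks = re.findall(r'...', 'python')
print(chunks)  # ['pyt', 'hon']

# . does not match a newline by default
no_newline = re.search(r'a.b', 'a\nb')
print(no_newline)  # None

# with the re.DOTALL flag, . also matches a newline
with_dotall = re.search(r'a.b', 'a\nb', re.DOTALL)
print(bool(with_dotall))  # True
```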
^ - Caret
Expression | String | Match? |
---|---|---|
^s | s | 1 match |
^s | swift | 1 match |
^s | tsunami | No match |
^s | case | No match |
Here, ^ is used to check if a string starts with a certain character.
$ - Dollar
Expression | String | Match? |
---|---|---|
s$ | s | 1 match |
s$ | kicks | 1 match |
s$ | sick | No match |
s$ | case | No match |
Above, $ checks if a string ends with a certain character or not.
* - Star
Expression | String | Match? |
---|---|---|
hel*o | heo | 1 match |
hel*o | hello | 1 match |
hel*o | hola | No match (h is not followed by e) |
hel*o | hell | No match (no o after the l's) |
Here, * matches zero or more occurrences of the pattern left to it.
+ - Plus
Expression | String | Match? |
---|---|---|
hel+o | helo | 1 match |
hel+o | hellllo | 1 match |
hel+o | hola | No match |
hel+o | heo | No match (zero occurrence) |
We can see above that + matches one or more occurrences of the pattern left to it.
? - Question Mark
Expression | String | Match? |
---|---|---|
hel?o | heo | 1 match (zero occurrence) |
hel?o | helo | 1 match (one occurrence) |
hel?o | sayhelo | 1 match |
hel?o | hello | No match (more than one occurrence) |
Here, ? matches zero or one occurrence of the pattern left to it.
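To compare the three quantifiers side by side, here is a small sketch that tests *, +, and ? against the same words. The word list is just an illustration.

```python
import re

words = ['heo', 'helo', 'hello']
patterns = [r'hel*o', r'hel+o', r'hel?o']

results = {}
for pattern in patterns:
    # one True/False per word, in the same order as the words list
    results[pattern] = [bool(re.search(pattern, word)) for word in words]
    print(pattern, results[pattern])

# hel*o [True, True, True]
# hel+o [False, True, True]
# hel?o [True, True, False]
```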
| - Alternation
Expression | String | Match? |
---|---|---|
s\|a | cat | 1 match (a in cat) |
s\|a | case | 2 matches (a and s both in case) |
s\|a | lit | No match |
s\|a | red | No match |
Here, s|a matches any string that contains either s or a.
() - Group
Expression | String | Match? |
---|---|---|
(c\|l\|t)an | can | 1 match |
(c\|l\|t)an | lan | 1 match |
(c\|l\|t)an | tan | 1 match |
(c\|l\|t)an | caan | No match |
In the above example, (c|l|t)an matches any string that contains c, l, or t followed by an.
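The group can be checked with a short sketch. It relies on match.group(1), which is explained in the Match Object section further down, to show which alternative inside the parentheses was matched.

```python
import re

pattern = r'(c|l|t)an'
matched = {}
for word in ['can', 'lan', 'tan', 'caan']:
    match = re.search(pattern, word)
    # group(1) holds the text captured by the parentheses
    matched[word] = match.group(1) if match else None
    print(word, matched[word])

# can c
# lan l
# tan t
# caan None
```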
Python Special Sequences
A special sequence is \ followed by a special character, which makes commonly used patterns easier to write.
Here's a list of special sequences with a short description:
Special Sequence | Description |
---|---|
\A | matches if the specified characters are at the start of a string |
\b | matches if the specified characters are at the beginning or end of a word |
\B | matches if the specified characters are not at the beginning or end of a word |
\d | matches any decimal digit |
\D | matches any character that is not a decimal digit |
\s | matches where a string contains any whitespace character |
\S | matches where a string contains any non-whitespace character |
\w | matches any alphanumeric character or underscore |
\W | matches any character that is not alphanumeric or an underscore |
\Z | matches if the specified characters are at the end of a string |
Special Sequence Examples:
\A
Expression | String | Match? |
---|---|---|
\Aan | an ocean | Match |
\Aan | at sea | No match |
Here, \A matches if an is at the start of the string.
\b
Expression | String | Match? |
---|---|---|
\bdis | diss track | Match |
\bdis | a disco | Match |
\bdis | adisco | No Match |
nt\b | bent | Match |
nt\b | aunt | Match |
nt\b | act | No Match |
We can see that \b matches if the specified characters
- \bdis - are at the beginning of a word
- nt\b - are at the end of a word
\B
Expression | String | Match? |
---|---|---|
\Bdis | diss track | No Match |
\Bdis | a disco | No Match |
\Bdis | adisco | Match |
nt\B | bent | No Match |
nt\B | aunt | No Match |
nt\B | ants | Match |
We can see that \B is the opposite of \b. That is, it matches if the specified characters are not at the beginning or end of a word.
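The contrast between \b and \B can be sketched by running both against the same strings. Raw strings are used so that Python passes \b to the regex engine unchanged.

```python
import re

texts = ['diss track', 'a disco', 'adisco']

# \bdis: dis must begin a word
word_start = [bool(re.search(r'\bdis', text)) for text in texts]
print(word_start)      # [True, True, False]

# \Bdis: dis must not begin a word
not_word_start = [bool(re.search(r'\Bdis', text)) for text in texts]
print(not_word_start)  # [False, False, True]
```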
\d
Expression | String | Match? |
---|---|---|
\d | h3llo | 1 Match |
\d | hello | No Match |
Here, \d matches any decimal digit [0-9].
\D
Expression | String | Match? |
---|---|---|
\D | 1234 | No Match |
\D | h3llo | 4 Matches |
We can see that \D is the opposite of \d. That is, it matches any character that is not a decimal digit.
\s
Expression | String | Match? |
---|---|---|
\s | hello world | 1 Match |
\s | helloworld | No Match |
Here, \s matches where a string contains any whitespace character.
\S
Expression | String | Match? |
---|---|---|
\S | x y | 2 Matches |
\S | x | 1 Match |
Here, \S matches where a string contains any non-whitespace character.
\w
Expression | String | Match? |
---|---|---|
\w | 67%;gt | 4 Matches |
\w | !>%" | No Match |
Here, \w matches any alphanumeric character (digits and letters) as well as the underscore.
\W
Expression | String | Match? |
---|---|---|
\W | !>%" | 4 Matches |
\W | hello | No Match |
\W is the opposite of \w. It matches any character that is not a digit, a letter, or an underscore.
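Here is a small sketch of how \w and \W divide up a string between them. The sample text is made up and includes an underscore to show that \w matches it too.

```python
import re

text = 'user_1 @ 67%'

# letters, digits, and the underscore
word_chars = re.findall(r'\w', text)
print(word_chars)      # ['u', 's', 'e', 'r', '_', '1', '6', '7']

# everything else, including spaces
non_word_chars = re.findall(r'\W', text)
print(non_word_chars)  # [' ', '@', ' ', '%']
```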
\Z
Expression | String | Match? |
---|---|---|
coding\Z | I love coding | 1 Match |
coding\Z | coding is fun | No Match |
Here, \Z matches if 'coding' is at the end of the string.
The re.search() Function
In Python, the re.search() function scans the string for the regex pattern and returns the first occurrence as a match object (or None if there is no match).
It is slightly different from re.match(), which only checks for a match at the beginning of the string.
Let's see an example,
import re
# test string
string1 = 'Nepal is beautiful'
string2 = 'Datamentor for beginners'
# check if 'Nepal' is at the beginning of string1
result1 = re.search(r'\ANepal', string1)
# check if 'beginners' is at the beginning of string2
result2 = re.search(r'\Abeginners', string2)
# print boolean value
print('Result for string1:', bool(result1)) # True
print('Result for string2:', bool(result2)) # False
Output
Result for string1: True
Result for string2: False
In the above example, we first imported a module named re and used the re.search() function to search for the pattern.
Here, re.search() takes two parameters:
- \ANepal and \Abeginners - the patterns; \A matches if the given word is at the start of a string
- string1 and string2 - the string in which the pattern is checked
Since,
- 'Nepal' is at the beginning of string1, bool() returns True
- 'beginners' is not at the beginning of string2, bool() returns False
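The difference between re.match() and re.search() is easiest to see when the pattern occurs later in the string. A minimal sketch, reusing the sample sentence from above:

```python
import re

text = 'Nepal is beautiful'

# re.match() only tries the pattern at the beginning of the string
at_start = re.match(r'beautiful', text)
print(at_start)         # None

# re.search() scans the whole string for the first occurrence
anywhere = re.search(r'beautiful', text)
print(anywhere.span())  # (9, 18)
```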
The re.split() Function
The re.split() function in Python splits the string at each match and returns a list. For example,
import re
# test string
string1 = 'Nepal is beautiful'
# split string1 at each whitespace character
result1 = re.split(r'\s', string1)
# print the resulting list
print(result1)
# Output: ['Nepal', 'is', 'beautiful']
In the above example, we have used the re.split() function to split the string named string1.
Here, re.split() with the pattern \s splits string1 at each whitespace character.
Note: We can use other special sequences inside re.split() to split the given string.
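As an illustration of the note above, the following sketch splits on digits instead of whitespace. It also uses the optional maxsplit argument of re.split(), which this tutorial does not otherwise discuss, to limit the number of splits.

```python
import re

# split on one or more digits
parts = re.split(r'\d+', 'one1two22three')
print(parts)        # ['one', 'two', 'three']

# maxsplit limits the number of splits performed
first_split = re.split(r'\s', 'Nepal is beautiful', maxsplit=1)
print(first_split)  # ['Nepal', 'is beautiful']
```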
The re.findall() Function
In Python, the re.findall() function returns a list of strings containing all matches. For example,
import re
string1 = 'H3ll0 W0R1D'
pattern = r'\D+'
# extract non-digits from a string
result = re.findall(pattern, string1)
print(result)
# Output: ['H', 'll', ' W', 'R', 'D']
Here, the re.findall() function returns a list that contains the runs of non-digit characters from string1.
Note: re.findall() returns an empty list if the pattern is not found in the string.
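The empty-list case mentioned in the note can be seen in a short sketch that looks for digits in two different strings.

```python
import re

string1 = 'H3ll0 W0R1D'

# every single digit in string1
digits = re.findall(r'\d', string1)
print(digits)   # ['3', '0', '0', '1']

# no digits at all: findall() returns an empty list, not None
no_hits = re.findall(r'\d', 'hello')
print(no_hits)  # []
```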
The re.sub() Function
The re.sub() function in Python returns a string after replacing the matched occurrences in a string with a replacement string. For example,
import re
string1 = 'Hello World'
# replacement string
replace = 'Hola'
# matches 'Hello' only at the start of the string
pattern = r'\AHello'
# replace 'Hello' with 'Hola'
result = re.sub(pattern, replace, string1)
print(result)
# Output: Hola World
In the above example, we have used the re.sub() function to replace 'Hello' with 'Hola' in string1.
re.sub() returns the original string if the pattern is not found.
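By default re.sub() replaces every match. The sketch below contrasts that with the optional count argument, which is part of re.sub() but not covered above, and also shows the no-match case.

```python
import re

# replace only the first digit by passing count=1
first_only = re.sub(r'\d', '#', 'H3ll0 W0R1D', count=1)
print(first_only)  # H#ll0 W0R1D

# replace every digit
all_digits = re.sub(r'\d', '#', 'H3ll0 W0R1D')
print(all_digits)  # H#ll# W#R#D

# no match: the original string comes back unchanged
unchanged = re.sub(r'\AHola', 'Hello', 'Hello World')
print(unchanged)   # Hello World
```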
Python Match Object
The match object in Python contains all the information about the search and the result. For example,
import re
# test string
string1 = 'Nepal is beautiful'
# result contains match object
result = re.search(r'\ANepal', string1)
print(result)
Output
<re.Match object; span=(0, 5), match='Nepal'>
Here, the result variable contains a match object.
Methods and Attributes of Python Match Object
Some of the commonly used methods and attributes of match objects are:
match.group()
The group() method returns the matched substring. For example,
import re
string1 = 'Employee ID 2032 1111'
# Two digit number followed by space followed by three digit number
pattern = r'(\d{2}) (\d{3})'
# match variable contains a Match object.
match = re.search(pattern, string1)
# get substring
print('Whole Substring:', match.group())
# get first part of substring
print('First part of substring:', match.group(1))
# get second part of substring
print('Second part of substring:', match.group(2))
Output
Whole Substring: 32 111
First part of substring: 32
Second part of substring: 111
In the above example, we have used the group() method to return the matched substring from the string named string1.
Here, the pattern (\d{2}) (\d{3}) means: a two-digit number followed by a space followed by a three-digit number.
To get the matched substring we have used
- match.group() - to get the whole substring
- match.group(1) - to get the first part of the substring
- match.group(2) - to get the second part of the substring
match.start(), match.end(), and match.span()
- The start() method returns the index of the start of the matched substring
- The end() method returns the end index of the matched substring
- The span() method returns a tuple containing the start and end index of the matched substring
Let's see an example,
import re
string = 'Employee ID 2032 1111'
# Two digit number followed by space followed by three digit number
pattern = r'(\d{2}) (\d{3})'
# match variable contains a Match object.
match = re.search(pattern, string)
print('Matched Substring Start Index:', match.start())
print('Matched Substring End Index:', match.end())
print('Tuple of Matched Substring Start and End Index:', match.span())
Output
Matched Substring Start Index: 14
Matched Substring End Index: 20
Tuple of Matched Substring Start and End Index: (14, 20)
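Since re.search() returns None when nothing matches, calling start(), end(), or span() on the result without checking can fail. The sketch below guards against that. The describe() helper is a hypothetical name used only for this illustration.

```python
import re

pattern = r'(\d{2}) (\d{3})'

def describe(text):
    """Return the matched span, or None when the pattern is absent."""
    match = re.search(pattern, text)
    if match is None:
        # calling match.span() here would raise AttributeError
        return None
    return match.span()

found = describe('Employee ID 2032 1111')
missing = describe('Employee ID unknown')
print(found)    # (14, 20)
print(missing)  # None
```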
Raw String in Python
A raw string is useful if we want to treat backslash (\) as a literal character.
For example, '\n' is a new line whereas r'\n' means two characters: a backslash \ followed by n.
Let's understand raw string with the help of an example,
import re
# \n to get new line
string1 = 'Hello\nWorld'
print("Escape Character:", string1)
# prefix r to treat \n as a normal character
string2 = r'Hello\nWorld'
print('Raw String:', string2)
Output
Escape Character: Hello
World
Raw String: Hello\nWorld
Using r prefix before RegEx
In Python, we can add the r prefix to a regular expression so that backslashes reach the regex engine unchanged. For example,
import re
# test string
string1 = '\t Programming \n is \r fun.'
pattern = r'[\t\n\r]'
# find \t,\n, and \r in string1
result = re.findall(pattern, string1)
print(result)
# Output: ['\t', '\n', '\r']
Here, we first prefixed r to the regular expression pattern as
pattern = r'[\t\n\r]'
and then used re.findall() to return a list of strings containing all matches.
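The r prefix matters most for sequences that Python itself also interprets. As a small sketch, '\b' in a normal string is the backspace character, so the word-boundary pattern only works as intended when written as a raw string. The sample text is illustrative.

```python
import re

text = 'a cat sat'

# in a normal string, '\b' is the backspace character, not a word boundary
plain = re.search('\bcat\b', text)
print(plain)        # None

# in a raw string, the regex engine receives \b and treats it as a word boundary
raw = re.search(r'\bcat\b', text)
print(raw.group())  # cat
```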