Regular expressions#
Regular expressions are an immensely powerful tool built into most modern programming languages. They are a kind of formal grammar for describing sets of strings, letting you test whether a string matches (or fails to match) a particular rule. Common uses range from checking whether user input conforms to a desired pattern (e.g., three digits, followed by two digits, followed by three digits) to all sorts of complicated search-and-replace operations, both within text files and on things like file names.
There are entire books written on regular expressions as well as comprehensive online references. We’ll only concern ourselves with a few basics here.
Start by looking over the first 10 lessons of this tutorial (they’re very quick), paying special attention to the sidebar on the right, which I reproduce below.
Now go to this snazzy interactive regexp matcher and play around with it to get a feel for the syntax.
Then go through the rest of this notebook and make sure you understand why each regexp works in the way it does.
Syntax | Meaning
--- | ---
abc… | Literal letters
\d | Any digit
\D | Any non-digit character
. | Any single character
\. | Period (backslash is an escape character)
[abc] | Only a, b, or c
[^abc] | Not a, b, nor c
[a-z] | Characters a to z
[A-Z] | Characters A to Z
[0-9] | Numbers 0 to 9
\w | Any alphanumeric character (or underscore)
\W | Any non-alphanumeric character
{m} | m repetitions
{m,n} | m to n repetitions
* | Zero or more repetitions
? | Optional character (0 or 1)
+ | One or more repetitions
\s | Any whitespace
\S | Any non-whitespace character
^ | Start of string (or line for multiline matching)
$ | End of string (or line for multiline matching)
(…) | Capture group (for capturing matches and backreference)
(a(bc)) | Capture sub-group
(.*) | Capture all
(abc|def) | Matches abc or def
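A few rows of the table can be tried directly in Python. This is only a minimal sketch: the sample strings are made up, and the hyphen separator in the digit pattern is an assumption added for the example.

```python
import re

# \d{3}, \d{2}, \d{3}: three digits, two digits, three digits
# (the hyphens between the groups are an assumption for this example)
pattern = re.compile(r'^\d{3}-\d{2}-\d{3}$')
print(bool(pattern.search('123-45-678')))   # True
print(bool(pattern.search('123-456-78')))   # False

# [^abc] matches any single character other than a, b, or c
print(re.findall(r'[^abc]', 'aabbccxyz'))   # ['x', 'y', 'z']

# ? makes the preceding character optional
print(re.findall(r'dogs?', 'dog dogs'))     # ['dog', 'dogs']

# (abc|def) matches either alternative and captures it
print(re.findall(r'(abc|def)', 'abcdefabc'))  # ['abc', 'def', 'abc']

# \. is a literal period, whereas a bare . matches any character
print(re.findall(r'\.', 'a.b.c'))           # ['.', '.']
```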
Use as a filter#
Let’s begin by reading in a file containing a bunch of words from the American National Corpus that have a frequency of at least 9. Here’s a sample of what this file looks like.
word lemma pos freq
the the DT 1081168
of of IN 539793
and and CC 466737
to to TO 448519
a a DT 406057
in in IN 360853
is be VBZ 192975
For those unfamiliar with language lingo, English lemmas are basically the word-stems, e.g., the lemma of cars is car; the lemma of walking is walk. pos stands for part of speech.
import re #import the python regexp module
import pandas as pd
data = pd.read_csv('',encoding = "ISO-8859-1",sep="\t")
words = list(set(data['word']))[1:]
print (f"We have {len(words)} unique words")
We have 48316 unique words
Now let’s use some regular expressions, starting with simple ones and moving on to ever slightly more complicated ones.
Grab words beginning with q
[curWord for curWord in words if re.findall('^q',curWord)]
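The same filtering idiom recurs throughout this section, so it can be wrapped in a small helper. This is a sketch, not part of the original notebook: `filter_words` and `sample_words` are hypothetical names, and the short word list stands in for the corpus. It uses `re.search`, which stops at the first match and is enough for a yes/no test.

```python
import re

def filter_words(pattern, words):
    """Return the words in which `pattern` matches somewhere."""
    regex = re.compile(pattern)
    return [w for w in words if regex.search(w)]

# Hypothetical stand-in for the corpus word list
sample_words = ['queen', 'quiz', 'alibi', 'antiquity', 'bat']

print(filter_words(r'^q', sample_words))       # ['queen', 'quiz']
print(filter_words(r'^a\w+i$', sample_words))  # ['alibi']
```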
Grab words containing l, m, n, and o, in that order, with at least one character between each pair
[curWord for curWord in words if re.findall('l{1}.+m{1}.+n{1}.+o{1}',curWord)]
Grab all words that begin with an a and end with an i
[curWord for curWord in words if re.findall(r'^a\w+i$',curWord)]
Grab all words that begin with an a, followed by 4-6 letters, and end on an i
[curWord for curWord in words if re.findall(r'^a\w{4,6}i$',curWord)]
Grab words that start with a b, end on a t, and contain a t somewhere in the middle
[curWord for curWord in words if re.findall(r'^b\w+t\w+t$',curWord)]
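Let’s say we want to exclude words that end on two ts. A character class matches a single character, so putting [^t] right before the final t is all we need.
[curWord for curWord in words if re.findall(r'^b\w+t\w+[^t]t$',curWord)]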
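A character class always stands for exactly one character, so [^tt] and [^t] describe the same class. The sketch below checks that on a small made-up list; `sample_words` is hypothetical and "batmitt" is not a real word, it is only there to give the filter something to reject.

```python
import re

# Hypothetical test strings ("batmitt" is invented to end in a double t)
sample_words = ['bathmat', 'bitterest', 'batmitt', 'bat']

base = re.compile(r'^b\w+t\w+t$')               # b ... t ... t
doubled = re.compile(r'^b\w+t\w+[^tt]t$')       # repeated t inside the class
single = re.compile(r'^b\w+t\w+[^t]t$')         # same class, written once

print([w for w in sample_words if base.search(w)])
# ['bathmat', 'bitterest', 'batmitt']
print([w for w in sample_words if doubled.search(w)])
# ['bathmat', 'bitterest']
print([w for w in sample_words if single.search(w)])
# ['bathmat', 'bitterest']
```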
Let’s get all the words containing the vowels a, e, i, o, in that order
[curWord for curWord in words if re.findall(r'\w+a+\w+e+\w+i+\w+o+',curWord)]
You know that saying, “i before e, except after c” (in which case it’s i after e, as in receive)? Let’s see how well this mnemonic holds up.
Let’s find out how many words there are that have ie vs. ei in them.
print ("ie words:", len([curWord for curWord in words if re.findall('ie',curWord)]))
print ("ei words:", len([curWord for curWord in words if re.findall('ei',curWord)]))
ie words: 1439
ei words: 483
Now let’s check what happens when we check for a ‘c’ preceding ie/ei
print ("cie words:", len([curWord for curWord in words if re.findall('cie',curWord)]))
print ("cei words:", len([curWord for curWord in words if re.findall('cei',curWord)]))
cie words: 107
cei words: 33
There are actually more words that violate the mnemonic than those that obey it! What are these words?
[curWord for curWord in words if re.findall('cie',curWord)]
Here’s a tricky one. Let’s find words containing 4 rs (interspersed among other letters). One way to do this is to spell it out explicitly: any characters, r, any characters, r, and so on. Like so:
[curWord for curWord in words if re.findall(r'\w*r+\w*r+\w*r+\w*r+\w*',curWord)]
There are two shortcomings to this approach. The first is that if we want 3 or 5 matches, we need to explicitly remove or add code rather than changing a single number-of-matches parameter. The second is that hyphenated words are excluded. We can add hyphens by replacing \w with [a-z\-], but that makes the expression even longer. Here’s a better solution:
[curWord for curWord in words if re.findall('^([^r]*r[^r]*){4}$',curWord)]
Let’s unpack that. We are matching a group, which is demarcated by parentheses. The group pattern is: not-an-r (0 or more times), an r, and then not-an-r (0 or more times). We want this group to repeat exactly 4 times, and the ^ and $ anchors make those four repetitions span the whole word; without the ^, words with more than four rs would slip through as well. That gives us all the words containing exactly four rs with anything in between them (including nothing, hence grrrr).
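To make the number of repetitions a genuine parameter, the group-repetition pattern can be built inside a function. This is a sketch under assumptions: `words_with_n` and `sample_words` are hypothetical names, and the pattern is anchored at both ends so that the count is exact.

```python
import re

def words_with_n(letter, n, words):
    """Return the words containing exactly n occurrences of `letter`."""
    ch = re.escape(letter)  # keeps special characters such as . or ^ literal
    # Builds e.g. ^([^r]*r[^r]*){4}$ for letter='r', n=4
    regex = re.compile(rf'^([^{ch}]*{ch}[^{ch}]*){{{n}}}$')
    return [w for w in words if regex.search(w)]

# Hypothetical stand-in for the corpus word list
sample_words = ['grrrr', 'refrigerator', 'mirror', 'terror-stricken', 'barrier']

print(words_with_n('r', 4, sample_words))
# ['grrrr', 'refrigerator', 'terror-stricken']
print(words_with_n('r', 3, sample_words))
# ['mirror', 'barrier']
```

Because [^r] matches anything that is not an r, hyphenated words are handled without any extra work.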
Use in place of conditionals#
Let’s say we want to check whether an entered word is color or the British colour. We can do this with a conditional (if word == "color" or word == "colour"), but we can also use regular expressions (which scale much better than conditionals). For example:
re.findall('colou?r','The British like to colour their colors')
['colour', 'color']
Unlike conditionals, this approach easily scales to, e.g., all cases where a non-initial ‘o’ is followed by either an ‘r’ or a ‘ur’. What regexp would match color/colour, favor/favour, humor/humour, neighbor/neighbour, (but not match or/our)?
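One possible answer, offered as a sketch rather than the intended solution: require at least one word character before the o, and make the u optional. The sample sentence is made up. Note that this pattern will also accept other words ending in or/our (door, four), which may or may not be what you want.

```python
import re

# \b marks a word boundary; \w+ demands at least one character before the o,
# which is what rules out the bare words "or" and "our"
or_our = re.compile(r'\b\w+ou?r\b')

text = 'color colour favor favour humor humour neighbor neighbour or our'
print(or_our.findall(text))
# ['color', 'colour', 'favor', 'favour', 'humor', 'humour', 'neighbor', 'neighbour']
```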
Here’s another example of a regexp, this one using a series of character classes, each of which acts as a disjunction (OR) over single characters. It matches “dog”, “cat”, “cag”, and “cog”, but not “got”. No | is needed inside square brackets; there it would be treated as a literal character. To get the matching string from the match object, use .group(), i.e., variousWords.match('cat').group() will return “cat”.
variousWords = re.compile('[dc][ao][gt]')
for word in ['cat', 'dog', 'cog', 'cag']:
    print(variousWords.match(word))
<re.Match object; span=(0, 3), match='cat'>
<re.Match object; span=(0, 3), match='dog'>
<re.Match object; span=(0, 3), match='cog'>
<re.Match object; span=(0, 3), match='cag'>
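Inside square brackets, | is an ordinary character rather than an OR operator. The sketch below shows the difference between a class that contains pipes, a class without them, and explicit alternation with parentheses; the test strings are made up.

```python
import re

with_pipes = re.compile(r'[d|c][a|o][g|t]')
without_pipes = re.compile(r'[dc][ao][gt]')
alternation = re.compile(r'(d|c)(a|o)(g|t)')

# A class containing | also accepts a literal pipe character
print(bool(with_pipes.match('|og')))       # True
print(bool(without_pipes.match('|og')))    # False
print(bool(alternation.match('|og')))      # False

# All three agree on ordinary words
print(without_pipes.match('cat').group())  # cat
print(alternation.match('cat').group())    # cat
print(without_pipes.match('got'))          # None
```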
Here are some more examples.
import re
#will match any numbers
anyNums = re.compile('[0-9]+')
anyNums.findall('There are 99 bottles of beer on the wall. 999....') #will return all matches
anyNums.search('There are 99 bottles of beer on the wall. 999....').group() #will return just the first occurrence
#two digit numbers from 00 to 59 or 80 to 89
someNums = re.compile('[0-5][0-9]|[8][0-9]')
matches = [someNums.match(x).group() for x in 'It will match 54, 52, and 88, but not 7 or 92 or any of the letters'.split(' ') if someNums.match(x)]
#We don’t need to compile regular expressions using re.compile, but it speeds things up when using the same rule over a large corpus.
emailRegExGrouped = re.compile(r'([\w.-]+)@([\w.-]+)')
#the parenthesis allow us to access groups -- the first group corresponds to the first matched part (before the @). The second group to the domain (e.g.,'').groups()
('g.lupyan', '')
#returns [('g.lupyan', ''), ('lupyan', '')]
#to get all the domains:
[email[1] for email in emailRegExGrouped.findall('')]
#returns ['', '']
['', '']
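Here is a self-contained sketch of the grouped email pattern in action. The addresses are made-up placeholders on example domains, not the ones used in the original notebook, and `email_regex` and `text` are hypothetical names.

```python
import re

email_regex = re.compile(r'([\w.-]+)@([\w.-]+)')

# Made-up addresses for illustration only
text = 'Write to jane.doe@example.com or j_smith@mail.example.org today'

# .groups() on the first match: (part before the @, domain)
print(email_regex.search(text).groups())
# ('jane.doe', 'example.com')

# findall returns one tuple of groups per match
print(email_regex.findall(text))
# [('jane.doe', 'example.com'), ('j_smith', 'mail.example.org')]

# to get all the domains, take the second group of each match
print([email[1] for email in email_regex.findall(text)])
# ['example.com', 'mail.example.org']
```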
Search and replace#
All good text editors allow you to use regular expressions in search and replace.
A simple usage case is searching for lines that begin or end with a certain character sequence. To find lines that begin with “ab”, search for ^ab. To find lines that end on ies, search for ies$. If you’re trying to replace a string with some variant of the matched string, you’ll want to use capture groups.
Make sure to enable regular-expression search by clicking on the .* button in the lower-left corner in Sublime Text, or by checking the appropriate box (sometimes labeled “Grep”) in other text editors.
When using regular expressions in search/replace, it becomes useful to use matching groups.
For example, suppose you want to replace the occurrences of the following strings, which occur at the start of each line:
You could manually do search replaces for each one. But if you have a hundred of these, that gets tedious fast and is a recipe for errors.
Here’s a much better solution. Simply search for:
and replace with
The \2 refers to the second group, i.e., the number
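The same kind of group-based replacement can be done in Python with re.sub. The strings and patterns below are hypothetical stand-ins, not the original example: each line starts with a label followed by a number, and the replacement keeps only the number by referring to the second group.

```python
import re

# Hypothetical lines, each starting with a label followed by a number
text = 'subj1 fast\nsubj22 slow\nsubj303 fast'

# Group 1 is the label, group 2 is the number; \2 in the replacement
# inserts whatever the second group matched.
# re.MULTILINE makes ^ match at the start of every line.
result = re.sub(r'^(subj)(\d+)', r'participant_\2', text, flags=re.MULTILINE)
print(result)
# participant_1 fast
# participant_22 slow
# participant_303 fast
```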
Here’s another example. Delete all the lines that start with some letters and end in ‘ing’:
Replace with: nothing
Now you can do another search and replace, searching for
and replacing with
To get rid of multiple newlines that the first search/replace may have created.
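In Python, the two-step clean-up might look like the sketch below. Both patterns are assumptions standing in for the ones used in the editor, and the sample text is made up.

```python
import re

# Made-up text with one word per line
text = 'walking\ncat\nrunning\nsinging\ndog\n'

# Step 1: blank out every line that consists of letters ending in "ing".
# re.MULTILINE makes ^ and $ match at the start and end of each line.
step1 = re.sub(r'^[a-zA-Z]+ing$', '', text, flags=re.MULTILINE)
print(repr(step1))   # '\ncat\n\n\ndog\n'

# Step 2: collapse each run of two or more newlines into a single newline
step2 = re.sub(r'\n{2,}', '\n', step1)
print(repr(step2))   # '\ncat\ndog\n'
```

The first substitution empties the matching lines but leaves their newlines behind, which is why the second pass is needed.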
Batch file renaming#
You can use what you’ve learned about regular expressions for manipulating not just actual text, but text used in file names. You can do this in straight-up Python using the os library, or by using GUI programs like NameChanger (Mac) or Bulk Rename (PC). These programs allow you to do batch renaming of files using simple search/replace (e.g., replace _ with -) as well as by using regular expressions for more complex changes!
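A rough sketch of batch renaming in plain Python follows. It uses pathlib rather than os directly (a substitution made here for brevity), `batch_rename` is a hypothetical helper, and the file names are invented. It does not guard against two files ending up with the same new name, so try it with dry_run=True first.

```python
import re
import tempfile
from pathlib import Path

def batch_rename(folder, pattern, replacement, dry_run=True):
    """Apply re.sub to every file name in `folder`; return (old, new) pairs."""
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        new_name = re.sub(pattern, replacement, path.name)
        if new_name != path.name:
            renamed.append((path.name, new_name))
            if not dry_run:
                path.rename(path.with_name(new_name))
    return renamed

# Demonstration in a throwaway directory with invented file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ['subj_01_pre.txt', 'subj_02_pre.txt', 'notes.md']:
        (Path(tmp) / name).touch()
    changes = batch_rename(tmp, r'_', '-', dry_run=False)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(changes)
# [('subj_01_pre.txt', 'subj-01-pre.txt'), ('subj_02_pre.txt', 'subj-02-pre.txt')]
print(remaining)
# ['notes.md', 'subj-01-pre.txt', 'subj-02-pre.txt']
```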