How to count blank lines?
Most people consider a line with just spaces and tabs to be blank
How to extract phone numbers from a text file?
Many valid formats: 416-111-2222, (416) 111 2222, etc. but not 4 161 112 222
Examining characters one by one is painful
Use regular expressions instead
Represent patterns as strings
Search for substrings that match patterns
Warning: the notation is ugly
Only so many characters on the keyboard
Most characters will simply match themselves
Metacharacters are special, and don't match themselves.
Here is a complete list of metacharacters: . ^ $ * + ? { [ ] \ | ( )
Escape metacharacters to match a literal character
E.g. \+
matches a literal plus sign
Character classes are specified by square brackets
[abc]
or [a-c]
will match any of the characters "a", "b", or "c"
[^a-c]
matches everything except a to c
N.B. If you want to match a hyphen inside a character class, make sure to escape the hyphen!
Use escape sequences for common character sets
\d |
Digits | [0-9] |
\w |
Words | [a-zA-Z0-9_] |
\s |
Space | [ \t\n\r\f\v] |
. |
Anything except end-of-line | [^\n] |
\D |
Any non-digit character | [^0-9] |
\W |
Non-word character | [^a-zA-Z0-9_] |
\S |
Non-space character | [^ \t\n\r\f\v] |
Pattern | Matches | Explanation |
a* |
'', 'a', 'aa', ... |
Zero or more |
b+ |
'b', 'bb', ... |
One or more |
ab?c |
'ac', 'abc' |
Optional (zero or one) |
[abc] |
'a', 'b', 'c' |
One from a set |
[a-c] |
'a', 'b', 'c' |
Abbreviation |
[abc]* |
'', 'accb', ... |
Combination |
Load the re
module
Use re.search(pattern, text)
#!/bin/env python2.3 import sys, re pat = sys.argv[1] for text in sys.argv[2:]: if re.search(pat, text): result = "FOUND" else: result = "NOT FOUND" print pat, text, resultSave the above program as "testMatch.py", and run the below command.
$ testMatch "a.*a.+a" salamander aardvark snuffeluffegus salamander FOUND aardvark FOUND snuffeluffegus NOT FOUND
Note quotes around pattern on command line
Otherwise, shell tries to interpret the '*'
And notice that a pattern doesn't have to match all of the text
Force position of match using anchors
^
matches beginning of line
$
matches end
Neither consumes any characters
Pattern | Text | Result |
b+ |
abbc |
Matches |
^b+ |
abbc |
Fails (no b at start) |
^a*$ |
aabaa |
Fails (not all a 's) |
Can use \A
and \Z
(begin/end
of string) instead of
^
and $
if no newlines in the string
Match actual ^
and $
using escape sequences \^
and \$
Must represent these in strings as "\\^"
and "\\$"
Two layers of compilation:
Python/Java turn double backslashes into single backslash character
Regular expression library then compiles single backslash plus something into special operation
Idea: Get rid of the first layer of compilation
We want our strings to be passed directly into the regular expression engine
To allow this, use "raw" strings
Raw strings leave all backslashes "as-is" in the string
So r"\n": is a literal backslash followed by an n
And r"\"" is a literal backslash followed by a quote
When you include this raw string as part of a regular expression, the regexp engine parses the backslash-n and the backslash-quote and converts them into newlines and quotes.
Common escape sequences
\t |
Tab |
\n |
Newline |
\* |
Asterisk |
\\ |
Backslash |
\b |
Break between word and space |
Double-backslashing can give you headaches, but it just looks scarier than it trully is
Here's one example where escaping is quite important:
import sys import re for line in sys.stdin: print line, if re.search("\section",line): print "FOUND PAT1" else: print "NOT FOUND PAT1" if re.search("\\section",line): print "FOUND PAT2" else: print "NOT FOUND PAT2" if re.search("\\\\section",line): print "FOUND PAT3" else: print "NOT FOUND PAT3" if re.search(r"\\section",line): print "FOUND PAT4" else: print "NOT FOUND PAT4"For the input file:
this is a Latex file \section{Section1}The output will be:
this is a Latex file NOT FOUND PAT1 NOT FOUND PAT2 NOT FOUND PAT3 NOT FOUND PAT4 \section{Section1} NOT FOUND PAT1 NOT FOUND PAT2 FOUND PAT3 FOUND PAT4
And one where it doesn't make too much of a difference:
import sys import re for line in sys.stdin: print line, if re.search("\\t",line): print "FOUND PAT1" else: print "NOT FOUND PAT1" if re.search("\t",line): print "FOUND PAT2" else: print "NOT FOUND PAT2" if re.search(r"\t",line): print "FOUND PAT3" else: print "NOT FOUND PAT3" if re.search(r"\\t",line): print "FOUND PAT4" else: print "NOT FOUND PAT4"For the input file:
3 5 3\t5The output will be:
3 5 FOUND PAT1 FOUND PAT2 FOUND PAT3 NOT FOUND PAT4 3\t5 NOT FOUND PAT1 NOT FOUND PAT2 NOT FOUND PAT3 FOUND PAT4
Moral of the story?
Use r""
to keep RE readable!
import sys, re # start of line, any number of spaces, tabs, carriage returns, # and newlines, end of line blank = r"^\s*$" count = 0 for line in sys.stdin: if re.search(blank, line): count += 1 print count
Result of re.search()
is a match object
mo.group()
returns string that matched
mo.start()
and mo.end()
are the match's location
mo = re.search(r"b+", "abbcb") print mo.group(), mo.start(), mo.end() bb 1 3
All parenthesized sub-patterns are remembered
Text that matched Nth parentheses (counting from left) is group N
pat = r"\s*(\w+)\s+(\d+)\s*" for line in sys.stdin: mo = re.search(pat,line) if mo: name = mo.group(1) grade = mo.group(2) print name,"has a score of",grade,"on A1"Input file:
John 80 Anne 92 Dan 34Output:
John has a score of 80 on A1 Anne has a score of 92 on A1 Dan has a score of 34 on A1
cols = r"\s*(\d+)\s+(\d+)\s*" for line in sys.stdin: mo = re.match(cols, line) if mo: a, b = mo.group(1), mo.group(2) print "%s\t%s" % (b, a)
import re if re.match(r"box","inbox"): print "match succeeded" else: print "match failed" if re.search(r"box","inbox"): print "search succeeded" else: print "search failed" match failed search succeeded
Regular expression library compiles patterns into more concise form for matching
Can improve performance by doing this once, and re-using the compiled form
The compiled pattern is an object that has RE methods
nameCase = r"[^A-Z]*([A-Z][a-z]*)(.*)" matcher = re.compile(nameCase) for line in sys.stdin: mo = matcher.search(line) while mo: print mo.group(1) mo = matcher.search(mo.group(2))
What does the program output given the following input:
Everybody loves CSC401.Everybody
What does this program do?
It is supposed to print all of the words starting with capital letters.Pattern | Matches |
a|b |
'a', 'b' |
ab|cd |
'ab', 'cd' |
a(b|c)d |
'abd', 'acd' |
(\w)\1 |
'aa', 'bb', 'zz', '11', ... |
a{2,3} |
'aa', 'aaa' |
re.sub(pattern=..., replacement=... , string=...)
Returns the string obtained by replacing every non-overlapping occurrence of the pattern
in the string
by the replacement
.
If the pattern isn't found, string is returned unchanged.
Example: replace colour names with the word "colour"
>>> p.sub( '(blue|white|red)', 'colour', 'blue socks and red shoes') 'colour socks and colour shoes' >>> p.sub( '(blue|white|red)', 'colour', 'blue socks and red shoes', count=1) 'colour socks and red shoes'
Empty matches are replaced only when they're not adjacent to a previous match
>>> import re >>> re.sub('x*','-', 'abxd') '-a-b-d-'
You will need to expand clitics in assignment one.
re.split(pattern=..., string=..)
Splits string into a list of substrings separated by the pattern in the regular expression.
Example
>>> import re >>> re.split(r'\s+', 'The Cat in the Hat'); ['The','Cat','in','the','Hat'] >>> re.split(r'(\s+)', 'The Cat in the Hat'); ['The',' ', 'Cat', ' ', 'in', ' ', 'the', ' ', 'Hat'] >>> re.split(r'(a)(t)', 'The Cat in the Hat'); ['The C', 'a', 't', ' in the H', 'a', 't', ''] >>> re.split(r'a(t)', 'The Cat in the Hat'); ['The C', 't', ' in the H', 't', '']
Note: groups in patterns result in the matched sub-string being interleaved in the split list
Hint: This is good for tokenizing sentences
Remove whitespaces in time representation from a scheduler
(only whitespaces between a colon and a digit)
Input file:
Dentist 10: 30AM CSC 401 8:15PM
Solution 1:
pat = re.compile('(:)\s+(\d+)') for line in sys.stdin: print pat.split(line) ['Dentist 10', ':', '30', 'AM\n'] ['CSC 401 8:15PM\n']
That didn't look too good, let's try Solution 2:
pat = re.compile('(:)\s+(\d+)') for line in sys.stdin: print "".join(pat.split(line)) Dentist 10:30AM CSC 401 8:15PM
And now Solution 3, using only REs:
pat1 = re.compile(':\s+\d') def rem_spaces_in_hour(mo): return mo.group(0)[0]+mo.group(0)[-1] for line in sys.stdin: print pat1.sub(rem_spaces_in_hour,line) Dentist 10:30AM CSC 401 8:15PM
Module provides many other tools
findall(pattern, string)
finditer(pattern, string)
sub(old, new, string, count=all)
You'll need these for your assignment