Python Recipe: Read a file, search for a pattern, print your matches.
Our first two recipes focused primarily on how to open one or more files and loop through them line by line. While we paid a little attention to how we could search for patterns using regular expressions, we didn’t try to do a whole lot with what we caught. Hell, we didn’t even try very hard to write a good regex.
But when you start to get serious about searching for patterns in text, one of the obvious goals is to single out and collect your matches. Maybe you want to pull all the phone numbers out of big blobs of text. Or email addresses. Or anything enclosed in quotation marks. Whatever.
Here’s one way to try it.
But before we get going, let me just say that I’m going to assume you read the first couple recipes and won’t be working too hard to explain the stuff covered there. And keep in mind that my keystrokes are coming right off my home computer, which runs Ubuntu Linux. I’ll try to provide Mac and Windows translations as we go, but I might muck a phoneme here and there. If anything is screwed up and doesn’t work on your end, just shoot me an email or drop a comment. We’ll iron it out.
Formalities aside, here’s the example task I’ve selected to achieve our mission.
- Download the King James Version of the Holy Bible.
- Read through each line of text.
- Capture each four-letter word.
- Print them out.
Let’s do it.
1. Open the command line, create a working directory, move there.
We’re going to start the same way we did in the first two lessons, creating a working folder for all our files and moving in with our command line.
cd Documents/
mkdir py-search-and-capture
cd py-search-and-capture/
The commands should work just as well on a Mac as in Linux. If you’re working in Windows, you’ll be on the “C:\” file structure, rather than the Unix-style structure above. So you might “mkdir” a new working directory in your “C:\TEMP” folder or wherever else you’d like to work. Or just make a folder wherever through Windows Explorer and “cd” there after the fact through the command line.
2. Download our source file, The King James Version of the Holy Bible
We’re going to use the text file provided by Project Gutenberg as our source. As in the earlier lessons, I’m going to use the curl command-line utility to retrieve the file, but you should feel free to download it to our working directory using your web browser, if you prefer.
curl -O http://www.gutenberg.org/dirs/etext90/kjv10.txt
3. Create our Python script in the text editor of your choice.
vim search.py
The line above, which again should work for Linux or Mac, will open a new file in vim, the command-line text editor that I prefer. You can follow along, or feel free to make your own file in the application you prefer. If you’re a newbie Windows user, Notepad should work great.
If you’re following along in vim, you’ll need to enter “insert mode” so you can start entering text. Do that by hitting:
i
4. Write the code!
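#!/usr/bin/env python
import re

# Open the Bible text for reading.
bible = open("kjv10.txt", "r")

# Compile the pattern once, above the loop, so it isn't rebuilt for every line.
regex = re.compile(r'\b\w{4}\b')

for line in bible:
    # findall() returns every match in the line as a list.
    four_letter_words = regex.findall(line)
    for word in four_letter_words:
        print(word)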
Our file opens by importing the re module, which will allow us to call upon Python’s regular expression library. We then open our Bible into a variable of the same name and, as in previous recipes, open a loop that will iterate through each line in the file.
The new stuff comes next. The first statement above our loop uses re’s compile method to store our regular expression pattern into a variable called “regex.” (As commenter Paddy suggested below, it’s a good idea to put it above the loop so it doesn’t have to be repeated on each iteration.) Remember, our goal is to match any four-letter words. There are three regex symbols I squished together to give it a hack. They are defined as follows. I’ve drawn the definitions from this Python reference, which can probably help you crack most nuts.
- \b - Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word.
- \w - Matches any alphanumeric character, plus the underscore.
- {m,n} - There must be at least m repetitions, and at most n. A single number, as in {4}, means exactly that many.
So when you piece them together like so, “\b\w{4}\b”, what you’re asking for is any stretch of four alphanumeric characters between two word boundaries. Make sense?
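If you want to poke at the pattern before turning it loose on the whole Bible, here’s a quick sketch you can run on its own. The sample strings are ones I made up for the test, not lines pulled from the source file, and the variable names are just placeholders.

```python
import re

pattern = re.compile(r'\b\w{4}\b')

# A made-up sample line, not a quotation from kjv10.txt.
sample = "And God said unto them, be glad"
hits = pattern.findall(sample)
print(hits)  # ['said', 'unto', 'them', 'glad']

# \w also accepts digits and underscores, so these count as "words" too.
odd_hits = pattern.findall("room 1611 ab_1 x")
print(odd_hits)  # ['room', '1611', 'ab_1']

# {4} is shorthand for {4,4}: exactly four repetitions.
same = re.findall(r'\b\w{4,4}\b', sample) == hits
print(same)  # True
```

Note that three- and five-letter words fall through, because the word boundaries on either side won’t let the pattern grab part of a longer word.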
Next, equipped with our regex, we create another variable called “four_letter_words.” In it we see our regex variable pressed against a re method we haven’t used before. In the previous lessons we used the kludgy match() function to make our hits. Here we’re using something more elegant. It’s findall(), which will return all the matches within our line as a list. And by connecting it to our pre-compiled “regex” variable, we’re setting that as the pattern it should look for.
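To see the difference between the two methods, here’s a small side-by-side sketch. The two sample lines are my own inventions for the demo. The thing to notice is that match() only checks the very start of the string and hands back a single match object (or None), while findall() scans the whole line and gives you a list.

```python
import re

regex = re.compile(r'\b\w{4}\b')

# Made-up sample lines for illustration.
starts_with_hit = "Thou shalt not kill"
hit_comes_later = "And the Lord said"

# match() only tests the beginning of the string.
first = regex.match(starts_with_hit)
print(first.group())                 # Thou
print(regex.match(hit_comes_later))  # None

# findall() returns every hit in the line as a list of strings.
print(regex.findall(starts_with_hit))  # ['Thou', 'kill']
print(regex.findall(hit_comes_later))  # ['Lord', 'said']
```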
We can expect plenty of lines with more than one match, so we’ll set up another loop to run through “four_letter_words” and print out all the hits. And then we’re done. Save and quit out of your script (ESC, SHIFT+ZZ in vim) and fire it up from the command-line…
python search.py
And, voila, there you have it. All the four-letter words in the KJV. Every f*cking one.
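If you’d rather wrap the same idea in a function you can reuse, here’s one way it might look. This is only a sketch: the function name four_letter_words() and the constant FOUR_LETTERS are names I’m making up here, and it uses a with block, which closes the file for you when the loop is done.

```python
import re

# Compile once, outside any loop, so the pattern is reused for every line.
FOUR_LETTERS = re.compile(r'\b\w{4}\b')


def four_letter_words(path):
    """Return every four-character word in the file at `path`, in order."""
    hits = []
    # The with block closes the file for us, even if something goes wrong.
    with open(path, "r") as text:
        for line in text:
            hits.extend(FOUR_LETTERS.findall(line))
    return hits
```

Assuming kjv10.txt is sitting in your working directory, calling four_letter_words("kjv10.txt") should hand back the same words the script prints, collected in one list.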
Unless I messed something up, of course. Per usual, if you spot a screw up, or I’m not being clear, just shoot me an email or drop a comment and we’ll sort it out. Hope this is helpful to somebody.
Tags: bible, Books, code, computer-assisted reporting, four-letter words, linux, loop, match, open, parse, print, python, read, recipe, regex, regular expressions, script, search, text, tutorial
the gun wrote:
This definitely demonstrates the magnitude of regular expressions. My usual bombardment of QUESTIONS (these may be pretty rudimentary):
1. If you don’t use \w in the regex, will it find anything, including whitespace chars? Noticed we didn’t use it in the shakespeare recipes.
2. “{m,n} - There must be at least m repetitions, and at most n.” By just putting the 4 inside {}, does this mean 4 is both m and n? In other words, same as {4,4}
3. still not sure I understand the line variable. Is that some standard keyword or our own created variable. I guess I can’t figure out if it knows something is a line since we didn’t specify that.
4. Why do we open the regex with an r ? According to that link you gave (great resource!), is it to avoid escape (\) characters?
5. We don’t need a closing/lingering comma?
Posted on 15-Apr-08 at 1:50 am
palewire wrote:
1. What happens if you don’t include anything between the word boundaries? Like “\b\b”? Hmm. I can’t think of a situation where you’d find two word boundaries in a row. But I’ll never say never when it comes to this stuff.
2. Yep, you’re right.
3. I don’t know all the details on this one, but my understanding is that “for line in” is a tuned up version of earlier Python methods for line-reading, like xreadlines(). As a low-level user, I really don’t know what’s happening behind the scenes. Here are the basic input and output docs.
4. You’re right, the purpose of prefixing a regular expression with an “r” is to avoid the “backslash plague” by using r to switch to “raw string notation.” That’s necessary here because we’re using the “\b” word boundary. As the guide I linked to says:
5. I’m not sure what you mean about a “closing/lingering comma.” That’s a convention I’m unfamiliar with, unless you mean the “print hits,” style comma I might have used in past recipes. By default, Python’s print command will follow whatever it spits out with a newline character “\n” that will kick you down to the next line. But if you stick a comma at the end of your print statements, it will stop doing that.
Posted on 15-Apr-08 at 8:52 am
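For anyone who wants to see the raw string business from question 4 in action, here’s a little sketch. The sample line is made up. Without the r prefix, Python reads “\b” as a single backspace character before the re module ever sees it, so the pattern quietly stops matching.

```python
import re

# '\b' in a plain string is one character (a backspace);
# r'\b' keeps the backslash and the b as two separate characters.
plain = '\b'
raw = r'\b'
print(len(plain), len(raw))  # 1 2

line = "And God said unto them"  # made-up sample text

# Plain string: the regex engine gets literal backspaces, so nothing matches.
without_r = re.findall('\b\\w{4}\b', line)
print(without_r)  # []

# Raw string: the regex engine gets \b and treats it as a word boundary.
with_r = re.findall(r'\b\w{4}\b', line)
print(with_r)  # ['said', 'unto', 'them']
```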
Paddy3118 wrote:
Hi,
re.compile should be moved up, above the first for loop so it is only compiled once.
- Paddy.
Posted on 15-Apr-08 at 11:19 am
palewire wrote:
Good call. I’ll make the switch.
For a file of this size (not very big), it probably doesn’t make a big difference in processing time, but it’s definitely the right move. And something that distinguishes a careful programmer from a hack reporter.
Posted on 15-Apr-08 at 11:27 am
Paddy3118 wrote:
You’re welcome
Posted on 15-Apr-08 at 9:45 pm
t gun wrote:
i have a comment. when’s the next python recipe?! I live for those things! Can i suggest one? how about one that replaces text, say you have a string of chars and/or whitespace and want to do some global replacing? i’m really tired of find and replace, i’m looking to improve my life through python. anyways, know you’re busy but these are really great
Posted on 16-Apr-08 at 5:31 pm
palewire wrote:
We could do that. I was also considering showing how to automate and timestamp a download from the Web.
Posted on 16-Apr-08 at 8:16 pm
tommyslash wrote:
that … would … be … AWESOME
Posted on 17-Apr-08 at 1:37 am
Ben wrote:
Hey,
I just thought of something, i’ve just followed through the recipes for python you have here about searching through text, and i’ve been downloading all the source text from gutenberg.org. I just went over to the site to see if there were any other source texts i could play around with, and i realised there was a zipped text version of all the files we have been using. Whenever it’s available I try to download compressed versions of files, especially on ‘free’ projects like gutenberg, as it saves them bandwidth and thus cost. I’m not sure how much traffic your site gets, but even if it’s just a little bit you could have cut down on bandwidth used by getting the zip files and unzipping them locally. It would also give you an excuse to show people how to batch unzip files from CLI.
Posted on 03-Jul-08 at 2:45 pm
Ben wrote:
Also, is there an efficient way to count the unique hits on words, checking each one against the previous words before printing it out? (This would involve storing them in a list and checking through them all, right?) I added a little counter to the word prints, and there are 176,189, so that would take a while, surely?
Ben
Posted on 03-Jul-08 at 2:57 pm
Ben wrote:
I extended the program to count unique hits only if anyone is interested, or could tell me a more efficient way. (maybe python has a binary tree module or something?)
[code]
#!/usr/bin/env python
import re
bible = open("kjv10.txt", "r")
regex = re.compile(r"\b\w{4}\b")
count = 0
words = []
for line in bible:
    fourletterwords = regex.findall(line) # returns all matches as a list
    for word in fourletterwords:
        for w in words:
            if w == word:
                break
        else:
            # for/else: runs only when the inner loop found no earlier copy
            words.append(word)
            count = count + 1
            print("%d %s" % (count, word))
print("%d unique words" % count)
[/code]
Posted on 03-Jul-08 at 3:14 pm
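On the question of a more efficient way: one option, offered only as a sketch, is to let a dictionary do the bookkeeping instead of rescanning a list for every word. The version below leans on Counter from the standard library’s collections module (Python 2.7 and later). The function name count_four_letter_words() and the sample lines are made up for the demo.

```python
import re
from collections import Counter

regex = re.compile(r'\b\w{4}\b')


def count_four_letter_words(lines):
    """Tally each four-character word across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # update() adds one to the tally for every word in the list.
        counts.update(regex.findall(line))
    return counts


# Made-up sample lines; an open file object works the same way.
sample = ["And God said unto them", "and they said unto him"]
counts = count_four_letter_words(sample)
print(len(counts), "unique words")  # 4 unique words
print(counts["said"])               # 2
```

To run it against the real thing, you could pass in the open file, as in count_four_letter_words(open("kjv10.txt")), assuming the text file is in your working directory. Like the list version, it treats “Lord” and “lord” as different words.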
palewire wrote:
Hey Ben, sorry it’s taken me so long to respond. Still looking for a unique words solution? I think some of my other Python posts might cover that.
As far as the zip thing goes, my blog is read lightly enough that it probably doesn’t matter much in the wide view.
Thanks for reading. I hope you found this stuff helpful.
Posted on 12-Jul-08 at 10:38 am
Ben wrote:
I have found these posts very helpful and informative, thanks a lot!
Posted on 06-Aug-08 at 5:42 pm