Terminal Recipe: How to download an entire Web site with wget.

From time to time, an occasion might arise when you’d like to download an entire Web site. At my old job, I liked to pull down government sites and go on fishing expeditions with Google Desktop Search for hot terms (ex. names of corporations and political appointees) or certain file types (ex. Excel, Access, CSV, et cetera). And the other day at my current job a situation cropped up where the newsroom wanted to download a bunch of files quickly, so it was handy to set a spider loose rather than sit there and try to download everything click by click.

One way to handle the job is to use a command-line utility called wget to crawl your target and mirror its files on your local computer.

If you’re working on a MacBook, the first thing you’ll need to do is install wget. I’d suggest you do that by downloading the latest version and compiling the binary from the source code. That might sound scary, but it’s just a fancy way of saying you’re going to install something from the command line instead of clicking a bunch of pretty boxes. Some other sites are going to push you toward pretty boxes and maybe even this big bloated thing called Fink, but, trust me on this one, it’s going to be a lot easier for you down the road if you learn how to install stuff yourself. And this is a simple enough example that it’s worth the shot.

So, if you’re with me, before you do anything else, download and install Apple’s Xcode, which includes the compilers you’ll need to build stuff on your own.

Then just open your terminal and let rip with the following…

mkdir src
cd src
curl -O http://ftp.gnu.org/gnu/wget/wget-latest.tar.gz
tar xvfz wget-latest.tar.gz
cd wget-1.11.3
./configure
make
sudo make install

You’ve just compiled your first program. We just made a new directory for storing source code, downloaded wget’s source, unzipped it, and then “made” the file with our Xcode compiler. Pretty easy, right? The only catch is that you’ll need your computer’s administrator password to run the “sudo” command that will install wget’s binary in your system folder.

In the future, that configure/make part is going to be the same for most of the source code you run into. When you encounter a new batch, just check the INSTALL or README docs where they’ll usually let you know if there’s anything else fancy you need to do.

Now test it out by hammering in the following…

wget

And there’s your new utility, waiting to run things down on your behalf. Check out how easy it is. Want to mirror a Web site? Here’s all you need to type…

wget -mk http://www.foo.com

Blammo, you’re off to the races, walking your target’s directory structure and saving all the files to your hard drive. The -m option puts wget in mirror mode and the -k option will convert all the hyperlinks so they’re suitable for local viewing. Then you just feed it the URL you’re after.

If you’re a Linux or Windows user, the command should be the same. If you’re a Windows user, you can try it with a release like this one. And if you run Linux, like I do, wget should already be installed and ready to roll in most distributions. No bothering with XCode or new downloads or any of that nonsense.

Python Recipe: Print a future date in the format you want.

Enough with all the talky talky, here’s a simple snippet I cooked up for a friend this morning to solve his problem of the moment: how to coax Python into printing out a future date (6 weeks in the future, to be exact) in the format he wants. Hope it’s useful to somebody. Let me know if I screwed anything up.

>>> import datetime
>>> now = datetime.datetime.now()
>>> print now
2008-04-21 10:19:35.832928
>>> from datetime import timedelta
>>> diff = timedelta(days=42)
>>> print diff
42 days, 0:00:00
>>> print now + diff
2008-06-02 10:19:35.832928
>>> future = now + diff
>>> future.strftime("%m/%d/%Y")
'06/02/2008'

Documentation on how you can customize strftime to print dates in the format you need can be found here. Scroll down to the middle-ish part of the page.
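If it helps to see a few other formats side by side, here’s a rough sketch. The starting date is hard-coded to the timestamp from the session above (minus the microseconds) so the output is repeatable, and the variable names are just ones I picked for illustration. It’s written with print() so it should run on a current Python 3 install as well as the older interpreter used above.

```python
import datetime

# Fixed starting point borrowed from the session above, so the output is repeatable.
start = datetime.datetime(2008, 4, 21, 10, 19, 35)

# timedelta also accepts weeks, so "6 weeks" doesn't need converting to days.
future = start + datetime.timedelta(weeks=6)

us_style = future.strftime("%m/%d/%Y")   # month/day/four-digit year
iso_style = future.strftime("%Y-%m-%d")  # year-month-day
clock_style = future.strftime("%H:%M")   # hours and minutes on a 24-hour clock

print(us_style)     # 06/02/2008
print(iso_style)    # 2008-06-02
print(clock_style)  # 10:19
```

Swap in datetime.datetime.now() for the fixed start date when you want the real thing.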

Python Recipe: Grab a page, scrape a table, download a file.

Here’s a change of pace. Our first few lessons focused on how you can use Python to goof with a bunch of local files. This time we’re going to try something different: using Python to go online and screw around with the Web.

Whenever I caucus with aspiring NICARians and other data hungry reporters, it’s not long before the topic of web scraping comes up. While automated text processing and database management may sound well and good, there’s something sexy about pulling down a fatty government database that catches people’s imagination and inspires them to take on the challenge of learning a new programming language. Or at least entertain the idea until they run into a road block.

A number of fellow travelers do a noble job instructing people on the basics during NICAR’s annual seminars. But scraping seems like such a sought-after skill that it feels like a good idea to throw up a basic walkthrough here, where beginners can cut and paste code and any feedback can be memorialized.

But before we get going, let me just say that I’m going to assume you read the first couple recipes and won’t be working too hard to explain the stuff covered there. And keep in mind that my keystrokes are coming right off my home computer, which runs Ubuntu Linux. I’ll try to provide Mac and Windows translations as we go, but I might muck a phoneme here and there. If anything is screwed up and doesn’t work on your end, just shoot me an email or drop a comment. We’ll iron it out.

Formalities aside, here’s the example task I’ve selected to achieve our mission.

  1. Install the necessary Python modules, mechanize and Beautiful Soup.
  2. Train our computer to visit Ben’s list of The Greatest Albums in the History of 2007.
  3. Parse the html and scrape out Ben’s rankings.
  4. Click through to Ben’s list of The Greatest Albums in the History of 2006 and repeat the scrape.
  5. Do it all over again, but this time download the cover art.

1. Download the mechanize and Beautiful Soup modules. Install them.

There are a dozen different methods for going about our task, so you shouldn’t assume the one I’m about to show you is the only or the best. It’s just one way to do it. And doing it this way requires a couple additions to your Python installation, which might seem a little daunting but should be doable unless IT has your computer on double secret probation.

A module is a collection of functions, definitions and statements contained in a separate file that you can import into your script. Examples native to Python used in our earlier scripts included “re”, “os” and “string.”

Out there on the Web, kind and ambitious programmers are constantly drafting, updating and publishing new modules to boil down complicated tasks into simpler forms. If it weren’t for these people, praise be upon them, I probably wouldn’t have a job.

If you want to take advantage of their contributions, you need to plug their creations into your local Python installation. It’s usually not that hard, even on Windows!

To accomplish today’s task, we’re going to rely on two third-party modules. The first is mechanize, a Python translation of the popular Perl module for calling up and walking through Web pages. The second is Beautiful Soup, a superlatively elegant means for parsing HTML and XML documents. Working hand-in-hand, they can accomplish most simple web scrapes.

If you’re working in Linux or Mac OS X, this is going to be a piece of cake. All you need is to use Python’s auto-installer Easy Install to issue the following commands:

sudo easy_install mechanize
sudo easy_install BeautifulSoup

And now you can check if the modules are available for use by cracking open your python interpreter…

python

…and attempting to import the new modules…

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

If the interpreter accepts the commands and kicks down the next line without an error, you know you’re okay. If it throws an error, you know something is off.
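If you’d rather have a script do the checking for you, something like the sketch below might work. The is_importable function is a name I made up for this example, not part of either module. It leans on the standard library’s importlib and treats an ImportError as “not installed.”

```python
import importlib


def is_importable(module_name):
    """Hypothetical helper: True if module_name can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Check the two modules this recipe depends on.
for name in ("mechanize", "BeautifulSoup"):
    if is_importable(name):
        print("%s is ready to use" % name)
    else:
        print("%s is missing; try installing it again" % name)
```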

I don’t have a lot of Python experience working in Windows, but the method for adding modules that I’ve had success with is simply downloading the .py files to my desktop and dumping them in the “lib” folder of my Python installation. If, like me, you use Activestate’s ActivePython distribution for Windows, it should be easily found at C:/Python25/lib/. And when you browse around the directory, you should already see os.py, re.py and other modules we’re already familiar with. So just visit the mechanize and Beautiful Soup homepages and retrieve the latest download. Dump the .py files in your lib folder and now you should be able to fire up your python interpreter just the same as above and introduce yourself to our new friends.

With that out of the way, we now have all the tools we need to grip and rip. So let’s do it!

2. Open the command line, create a working directory, move there.

We’re going to start the same way we did in the first three lessons, creating a working folder for all our files and moving in with our command line.

cd Documents/
mkdir py-scrape-and-download
cd py-scrape-and-download/

The commands should work just as easily in Mac as in Linux. If you’re working in Windows, you’ll be on the “C:/” file structure, rather than the Unix-style structure above. So you might “mkdir” a new working directory in your “C:/TEMP” folder or wherever else you’d like to work. Or just make a folder wherever through Windows Explorer and “cd” there after the fact through the command line.

3. Create our python script in the text editor of your choice.

vim py-scrape-and-download.py

The line above, which again should work for Linux or Mac, will open a new file in vim, the command-line text editor that I prefer. You can follow along, or feel free to make your own file in the application you prefer. If you’re a newbie Windows user, Notepad should work great.

If you’re following along in vim, you’ll need to enter “insert mode” so you can start entering text. Do that by hitting:

i

4. Write the code!

#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
 
mech = Browser()
 
url = "http://www.palewire.com/scrape/albums/2007.html"
page = mech.open(url)
 
html = page.read()
soup = BeautifulSoup(html)
 
print soup.prettify()

Our first snippet of code, seen above, shows a basic introduction to each of our new modules.

After they’ve been imported in lines two and three, we put mechanize’s browser to use right away, storing it in a variable I’ve decided to call mech, but which you could call anything you wanted (ex. browser, br, ie, whatever). We then use its open() method to grab the location of our first scrape target, my favorite albums of 2007, and store that in another variable we’ll call page.

That’s enough to go out on the web and grab the page. Now we need to tell Python what to do with it. The read() method on the page that mechanize hands back will return all of the HTML in the page, which we store, simply, in a variable called html and then pass to the BeautifulSoup constructor so it can be prepared for processing.

The reason we need to pass the page to Beautiful Soup is that there is a ton of HTML code in the page we don’t want. Our ultimate goal isn’t to print out the complete page source. We don’t want all the junky td and img and body tags. We want to free the data from the HTML by printing it out in a machine readable format we can repurpose for our own needs. In the next step we’ll ask Beautiful Soup to step through the code and pull out only the good parts, but here in the first iteration we’ll pause with just printing out the complete page code using a fun Beautiful Soup method called prettify(). It will spit out the HTML in a well-formed format. To take a look, save and quit out of your script (ESC, SHIFT+ZZ in vim) and fire it up from the command-line…

python py-scrape-and-download.py

And you should see something like….

<html>
 <head>
  <title>
   According to Ben...
  </title>
 </head>
 <body>
  <h2>
   The 10 Greatest Albums in the History of 2007
  </h2>
  <table padding="1" width="60%" border="1" style="text-align:center;">
   <tr style="font-weight:bold">
    <td>
     Rank
    </td>
    <td>
     Artist
    </td>
...

…which means that you’ve successfully retrieved and printed out our first target. Now let’s move on to scraping the data out from the HTML.

#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
 
mech = Browser()
 
url = "http://www.palewire.com/scrape/albums/2007.html"
page = mech.open(url)
 
html = page.read()
soup = BeautifulSoup(html)
 
table = soup.find("table", border=1)
 
for row in table.findAll('tr')[1:]:
    col = row.findAll('td')
 
    rank = col[0].string
    artist = col[1].string
    album = col[2].string
    cover_link = col[3].img['src']
 
    record = (rank, artist, album, cover_link)
    print "|".join(record)

The second version of our script, seen above, removes the prettify() command that concluded version one and replaces it with the Beautiful Soup code necessary to parse the rankings from the page.

When you’re scraping a real target out there on the wild Web, the mechanize part of the script is likely to remain pretty much the same, but the Beautiful Soup portion that pulls the data from the page is going to have to change each time, tailored to work with however your target HTML is structured.

So your job as the scraper is to inspect your target table and figure out how you can get Beautiful Soup to home in on the elements you want to harvest. I like to do this using the Firefox plugin Firebug, which allows you to right-click and, by choosing the “Inspect Element” option, have the browser pull up and highlight the HTML underlying any portion of the page. But all that’s really necessary is that you take a look at the page’s source code.

Since most HTML pages you’ll be targeting, including my sample site, will include more than one set of table tags, you often have to find something unique about the table you’re after. This is necessary so that Beautiful Soup knows how to zoom in on that section of the code you’re after and ignore all the flotsam around it.

If you look closely at this particular page, you’ll note that while both table tags have the same width value, an easy way to distinguish them is that they have different border values…

<table width="60%" border="1" style="text-align: center;" padding="1">
...
<table width="60%" border="0">

…and the one we want to harvest has a border value of one. That’s why the first Beautiful Soup command seen in the snippet above uses the find() method to capture the table with that characteristic.

table = soup.find("table", border=1)

Once that’s been accomplished, the new table variable is immediately put to use in a loop that is designed to step through each row and pull out the data we want.

for row in table.findAll('tr')[1:]:

It uses Beautiful Soup’s findAll() method to put all of the tr tags (which is the HTML equivalent of a row) into a list. The [1:] modifier at the end instructs the loop to skip the first item, which, from looking at the page, we can tell is an unneeded header line.
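If the [1:] bit looks mysterious, here’s a tiny sketch of the same slice on a plain Python list. The rows below are stand-ins I typed in by hand, not real Beautiful Soup tags, but the slicing works the same way on the list that findAll() returns.

```python
# Stand-in for the list of rows; the first item plays the part of the header line.
rows = [
    "Rank|Artist|Album",
    "10|LCD Soundsystem|Sound of Silver",
    "9|Ulrich Schnauss|Goodbye",
]

# [1:] means "start at the second item and keep going to the end".
data_rows = rows[1:]

for row in data_rows:
    print(row)
# 10|LCD Soundsystem|Sound of Silver
# 9|Ulrich Schnauss|Goodbye
```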

Then, after the loop is set up on the tr tags, we set up another list that will grab all of the td tags (the HTML equivalent of a column) from each row.

    col = row.findAll('td')

Now pulling out the data is simply a matter of figuring out which order we can expect the data to appear in each row and pulling the corresponding values from the list. Since we expect rank, artist, album and cover to appear in each row from left to right, the first element of the col variable (col[0]) can always be expected to be the rank and the last element (col[3]) can always be expected to be the cover. So we create a new set of values to retrieve each, with some Beautiful Soup specific objects tacked on the end to grab only the bits we want.

    rank = col[0].string
    artist = col[1].string
    album = col[2].string
    cover_link = col[3].img['src']

The “.string” attribute will return the text within the target tag (similar to JavaScript’s innerHTML property). But in the case of something like the cover art, which is an image tag, not a string value, we can step down to the next tag nested within the td column (img) and access its source attribute by tacking on ['src']. This would work just the same for a hyperlink (.a['href']) or any other attribute. And if you’ve got multiple layers of nested tags, you can simply step down through them with a chained set of attributes. For example, “b.a.string” would retrieve the string within a link within a bold tag. There’s great documentation on these and other Beautiful Soup tricks here.

After we’ve wrangled out the data we want from the HTML, the only challenge remaining is to print it out. I accomplish that above by loading the column values into a tuple called record and printing it out using a trick that joins them with a pipe delimiter via the .join method.

    record = (rank, artist, album, cover_link)
    print "|".join(record)

Phew. That’s a lot of explaining. I hope it made sense. I’m happy to clarify or elaborate on any of it. But if you save the snippet above and run it, you should get a simple printout of the data that looks something like this:

10|LCD Soundsystem|Sound of Silver|http://www.palewire.com/scrape/albums/covers/sound%20of%20silver.jpg
9|Ulrich Schnauss|Goodbye|http://www.palewire.com/scrape/albums/covers/goodbye.jpg
8|The Clientele|God Save The Clientele|http://www.palewire.com/scrape/albums/covers/god%20save%20the%20clientele.jpg
7|The Modernist|Collectors Series Pt. 1: Popular Songs|http://www.palewire.com/scrape/albums/covers/collectors%20series.jpg
6|Bebel Gilberto|Momento|http://www.palewire.com/scrape/albums/covers/memento.jpg
5|Various Artists|Jay Deelicious: 1995-1998|http://www.palewire.com/scrape/albums/covers/jaydeelicious.jpg
4|Lindstrom and Prins Thomas|BBC Essential Mix|http://www.palewire.com/scrape/albums/covers/lindstrom%20prins%20thomas.jpg
3|Go Home Productions|This Was Pop|http://www.palewire.com/scrape/albums/covers/this%20was%20pop.jpg
2|Apparat|Walls|http://www.palewire.com/scrape/albums/covers/walls.jpg
1|Caribou|Andorra|http://www.palewire.com/scrape/albums/covers/andorra.jpg

See the difference?! Pretty cool, right?

But, really, you could have done that with copy and paste. Or, if you’re slick, maybe even Excel’s Web Query.

As with our previous recipes, the real efficiencies aren’t found until you can train your computer to repeat a task over a large body of data. One of the great things mechanize can do is step through pages one by one and help Beautiful Soup suck the data out of each. This is very helpful when you’re trying to scrape the search results from online web queries, which are commonly displayed in paginated sets that run into hundreds and hundreds of pages.

Today’s example is only two pages in length, though the principles we learn here can later be applied to broader data sets. But before we can run, we have to learn how to walk. So, in that spirit, here’s a simple expansion of our script above that will click on the “Next” link at the bottom of our example page and repeat the scrape on my 2006 list.

#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
 
def extract(soup, year):
 
    table = soup.find("table", border=1)
 
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
 
        rank = col[0].string
        artist = col[1].string
        album = col[2].string
        cover_link = col[3].img['src']
 
        record = (str(year), rank, artist, album, cover_link)
        print "|".join(record)
 
 
mech = Browser()
 
url = "http://www.palewire.com/scrape/albums/2007.html"
 
page1 = mech.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1, 2007)
 
page2 = mech.follow_link(text_regex="Next")
html2 = page2.read()
soup2 = BeautifulSoup(html2)
extract(soup2, 2006)

Note that our Beautiful Soup snippet remains the same as above, but we’ve moved it to the top of the script and placed it in a Python function called extract. Structured this way, the extract function is reusable on any number of pages as long as the HTML you’re looking to parse is formatted the same way.

The function accepts two parameters, soup and year, which are passed in the lower part of our script after Beautiful Soup captures each page’s contents. The first snippet …

page1 = mech.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1, 2007)

…essentially does the same thing as our early versions: visits the URL for my 2007 list and parses out the table. The only change is that the soup variable is now being passed to the extract function along with the year, so that it can be printed alongside the data columns in our output by adding it to the “record” tuple inside the function here:

        record = (str(year), rank, artist, album, cover_link)

I figured it’s a nice add since then our eventual results will contain a field that discerns the 2007 list from the 2006 list.
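In case you’re wondering why the year gets wrapped in str(), here’s a quick sketch. The join method only accepts strings, and we passed the year in as a number. The values below are typed in by hand for illustration, and join_failed is just a flag I made up to show what happens.

```python
year = 2007
rank, artist, album = "1", "Caribou", "Andorra"

# join() only accepts strings, so a bare integer in the tuple raises TypeError.
try:
    "|".join((year, rank, artist, album))
    join_failed = False
except TypeError:
    join_failed = True

# Converting the year with str() first makes the join work.
line = "|".join((str(year), rank, artist, album))

print(join_failed)  # True
print(line)         # 2007|1|Caribou|Andorra
```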

Now check out how easy it is to get mechanize to step through to the next page.

page2 = mech.follow_link(text_regex="Next")
html2 = page2.read()
soup2 = BeautifulSoup(html2)
extract(soup2, 2006)

All it takes is feeding the link’s string value to mechanize’s follow_link() method and, boom, you’re walking over to the next page. Treat what you get back the same as we did our first “page” and, bam, you’ve done it. Save the script, run it, and you should see something more like this:

2007|10|LCD Soundsystem|Sound of Silver|http://www.palewire.com/scrape/albums/covers/sound%20of%20silver.jpg
2007|9|Ulrich Schnauss|Goodbye|http://www.palewire.com/scrape/albums/covers/goodbye.jpg
2007|8|The Clientele|God Save The Clientele|http://www.palewire.com/scrape/albums/covers/god%20save%20the%20clientele.jpg
2007|7|The Modernist|Collectors Series Pt. 1: Popular Songs|http://www.palewire.com/scrape/albums/covers/collectors%20series.jpg
2007|6|Bebel Gilberto|Momento|http://www.palewire.com/scrape/albums/covers/memento.jpg
2007|5|Various Artists|Jay Deelicious: 1995-1998|http://www.palewire.com/scrape/albums/covers/jaydeelicious.jpg
2007|4|Lindstrom and Prins Thomas|BBC Essential Mix|http://www.palewire.com/scrape/albums/covers/lindstrom%20prins%20thomas.jpg
2007|3|Go Home Productions|This Was Pop|http://www.palewire.com/scrape/albums/covers/this%20was%20pop.jpg
2007|2|Apparat|Walls|http://www.palewire.com/scrape/albums/covers/walls.jpg
2007|1|Caribou|Andorra|http://www.palewire.com/scrape/albums/covers/andorra.jpg
2006|10|Lily Allen|Alright, Still|http://www.palewire.com/scrape/albums/covers/alright%20still.jpg
2006|9|Nouvelle Vague|Nouvelle Vague|http://www.palewire.com/scrape/albums/covers/nouvelle%20vague.jpg
2006|8|Bookashade|Movements|http://www.palewire.com/scrape/albums/covers/movements.jpg
2006|7|Charlotte Gainsbourg|5:55|http://www.palewire.com/scrape/albums/covers/555.jpg
2006|6|The Drive-By Truckers|The Blessing and the Curse|http://www.palewire.com/scrape/albums/covers/blessing%20and%20curse.jpg
2006|5|Basement Jaxx|Crazy Itch Radio|http://www.palewire.com/scrape/albums/covers/crazy%20itch%20radio.jpg
2006|4|Love is All|Nine Times The Same Song|http://www.palewire.com/scrape/albums/covers/nine%20times.jpg
2006|3|Ewan Pearson|Sci.Fi.Hi.Fi_01|http://www.palewire.com/scrape/albums/covers/sci%20fi%20hi%20fi.jpg
2006|2|Neko Case|Fox Confessor Brings The Flood|http://www.palewire.com/scrape/albums/covers/fox%20confessor.jpg
2006|1|Ellen Allien & Apparat|Orchestra of Bubbles|http://www.palewire.com/scrape/albums/covers/orchestra%20of%20bubbles.jpg

Now all that’s left on our checklist is to figure out a way to download the cover art in addition to recording the urls. When we’re interested in just snatching a simple file off the web, I like to use the urlretrieve() function found in Python’s urllib module. All you have to do is add it to your import line, as below, and tell it where to save the files. I just stuff it in the extract loop so it pulls down the file immediately after scraping its row in the table. Check it out.

#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib, os
 
def extract(soup, year):
 
    table = soup.find("table", border=1)
 
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
 
        rank = col[0].string
        artist = col[1].string
        album = col[2].string
        cover_link = col[3].img['src']
 
        record = (str(year), rank, artist, album, cover_link)
        print >> outfile, "|".join(record)
 
        save_as = os.path.join("./", album + ".jpg")
        urllib.urlretrieve(cover_link, save_as)
        print "Downloaded %s album cover" % album
 
 
outfile = open("albums.txt", "w")
 
mech = Browser()
 
url = "http://www.palewire.com/scrape/albums/2007.html"
 
page1 = mech.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1, 2007)
 
page2 = mech.follow_link(text_regex="Next")
html2 = page2.read()
soup2 = BeautifulSoup(html2)
extract(soup2, 2006)
 
outfile.close()

While I was at it, I also added in an outfile where the scrape results are saved in a text file, just like we did in our previous recipes. Run this version and then check out your working directory, where you should see all the images as well as the new outfile.
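One thing to watch: a few of these album titles (“5:55”, for instance) contain colons, which Windows won’t accept in a file name. If that bites you, a small cleanup step along these lines might help. To be clear, safe_filename is a hypothetical helper I’m sketching here, not part of the script above, and it’s blunt: anything that isn’t a plain letter, digit, dot, dash or underscore gets swapped for an underscore, accented letters included.

```python
import re


def safe_filename(title, extension=".jpg"):
    """Hypothetical helper: turn an album title into a tame file name."""
    # Replace each run of characters outside the allowed set with one underscore.
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', title)
    # Trim any underscores or dots left dangling at either end.
    cleaned = cleaned.strip('_.')
    return cleaned + extension


print(safe_filename("5:55"))
# 5_55.jpg
print(safe_filename("Collectors Series Pt. 1: Popular Songs"))
# Collectors_Series_Pt._1_Popular_Songs.jpg
```

To use it, you’d pass safe_filename(album) to os.path.join in place of album + ".jpg".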

Voila. I think we’re done. If this is useful for people, next time we can cover how you leverage these basic tools against search forms and larger result sets. Per usual, if you spot a screw up, or I’m not being clear, just shoot me an email or drop a comment and we’ll sort it out. Hope this is helpful to somebody.

And, as a postscript, since we’re kind of on a roll here, I thought it might be fun to cook up an LAT version of the Python Cookbook cover, in the classic O’Reilly style. What do you think? I couldn’t quite find the right font.

The Reporter's Python Cookbook

Python Recipe: Read a file, search for a pattern, print your matches.

Our first two recipes focused primarily on how to open one or more files and loop through them line by line. While we paid a little attention to how we could search for patterns using regular expressions, we didn’t try to do a whole lot with what we caught. Hell, we didn’t even try very hard to write a good regex.

But when you start to get serious about searching for patterns in text, one of the obvious goals is to single out and collect your matches. Maybe you want to pull all the phone numbers out of big blobs of text. Or email addresses. Or anything enclosed in quotation marks. Whatever.

Here’s one way to try it.

But before we get going, let me just say that I’m going to assume you read the first couple recipes and won’t be working too hard to explain the stuff covered there. And keep in mind that my keystrokes are coming right off my home computer, which runs Ubuntu Linux. I’ll try to provide Mac and Windows translations as we go, but I might muck a phoneme here and there. If anything is screwed up and doesn’t work on your end, just shoot me an email or drop a comment. We’ll iron it out.

Formalities aside, here’s the example task I’ve selected to achieve our mission.

  1. Download the King James Version of the Holy Bible.
  2. Read through each line of text.
  3. Capture each four-letter word.
  4. Print them out.

Let’s do it.

1. Open the command line, create a working directory, move there.
We’re going to start the same way we did in the first two lessons, creating a working folder for all our files and moving in with our command line.

cd Documents/
mkdir py-search-and-capture
cd py-search-and-capture/

The commands should work just as easily in Mac as in Linux. If you’re working in Windows, you’ll be on the “C:/” file structure, rather than the Unix-style structure above. So you might “mkdir” a new working directory in your “C:/TEMP” folder or wherever else you’d like to work. Or just make a folder wherever through Windows Explorer and “cd” there after the fact through the command line.

2. Download our source file, The King James Version of the Holy Bible

We’re going to use the text file provided by Project Gutenberg as our source. As in the earlier lessons, I’m going to use the curl command-line utility to retrieve the file, but you should feel free to download it to our working directory using your web browser, if you prefer.

curl -O http://www.gutenberg.org/dirs/etext90/kjv10.txt

3. Create our python script in the text editor of your choice.

vim search.py

The line above, which again should work for Linux or Mac, will open a new file in vim, the command-line text editor that I prefer. You can follow along, or feel free to make your own file in the application you prefer. If you’re a newbie Windows user, Notepad should work great.

If you’re following along in vim, you’ll need to enter “insert mode” so you can start entering text. Do that by hitting:

i

4. Write the code!

#!/usr/bin/env python
import re
 
bible = open("kjv10.txt", "r")
 
regex = re.compile(r'\b\w{4}\b')
 
for line in bible:
    four_letter_words = regex.findall(line)
    for word in four_letter_words:
        print word

Our file opens by importing the re module, which will allow us to call upon Python’s regular expression library. We then open our Bible into a variable of the same name and, as in previous recipes, open a loop that will iterate through each line in the file.

The new stuff comes next. The first statement above our loop uses re’s compile method to store our regular expression pattern into a variable called “regex.” (As commenter Paddy suggested below, it’s a good idea to put it above the loop so it doesn’t have to be repeated on each iteration.) Remember, our goal is to match any four-letter words. There are three regex symbols I squished together to give it a hack. They are defined as follows. I’ve drawn the definitions from this Python reference, which can probably help you crack most nuts.

  • \b - Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word.
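  • \w - Matches any alphanumeric character (letters and digits), plus the underscore.
  • {m,n} - There must be at least m repetitions, and at most n. With a single number, as in the {4} used above, there must be exactly that many.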

So when you piece them together like so, “\b\w{4}\b”, what you’re asking for is any stretch of four alphanumeric characters between two word boundaries. Make sense?
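Here’s a quick sketch of the pattern at work on a couple of sample lines, so you can see what it does and doesn’t catch. The sentences are ones I typed in for illustration, not lines pulled from the file.

```python
import re

regex = re.compile(r'\b\w{4}\b')

# Only the words that are exactly four characters long come back.
verse_hits = regex.findall("Thou shalt love thy neighbour as thyself")
print(verse_hits)   # ['Thou', 'love']

# \w also matches digits, so a four-digit number counts as a "word" too.
number_hits = regex.findall("Printed in 1611 by the king")
print(number_hits)  # ['1611', 'king']
```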

Next, equipped with our regex, we create another variable called “four_letter_words.” In it we see our regex variable pressed against a re method we haven’t used before. In the previous lessons we used the kludgy match() function to make our hits. Here we’re using something more elegant. It’s findall(), which will return all the matches within our line as a list. And by connecting it to our pre-compiled “regex” variable, we’re setting that as the pattern it should look for.
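To show the general difference between the two methods, here’s a small sketch on a made-up line. It isn’t meant to reproduce how the earlier recipes used match(); it just shows that match() only tests the very start of the string, while findall() sweeps the whole thing.

```python
import re

regex = re.compile(r'\b\w{4}\b')

line = "shalt love thy neighbour, Thou"

# match() only looks at the beginning of the line. "shalt" has five letters,
# so there is no four-letter word there and we get None back.
first_try = regex.match(line)
print(first_try)  # None

# findall() scans the whole line and hands back every hit as a list.
all_hits = regex.findall(line)
print(all_hits)   # ['love', 'Thou']
```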

We can expect plenty of lines with more than one match, so we’ll set up another loop to run through “four_letter_words” and print out all the hits. And then we’re done. Save and quit out of your script (ESC, SHIFT+ZZ in vim) and fire it up from the command-line…

python search.py

And, voila, there you have it. All the four-letter words in the KJV. Every f*cking one.

Unless I messed something up, of course. Per usual, if you spot a screw up, or I’m not being clear, just shoot me an email or drop a comment and we’ll sort it out. Hope this is helpful to somebody.

Python Recipe: Open multiple files, search for matches, count your hits.

I got some feedback from our beginners on the Python recipe I put up yesterday. They had a couple good questions about ways they can branch off, which I think we can cover pretty quick in another post.

To recap, Saturday’s script opened a single file (Shakespeare’s sonnets), searched the text line by line for a search term (“love”) using a basic regular expression, and then closed by printing the hits to a new text file. Today’s recipe will do all that, and a couple other things that might be helpful.

For reasons discussed in my previous post, I think munching through text with Python is going to be most useful for a reporter when she can leverage its power against large bodies of text. Our first example only operated on a single file. Out there in the real world, with deadlines, diets and kids to pick up at soccer practice, why should we invest the time learning to write a computer script to process a single file when we might be able to hack out the job with CTRL-F and just be done with it?

I feel that.

So, let’s take the next step. Let’s learn how to crack open a whole directory full of files and slam each one through our wood chipper.

But before we get going, let me just say that I’m going to assume you read yesterday’s recipe and won’t be working too hard to explain the stuff covered there. And keep in mind that my keystrokes are coming right off my home computer, which runs Ubuntu Linux. I’ll try to provide Mac and Windows translations as we go, but I might muck a phoneme here and there. If anything is screwed up and doesn’t work on your end, just shoot me an email or drop a comment. We’ll iron it out.

Formalities aside, here’s the example task I’ve selected to achieve our mission.

  1. Download the works of Friedrich Nietzsche.
  2. Train our computer to open the books one by one.
  3. Read through the text of each.
  4. Find all the lines that contain the German word for hate (hasse, hasst, hassen).
  5. Print out the hits.
  6. Count up the totals for each book and figure out which one is the hatenest (das meisten hassten!).
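For a rough preview of where steps 2 through 6 are headed, here’s one way the whole job might be sketched in Python 3. It’s an illustration under a few assumptions of mine, not the script we’re about to build: the function name count_hits is hypothetical, the books are assumed to be .txt files sitting in one folder, and latin-1 is a guess at their encoding.

```python
import os
import re

# The three forms from step 4, matched as whole words in any case.
HATE = re.compile(r"\b(?:hasse|hasst|hassen)\b", re.IGNORECASE)


def count_hits(folder):
    """Return {filename: number of matching lines} for .txt files in folder."""
    totals = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue  # skip anything that isn't a book
        hits = 0
        # Assumption: latin-1, which never chokes on an unknown byte.
        with open(os.path.join(folder, name), encoding="latin-1") as book:
            for line in book:
                if HATE.search(line):
                    print(name + ": " + line.strip())
                    hits += 1
        totals[name] = hits
    return totals


# Only runs if a "nietzsche" folder exists in the current directory.
if os.path.isdir("nietzsche"):
    totals = count_hits("nietzsche")
    if totals:
        # The book with the most matching lines.
        print(max(totals, key=totals.get))
```

Counting matching lines, rather than every individual match, is a design choice here; a line with two hits still counts once.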

Sound good? Let’s do it.

1. Open the command line, create a working directory, move there.

cd $HOME/Documents
mkdir py-search-multiple-files
cd py-search-multiple-files/
mkdir nietzsche

We’re going to start the same way we did yesterday, creating a working folder for all our files and moving in with our command line. The only difference this time is that we’re making an additional subdirectory to hold the source files we’ll be searching.

The commands should work just as easily in Mac as in Linux. If you’re working in Windows, you’ll be on the “C:/” file structure, rather than the Unix-style structure above. So you might “mkdir” a new working directory in your “C:/TEMP” folder or wherever else you’d like to work. Or just make a folder wherever through Windows Explorer and “cd” there after the fact through the command line.
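If you’d rather sidestep the Mac, Linux and Windows differences altogether, the same folders can be made from Python itself. This is only a sketch, written for Python 3, and make_workspace is a name I made up for the occasion.

```python
from pathlib import Path


def make_workspace(base):
    """Create py-search-multiple-files/nietzsche under base; return the working folder."""
    work = Path(base) / "py-search-multiple-files"
    source = work / "nietzsche"
    # parents=True builds any missing folders along the way;
    # exist_ok=True keeps a second run from raising an error.
    source.mkdir(parents=True, exist_ok=True)
    return work
```

Calling make_workspace(Path.home() / "Documents") would mirror the commands above. It won’t “cd” you anywhere, though, so you’d still need to move your terminal into the new folder yourself.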

2. Download our source files, the works of Friedrich Nietzsche.

If you visit Project Gutenberg, you can find a variety of Nietzsche’s work available for download. For our purposes, we’re going to take all of the books available there printed in the author’s native tongue, German. We could point and click our way through the process (visiting each book’s profile page and downloading its text to our new nietzsche folder), but if your aim is to become a big-time computer nerd, you might be interested in a command-line trick that can pull them all down with a single line of code.

Yesterday we used the curl utility to pull down our Shakespeare file. If you pulled that off, I’m sure you can easily imagine how it could be replicated with each of today’s files, provided that you know the right URLs to hit. And I’m guessing it might look something like this.

curl -O http://www.gutenberg.org/dirs/etext05/7zara10.txt
curl -O http://www.gutenberg.org/dirs/etext05/7ecce10.txt
curl -O http://www.gutenberg.org/dirs/etext05/7gbrt10.txt
curl -O http://www.gutenberg.org/dirs/etext05/7gtzn10.txt
curl -O http://www.gutenberg.org/dirs/etext05/7jnst10.txt
curl -O http://www.gutenberg.org/dirs/etext05/7msch10.txt

But, man, that hardly seems easier than clicking around, does it? Thankfully, one of the great things you pick up as you learn your way around the command line is that there’s almost always a way to trim down a repetitive task into an elegant, simple string of code. Here’s how those six separate curls can be combined.