One feed, straight steez.

I’ve got nothing but love for my Wizz RSS reader. But sometimes it’s still not enough to keep up. The more feeds I add, the clunkier it gets to click my way down through the list. And I find myself lazing out and only reading about half as much as I should.

So, in an effort to help myself better keep up on what’s going on, I’ve put together news.palewire.com, a feed aggregator that blends together the mix of pundits, blogs, delicious feeds and gossip sheets that I dig on. The topics tend toward newspapers (plight of), data analysis and news media geekery. It’s all brought together using Sam Ruby’s excellent, Python-based Planet Venus application, which I previously used to assemble Shawington.com. The one cool add this time around is Ruby’s “meme” plugin, which scans the feed pool for common links and ranks the past week’s most popular posts.

If it’s something you like, feel free to tune in. The site is mostly intended for my personal use, but it would be great if other people found it useful. So, if there are feeds you’d like to see thrown in, or changes that would help make your life easier, just let me know and I’ll try to do it up. I’m sure I left out a lot of great stuff, and I’m always out to improve my media diet.

Python Recipe: Read a file, search for a pattern, print your matches.

Our first two recipes focused primarily on how to open one or more files and loop through them line by line. While we paid a little attention to how we could search for patterns using regular expressions, we didn’t try to do a whole lot with what we caught. Hell, we didn’t even try very hard to write a good regex.

But when you start to get serious about searching for patterns in text, one of the obvious goals is to single out and collect your matches. Maybe you want to pull all the phone numbers out of big blobs of text. Or email addresses. Or anything enclosed in quotation marks. Whatever.

Here’s one way to try it.

But before we get going, let me just say that I’m going to assume you read the first couple recipes and won’t be working too hard to explain the stuff covered there. And keep in mind that my keystrokes are coming right off my home computer, which runs Ubuntu Linux. I’ll try to provide Mac and Windows translations as we go, but I might muck a phoneme here and there. If anything is screwed up and doesn’t work on your end, just shoot me an email or drop a comment. We’ll iron it out.

Formalities aside, here’s the example task I’ve selected to achieve our mission.

  1. Download the King James Version of the Holy Bible.
  2. Read through each line of text.
  3. Capture each four-letter word.
  4. Print them out.

Let’s do it.

1. Open the command line, create a working directory, move there.
We’re going to start the same way we did in the first two lessons, creating a working folder for all our files and moving in with our command line.

cd Documents/
mkdir py-search-and-capture
cd py-search-and-capture/

The commands should work just as easily in Mac as in Linux. If you’re working in Windows, you’ll be on the “C:/” file structure, rather than the Unix-style structure above. So you might “mkdir” a new working directory in your “C:/TEMP” folder or wherever else you’d like to work. Or just make a folder wherever through Windows Explorer and “cd” there after the fact through the command line.

2. Download our source file, The King James Version of the Holy Bible

We’re going to use the text file provided by Project Gutenberg as our source. As in the earlier lessons, I’m going to use the curl command-line utility to retrieve the file, but you should feel free to download it to our working directory using your web browser, if you prefer.

curl -O http://www.gutenberg.org/dirs/etext90/kjv10.txt

3. Create our python script in the text editor of your choice.

vim search.py

The line above, which again should work for Linux or Mac, will open a new file in vim, the command-line text editor that I prefer. You can follow along, or feel free to make your own file in the application you prefer. If you’re a newbie Windows user, Notepad should work great.

If you’re following along in vim, you’ll need to enter “insert mode” so you can start entering text. Do that by hitting:

i

4. Write the code!

#!/usr/bin/env python
import re
 
bible = open("kjv10.txt", "r")
 
regex = re.compile(r'\b\w{4}\b')
 
for line in bible:
    four_letter_words = regex.findall(line)
    for word in four_letter_words:
        print word

Our file opens by importing the re module, which will allow us to call upon Python’s regular expression library. We then open our Bible into a variable of the same name and, as in previous recipes, open a loop that will iterate through each line in the file.

The new stuff comes next. The first statement above our loop uses re’s compile method to store our regular expression pattern into a variable called “regex.” (As commenter Paddy suggested below, it’s a good idea to put it above the loop so it doesn’t have to be repeated on each iteration.) Remember, our goal is to match any four-letter words. There are three regex symbols I squished together to give it a hack. They are defined as follows. I’ve drawn the definitions from this Python reference, which can probably help you crack most nuts.

  • \b - Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word.
  • \w - Matches any alphanumeric character (or an underscore).
  • {m} - There must be exactly m repetitions. (The related {m,n} form asks for at least m repetitions, and at most n.)

So when you piece them together like so, “\b\w{4}\b”, what you’re asking for is any stretch of four alphanumeric characters between two word boundaries. Make sense?
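If it helps to see the pattern work on something smaller than the whole Bible, here’s a minimal sketch. It’s written for Python 3 (so print is a function, unlike in the recipe), and the two sample strings are stand-ins I picked for illustration rather than lines read from kjv10.txt.

```python
import re

# Same pattern as the recipe: exactly four word characters
# sitting between two word boundaries.
four_letter = re.compile(r"\b\w{4}\b")

# Stand-in sample text, not read from the source file.
sample = "And God saw the light, that it was good"
matches = four_letter.findall(sample)
print(matches)

# \w also accepts digits, so a four-digit number counts as a "word" too.
digits = four_letter.findall("Printed in 1611 by royal command")
print(digits)
```

Only the words that are exactly four characters long come back. Shorter and longer words are skipped because the word boundaries have to sit on both sides of the four characters.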

Next, equipped with our regex, we create another variable called “four_letter_words.” In it we see our regex variable pressed against a re method we haven’t used before. In the previous lessons we used the kludgy match() function to make our hits. Here we’re using something more elegant. It’s findall(), which will return all the matches within our line as a list. And by connecting it to our pre-compiled “regex” variable, we’re setting that as the pattern it should look for.

We can expect plenty of lines with more than one match, so we’ll set up another loop to run through “four_letter_words” and print out all the hits. And then we’re done. Save and quit out of your script (ESC, SHIFT+ZZ in vim) and fire it up from the command-line…

python search.py

And, voila, there you have it. All the four-letter words in the KJV. Every f*cking one.

Unless I messed something up, of course. Per usual, if you spot a screw up, or I’m not being clear, just shoot me an email or drop a comment and we’ll sort it out. Hope this is helpful to somebody.

Python Recipe: Open multiple files, search for matches, count your hits.

I got some feedback from our beginners on the Python recipe I put up yesterday. They had a couple good questions about ways they can branch off, which I think we can cover pretty quick in another post.

To recap, Saturday’s script opened a single file (Shakespeare’s sonnets), searched the text line by line for a search term (“love”) using a basic regular expression, and then closed by printing the hits to a new text file. Today’s recipe will do all that, and a couple other things that might be helpful.

For reasons discussed in my previous post, I think munching through text with Python is going to be most useful for a reporter when she can leverage its power against large bodies of text. Our first example only operated on a single file. Out there in the real world, with deadlines, diets and kids to pick up at soccer practice, why should we invest the time learning to write a computer script to process a single file when we might be able to hack out the job with CTRL-F and just be done with it?

I feel that.

So, let’s take the next step. Let’s learn how to crack open a whole directory full of files and slam each one through our wood chipper.

But before we get going, let me just say that I’m going to assume you read yesterday’s recipe and won’t be working too hard to explain the stuff covered there. And keep in mind that my keystrokes are coming right off my home computer, which runs Ubuntu Linux. I’ll try to provide Mac and Windows translations as we go, but I might muck a phoneme here and there. If anything is screwed up and doesn’t work on your end, just shoot me an email or drop a comment. We’ll iron it out.

Formalities aside, here’s the example task I’ve selected to achieve our mission.

  1. Download the works of Friedrich Nietzsche.
  2. Train our computer to open the books one by one.
  3. Read through the text of each.
  4. Find all the lines that contain the German word for hate (hasse, hasst, hassen).
  5. Print out the hits.
  6. Count up the totals for each book and figure which one is the hatenest (das meisten hassten!).

Sound good? Let’s do it.

1. Open the command line, create a working directory, move there.

cd $HOME/Documents
mkdir py-search-multiple-files
cd py-search-multiple-files/
mkdir nietzsche

We’re going to start the same way we did yesterday, creating a working folder for all our files and moving in with our command line. The only difference this time is that we’re making an additional subdirectory to hold the source files we’ll be searching.

The commands should work just as easily in Mac as in Linux. If you’re working in Windows, you’ll be on the “C:/” file structure, rather than the Unix-style structure above. So you might “mkdir” a new working directory in your “C:/TEMP” folder or wherever else you’d like to work. Or just make a folder wherever through Windows Explorer and “cd” there after the fact through the command line.

2. Download our source files, the works of Friedrich Nietzsche.

If you visit Project Gutenberg, you can find a variety of Nietzsche’s work available for download. For our purposes, we’re going to take all of the books available there printed in the author’s native tongue, German. We could point and click our way through the process, visiting each book’s profile page and downloading its text to our new nietzsche folder, but if your aim is to become a big-time computer nerd, you might be interested in a command-line trick that can pull them all down with a single line of code.

Yesterday we used the curl utility to pull down our Shakespeare file. If you pulled that off, I’m sure you can easily imagine how it could be replicated with each of today’s files, provided that you know the right urls to hit. And I’m guessing it might look something like this.

curl -O http://www.gutenberg.org/dirs/etext05/7zara10.txt
curl -O http://www.gutenberg.org/dirs/etext05/7ecce10.txt
curl -O http://www.gutenberg.org/dirs/etext05/7gbrt10.txt
curl -O http://www.gutenberg.org/dirs/etext05/7gtzn10.txt
curl -O http://www.gutenberg.org/dirs/etext05/7jnst10.txt
curl -O http://www.gutenberg.org/dirs/etext05/7msch10.txt

But, man, that hardly seems easier than clicking around, does it? Thankfully, one of the great things you pick up as you learn your way around the command line is that there’s almost always a way to trim down a repetitive task into an elegant, simple string of code. Here’s how those six separate curls can be combined.

cd nietzsche
curl -O "http://www.gutenberg.org/dirs/etext05/7{zara,ecce,gbrt,gtzn,jnst,msch}10.txt"
cd ..

Remember how we used the (L|l) option statement in our regular expression yesterday to match our search pattern to phrases containing either an upper or lowercase ‘L’? We can do a similar thing here with curl, reducing the six urls to their common parts and providing a list of options between the {}’s where we plug each link’s unique string. We just use “cd” to commute down to our subdirectory and back. For more details on how curl works, try typing in

curl --help

or

curl --manual

Each should include instructions on all other sorts of crazy tricks you can pull off. And if you have something in mind, don’t forget to ask our good friend Google.
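If the curly-brace trick in the combined curl command feels like magic, here’s a rough Python 3 sketch of what the expansion amounts to. It only builds and prints the six addresses and doesn’t download anything. The variable names are mine, and the URLs are the same ones listed above.

```python
# The part every address shares, with {} marking the slot that changes.
base = "http://www.gutenberg.org/dirs/etext05/7{}10.txt"

# The options that sat between curl's curly braces.
codes = ["zara", "ecce", "gbrt", "gtzn", "jnst", "msch"]

# Plug each option into the slot to rebuild the full list of URLs.
urls = [base.format(code) for code in codes]

for url in urls:
    print(url)
```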

If you can’t get curl to work on your system, but you still want to play along, just go ahead and download the Nietzsche files one by one through your web browser. As long as you put them in the subdirectory we named after him, the stuff that follows should still work just fine.

3. Create our python script in the text editor of your choice.

vim search.py

The line above, which again should work for Linux or Mac, will open a new file in vim, the command-line text editor that I prefer. You can follow along, or feel free to make your own file in the application you prefer. If you’re a newbie Windows user, Notepad should work great.

If you’re following along in vim, you’ll need to enter “insert mode” so you can start entering text. Do that by hitting:

i

4. Write the code!

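#!/usr/bin/env python
import re, os

# The folder holding our source files
path = "./nietzsche"
# A list of every file name inside it
freddys_library = os.listdir(path)

for book in freddys_library:
    # Join the folder and file name into a full path, then open it
    file = os.path.join(path, book)
    text = open(file, "r")

    # Print each line that contains one of our search terms
    for line in text:
        if re.match("(.*)(hasse|hasst)(.*)", line):
            print line,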

Here’s what we’ll start with. If you cover up the top part of the script with your hand, you’ll see that the three statements at the end look almost identical to what we wrote in the first lesson. The script iterates through each line in a file (in this case dubbed “text”), seeks out a match using the same methods described in detail yesterday, and then finally prints out cases where we find a hit.

The only major difference is that we’ve replaced portions of yesterday’s statement designed to seek out variations on the word “love” with another quick-fix regex designed to net the common German forms of the word hate (hasse, hasst, hassen). Only “hasse” and “hasst” are spelled out in the pattern because any line containing “hassen” also contains “hasse.”

And then we’ve got all that junk up there above it. What’s going on there?

The first thing to notice is that we added another module to the import statement. In addition to the “re” module we’re using to match regular expressions, we’ve also introduced the “os” module. The os library hooks you up with a bunch of easy ways to pull in basic information about your operating system and file structure for use in Python. Our next two statements put it to use right away. First we store our nietzsche subdirectory in a variable called “path,” which is then passed to the os function listdir(). That will return a list of all the files contained within the directory. Regardless of how few, or how many, are stuffed down in there, the filenames will now all be stored in our second variable, “freddys_library.”

Our next step is to open up a loop that will iterate through each file name in “freddys_library.” Since the function simply returns each file’s name, not its path, we have to link the two before we can open the file. So the first step is another os function brought in to meld the two. Then we’re free to open the file the same way we did yesterday, which leads the way to the search-and-print loop we’re already familiar with. And since it’s stored within the loop stepping through each book’s file, it will be repeated for every title before the script ends.
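To watch listdir() and join() work on their own, here’s a small Python 3 sketch. So that it runs anywhere, it builds a throwaway folder holding two empty stand-in files. That folder is my assumption, and in the recipe the path would be “./nietzsche” instead.

```python
import os
import tempfile

# A throwaway directory with two empty stand-in files, so the sketch
# runs anywhere. In the recipe this would be "./nietzsche".
path = tempfile.mkdtemp()
for name in ["7zara10.txt", "7ecce10.txt"]:
    open(os.path.join(path, name), "w").close()

# listdir() hands back bare file names in no guaranteed order,
# so sort them to get a predictable list.
names = sorted(os.listdir(path))

# join() glues the directory and each name into a path open() can use.
full_paths = [os.path.join(path, name) for name in names]

for name, full in zip(names, full_paths):
    print(name, "->", full)
```

The bare names are what listdir() gives you, and the joined paths are what open() needs whenever your script isn’t sitting inside that folder.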

Now save and quit out of your script (ESC, SHIFT+ZZ in vim) and fire it up from the command-line…

python search.py

…and, voila, you should now have every line in Nietzsche that contains the word hate flying by on your screen.

Now here’s the next set of tricks.

#!/usr/bin/env python
import re, os
 
path = "./nietzsche"
freddys_library = os.listdir(path)
 
hate = open("hate.txt", "w")
 
for book in freddys_library:
    file = os.path.join(path, book)
    text = open(file, "r")
 
    hit_count = 0
    for line in text:
        if re.match("(.*)(hasse|hasst)(.*)", line):
            hit_count = hit_count + 1
            print >>  hate, book + "|" + line,
 
    print book + " => " + str(hit_count)
    text.close()

This second snippet is identical to our first draft, with a few additions. The simplest change is first, to create a new file (“hate.txt”) where our matches are now printed. You’ll notice that the print statement has also been modified to output the book’s file name and a pipe-delimiter along with each hit on hate. So each line in your out file should be labeled with the source file where it was found.

The second change is to introduce a new “hit_count” variable designed to keep a running count of the matches found in each book and report back the results. Since it’s enclosed within the outer loop, the first “hit_count = 0” statement will reset the number to nil on each book’s iteration. And then the placement of “hit_count + 1” within the subsequent if statement will click the variable’s total up one each time a match is made and the interpreter runs through that portion of the script. The final touch is to close each run through the loop by printing the book’s file name along with the total number of hits found after all of the lines have been evaluated. The number is enclosed in a str() function so that it’s converted from an integer into a string, which can be easily concatenated with other strings for our print statement.
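Here’s the counting logic pulled out on its own, as a rough Python 3 sketch. The count_hits() function is a name I made up for illustration, and the German lines are invented stand-ins rather than quotations from the books. I’ve also swapped in re.search(), which looks anywhere in the line, in place of the recipe’s re.match() with its “(.*)” padding.

```python
import re

# The same two search terms as the recipe, without the (.*) padding.
hate_pattern = re.compile(r"hasse|hasst")

def count_hits(lines):
    """Return how many lines contain at least one match."""
    hit_count = 0  # start from zero for every book
    for line in lines:
        if hate_pattern.search(line):
            hit_count += 1  # one tick per matching line
    return hit_count

# Invented stand-in lines, not quotations from Nietzsche.
sample_book = [
    "Ich hasse den Regen.",
    "Die Sonne scheint.",
    "Sie hassen den Winter, und er hasst den Sommer.",
]

total = count_hits(sample_book)
print("sample => " + str(total))
```

Notice that the last sample line holds two hits but only bumps the count by one. Like the recipe, this tallies matching lines, not individual words.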

When you run version two, it’ll now print out the total number of hits for each book, looking something like this:

7msch10.txt => 13
7zara10.txt => 34
7ecce10.txt => 2
7gtzn10.txt => 5
7gbrt10.txt => 2
7jnst10.txt => 4

It works, but it’s pretty ugly. How can you tell the different books apart without memorizing their file names? Good thing we can fix that too. Check out how.

#!/usr/bin/env python
import re, os
 
title = {
    "7ecce10.txt": "Ecce homo, Wie man wird, was man",
    "7gtzn10.txt": "Gotzen-Dammerung",
    "7msch10.txt": "Menschliches, Allzumenschliches",
    "7gbrt10.txt": "Die Geburt der Tragodie",
    "7jnst10.txt": "Jenseits von Gut und Bose",
    "7zara10.txt": "Also sprach Zarathustra"
}
 
path = "./nietzsche"
freddys_library = os.listdir(path)
 
hate = open("hate.txt", "w")
 
for book in freddys_library:
    file = os.path.join(path, book)
    text = open(file, "r")
 
    hit_count = 0
    for line in text:
        if re.match("(.*)(hasse|hasst)(.*)", line):
            hit_count = hit_count + 1
            print >>  hate, title[book] + "|" + line,
 
    print title[book] + " => " + str(hit_count)
    text.close()

After referring back to our text files to figure out which files contain which books, I made the Python “dictionary” at the top of this snippet. It pairs up the files with the titles for later reference, which happens easy peasy there at bottom when the loop’s current “book” variable is run against the dictionary to return its title for our output.
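One thing to watch for: if a file that isn’t listed in the dictionary ever lands in the nietzsche folder, a lookup like title[book] will stop the script with a KeyError. A minimal Python 3 sketch of one way around that, assuming you’d rather fall back to the raw file name, is the dictionary’s get() method. The “notes.txt” entry below is a hypothetical stray file I invented for the example.

```python
# Two entries borrowed from the recipe's dictionary.
title = {
    "7gtzn10.txt": "Gotzen-Dammerung",
    "7zara10.txt": "Also sprach Zarathustra",
}

# "notes.txt" is a hypothetical stray file with no entry in the dictionary.
freddys_library = ["7zara10.txt", "notes.txt", "7gtzn10.txt"]

# get() returns the title when it exists, or the second argument
# (here the file name itself) when it doesn't.
labels = [title.get(book, book) for book in freddys_library]

for label in labels:
    print(label)
```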

Now if you save your changes and fire it off again, you should be getting something more like this:

Menschliches, Allzumenschliches => 13
Also sprach Zarathustra => 34
Ecce homo, Wie man wird, was man => 2
Gotzen-Dammerung => 5
Die Geburt der Tragodie => 2
Jenseits von Gut und Bose => 4

Much nicer, nein?

Alright. That’s all for tonight. I hope this helps y’all kick the can a little further down the road. Per usual, if I’ve screwed something up, or I’m not being clear, just shoot me an email or drop a comment and we’ll sort it out. Hope this is helpful to somebody.

Python Recipe: Open a file, read it, print each line matching a search.

A couple of friends out there are valiantly teaching themselves the Python programming language in their free time. Who are they? Hack reporters like me, picking up computer skills in a continuing quest to better sift, organize and analyze information. And, in the process, maybe keep our jobs.

There are a couple great books available free online but it’s pretty tough to start stringing all the fundamentals into a problem-solving script all on your own. So why not write up some simple recipes that attack problems common to our particular tribe?

One of the ways computer programming can be of great use to a reporter is as a text parser. We all have more documents than we have time. So a common challenge is training your computer to read through a big blob of text and return any hits on terms you’re interested in (i.e. the name of the mayor, a popular pesticide, a roster of local police officers).

If it’s a one-off effort, you can probably get this done quickly using search tools included in common quality text editors (ex. Ultraedit, Notepad++, TextMate). But if you’ve got a steady stream of files, like a weekly dump of court filings, or a really big bad file, sometimes it’s preferable to train your computer to do the work for you.

In that spirit, the following instructions are designed to show you how to use Python to search through a text file (The Sonnets of William Shakespeare), find any lines that contain our sample search term (“love”), and then print out the hits into a new file we can keep as a memento.

We’ll be dealing with a source file that’s probably cleaner than most documents you’ll get from the government, and certainly a lot tidier than anything you’ve converted from a PDF file using an OCR application, but if you’re a total newbie, my hope is that this can help you get a grip on how the hell all the pieces described in the textbooks fit together into something almost useful.

Since I’m now a full-time geek, I do most of my work on computers that run some flavor of Linux. The step-by-step instructions that follow will walk you through each keystroke on the command line in Ubuntu, which is what I run at home. But since most people who might be interested in this are probably running Windows XP or Mac OS X, I’ll try to include translations as we go.

The one prerequisite for the whole endeavor is that you already have a working installation of Python. If you’re working in Windows and you don’t, I’d recommend visiting ActiveState and downloading the installer for their ActivePython distribution. If you’re rocking a Macbook, you can find out whether you’re rolling with Python by opening your terminal and entering the following:

which python

If you’ve got it properly installed, it should return something like

/usr/bin/python

If it’s not working out, I’d recommend the installation instructions in Mark Pilgrim’s excellent book, Dive Into Python.

Alright, with all that out of the way, let’s get to the recipe.

1. Open the command line, create a working directory, move there.

cd $HOME
mkdir Documents/py-search
cd Documents/py-search

The three commands above, which should work just as easily in Mac as in Linux, will move us to our home directory, create a new subdirectory in your Documents folder, and relocate to the new folder.

If you’re working in Windows, you’ll be on the “C:/” file structure, rather than the Unix-style structure above. So you might “mkdir” a new working directory in your “C:/TEMP” folder or wherever else you’d like to work. Or just make a folder wherever through Windows Explorer and “cd” there after the fact through the command line.

2. Download our source file, The Sonnets of William Shakespeare.

curl -O http://www.gutenberg.org/dirs/etext97/wssnt10.txt

The line above uses the curl command line utility to download a copy of Shakespeare’s work from the Project Gutenberg Web site. Mac users with curl installed should be able to issue the same command. Windows users, or anyone without curl, will probably be able to most easily snatch the file just by visiting the link in a web browser and saving the file to the working directory created in step one.

3. Create our python script in the text editor of your choice.

vim py-search.py

The line above, which again should work for Linux or Mac, will open a new file in vim, the command-line text editor that I prefer. You can follow along, or feel free to make your own file in the application you prefer. If you’re a newbie Windows user, Notepad should work great.

If you’re following along in vim, you’ll need to enter “insert mode” so you can start entering text. Do that by hitting:

i

4. Write the code!

#!/usr/bin/env python
import re
 
shakes = open("wssnt10.txt", "r")
 
for line in shakes:
    if re.match("(.*)(L|l)ove(.*)", line):
        print line,

If, like my friends, you’ve been working through some common Python tutorials, I’m guessing a lot of that looks familiar to you.

The first line is a “shebang” that, on execution of the file, instructs the computer to process the script using the python interpreter. The “import re” pulls in Python’s standard regular expression module so we can use it later to search the text. The open() command grabs the Shakespeare file we’ve just downloaded and opens it up. The “r” is for read mode.

The three staggered statements that follow are a loop that runs through each line in the document, as dictated by the first statement. The second statement uses the re.match() function we imported at the top to evaluate the latest line on each iteration through the loop by testing it against that scary looking mess in its first parameter.

So, what is that thing? “(.*)(L|l)ove(.*)”, say what?

That’s a regular expression I designed specifically to catch any instances of the search term I’m after. If you’re not familiar with regular expressions, they’re a powerful language for matching strings of text. When you first get started, they can be a bit intimidating, but once you learn a couple tricks, you’ll quickly see how useful they can be. One of my favorite geek jokes is this cartoon on the utility of a well-timed “regex.”

So how does it work? There are two tricks to learn. Remember, our goal is to find any line in Shakespeare’s sonnets that includes the word love. But, when we think about it, we can’t just search for “love” because our loop is evaluating the text line by line, not word by word, and re.match() only tests for a match at the very beginning of each line. So if we just ask for “love,” we’d only get lines that start with “love.” Plus the word could appear in any number of common grammatical variations (ex. “Love,” “lover,” “lovesick,” “self-love”) that we’d also like to capture.

That’s where the regular expressions come in. You’ll notice that the expression is bracketed by two “(.*)” statements. In regular expression language, the “.” command matches any single character and the “*” repeats whatever command precedes it zero or more times, so together they will match any string of any length. When bracketed around a search term, like “love,” it should return a match on a line of text regardless of where in the line “love” appears. In other words, it would match “She loves you,” “love is a many splendored thing” or “ain’t talking ’bout love.”
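Here’s a quick Python 3 sketch of why the padding matters. re.match() only tests at the very start of the string, so a bare “love” misses a line where the word shows up later, while the padded version finds it. The last call uses re.search(), another function in the re module that hunts anywhere in the string without padding. The recipe doesn’t use it, but it’s handy to know about.

```python
import re

line = "ain't talking 'bout love"

# match() only looks at the beginning of the line, so this comes up empty.
anchored = re.match("love", line)

# The (.*) padding soaks up whatever surrounds the search term.
padded = re.match("(.*)love(.*)", line)

# search() scans the whole line, so no padding is required.
searched = re.search("love", line)

print(anchored is None)
print(padded is not None)
print(searched is not None)
```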

But, all by itself “(.*)love(.*)” wouldn’t match “America: What Time is Love?” or “Love Is Only A Feeling.” Why not? Because those songs have an uppercase L and we’re just asking for lowercase. Bummer, right?

One way to fix that would be to add an option that gives the regular expression variations on the term to look for. You can do that by adding another parenthesis set and separating the options with a “|” pipe. That’s where the “(L|l)” above came from. Combine that with the (.*) commands and we should have a quick and dirty regex to catch the lines we’re after. Though quick studies will catch a flaw in the design. As we’ll see in our result set later, this sort of dragnet approach will also yield hits on things we might not want to catch. Words like “glove” and “lovely” will match just as easily as “lovesick” or “lover.” Feel free to tweak the statement and try to finetune your results. There’s a ton more you can do with regular expressions than what I’ve described. So don’t take my example too seriously. I just wanted to show off a couple of the most common regex commands.
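If you want a head start on the tweaking, here’s one possible direction, sketched in Python 3. It’s just one option among many. It uses a “\b” word boundary so the match has to begin at the start of a word, which throws out “glove,” and the re.IGNORECASE flag so one pattern covers both “Love” and “love.” The word list is made up for the test.

```python
import re

# \b forces the match to begin at the start of a word, and
# re.IGNORECASE lets one pattern cover both "Love" and "love".
pattern = re.compile(r"\blove", re.IGNORECASE)

# A made-up list of test words.
words = ["Love", "lover", "lovesick", "self-love", "glove", "lovely"]

kept = [word for word in words if pattern.search(word)]
print(kept)
```

“glove” drops out, but “lovely” still slips through, so there’s more tuning to do if you don’t want it.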

5. Save your script and run it.

If you’re working along with me in vim, you’ll need to save your work before exiting. The easiest way to do that is to exit insert mode by hitting the ESC key and then hold SHIFT and hit the Z key twice in a row. If you’re working in your own text editor, just save it however you’re comfortable.

Now jump back onto the command line resting in your working directory and tell python to fire that mother off.

python py-search.py

Voila. There they are. Flying across your screen is every line in Shakespeare’s sonnets containing the word love. And if you wanted to print them out to a new text file, rather than just dump them on the screen, jump back into your script and try something more like this.

#!/usr/bin/env python
import re
 
shakes = open("wssnt10.txt", "r")
 
love = open("love.txt", "w")
 
for line in shakes:
    if re.match("(.*)(L|l)ove(.*)", line):
        print >> love, line,

Now just open love.txt and you should find the same results as before.

The only difference in this script is that we’re now opening an outfile called love (notice that it’s “w” mode, for write, rather than “r” mode like the source) and modifying our print line to kick the results there, instead of the console.
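For anyone running Python 3, where the “print >>” syntax above no longer works, here’s a rough sketch of the same read-filter-write idea. So that it runs anywhere, it writes a tiny made-up source file into a temporary folder instead of reading wssnt10.txt. The file contents and the “with” blocks (which close the files for you) are my additions, not part of the recipe.

```python
import os
import re
import tempfile

# A temporary folder and a tiny made-up source file stand in
# for the working directory and wssnt10.txt.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "sample.txt")
out_path = os.path.join(workdir, "love.txt")

with open(source_path, "w") as source:
    source.write("Love is first\nnothing here\nmy lover waits\n")

# Read in "r" mode, write in "w" mode, and copy over only the hits.
with open(source_path, "r") as shakes, open(out_path, "w") as love:
    for line in shakes:
        if re.search("(L|l)ove", line):
            love.write(line)

# Read the out file back to check what landed there.
with open(out_path, "r") as love:
    saved = love.read()

print(saved, end="")
```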

That’s all folks. Per usual, if I’ve screwed something up, or I’m not being clear, just shoot me an email or drop a comment and we’ll sort it out. Hope this is helpful to somebody.