Five ways your data app can catch the big news hook.

01. Practice news-driven development

Most data-driven news applications I’ve encountered follow what I would call The Chicago Crime model, a name lifted from Adrian Holovaty’s famous site. Steady streams of government-provided data are repurposed into a flexible interface that allows users to compare disparate sources (“the mashup”) and easily localize the information so it can provide particulars to a wide body of users (“the long tail”).

It’s a brilliant model, the app that launched a thousand ships. But it’s not the only way to get things done.

In news terms, where minutes matter, it can still take a relatively long time to do, especially when it comes to data acquisition. Let’s face it, if you’re using government data as your starting point, the idea of a SOAPy API is laughable. So don’t get your hopes up. Goofing around with delicious tags or Flickr photos is fun, but if you want to do something original from the public sector, they’re only going to get you so far. You’re going to be FOIA’ing, or, if you’re lucky, scraping. And then you’re going to be cleaning. Especially if you’re invested in serving accurate and consistent information. Because if there’s a government database out there that’s ready to serve, I’ve yet to see it.

And there’s usually not much of a news hook. Look, I appreciate Everyblock and Chicago Crime and that whole style. Hell, I’ve essentially remodeled my career to emulate them. But when you get down to it, they’re essentially built around the idea that umpteen little news hooks (“Someone was robbed in my neighborhood,” “A liquor store wants to open up on your block.”) will add up to something greater than the sum of their parts. That “hyperlocal” or “long tail” philosophy, to use the parlance of our time, may ultimately be where a lot of us end up, but blockbuster news is still happening and there’s no reason all the same tools that made Chicago Crime successful can’t be used to cover the hell out of a big story when it breaks.

I had just such an opportunity last Friday at the L.A. Times. Late in the afternoon, news broke that a commuter train had crashed in the Valley, potentially killing many riders on board. We didn’t know how many fatalities to expect, nor how long it would take for their identities to be released. But we knew that our audience was going to want to know, and as soon as possible. The typical newspaper.com way to handle this sort of thing is to publish a simple list, or “blob of text”, when it’s available. And then follow up later with a scattershot of obituaries, usually released as they appear in the paper. But, when you think about it in terms of the Holovaty manifesto and the general concept of the Internet, there’s really no reason that information couldn’t be better collected and presented as a browsable database application. It’s a lesson the L.A. Times learned earlier this year when our ripoff of Adrian’s Faces of the Fallen concept reinvigorated the way the paper covers military casualties.

It meant staying late at work on a Friday night, busting ass most of my weekend, and putting more faith in memcached than most IT people are comfortable with, but the result was that when the government finally did cough up the fatality list we were ready to immediately publish it as a linked database that, over time, has been filled in by further reporting to include greater detail, photos, and more than 1,600 user comments, many of them extremely moving. It’s a long way from perfect, but it provided some amount of public service, was way ahead of the competition and generated a pretty goodly amount of traffic along the way. The site is called Chatsworth Metrolink Crash.

That’s all my long way of saying that I think big events matter and that database journalists shouldn’t be afraid to dive in when they happen. Whether it’s posting the location of hurricane shelters, letting people know who the hell all those superdelegates are, or connecting survivors following a disaster, there are plenty of obvious opportunities to do our thing. But it’s not going to happen if we don’t see taking on big news as an opportunity, anticipate things like the next hot Google search term, or have the capability to deploy very very quickly.

I’m a long way from an authority on the whole deal, but I’m stumbling my way through it. And here are a couple things I’ve learned along the way.

02. Let last year’s data be your guide.

Earlier this month, we released California Schools Guide, a collection of data about public and private schools across the state, at the very moment the government lifted its embargo on this year’s scores. I didn’t have the newsworthy data in hand until less than 24 hours before it would be publicly released. But by developing the site in advance using the previous year’s data as dummy entries, I was able to pre-script the loading of the 2008 data after only a few minor changes to the code. This meant that we were able to get our product out when the news hook dropped, at the same time as the paper was otherwise promoting an investigative story on the topic and the state’s propaganda arms were blasting their own message (“Things are getting better! Trust us!”).
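Here’s a rough sketch of that idea, in Python 3 and with SQLite standing in for whatever database you use: write the loader once against last year’s file, then point the same function at the new file when the embargo lifts. The table layout, column names and school names are all made up for illustration. This is not the code behind the Schools Guide.

```python
import csv
import io
import sqlite3


def load_scores(conn, csv_text, year):
    """Load one year's scores. The same code path serves dummy and real data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scores (school TEXT, year INTEGER, score INTEGER)"
    )
    # Clear any earlier load for this year so the script can be re-run safely.
    conn.execute("DELETE FROM scores WHERE year = ?", (year,))
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(row["school"], year, int(row["score"])) for row in reader]
    conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")

# Build and test the app against last year's (hypothetical) file...
load_scores(conn, "school,score\nHypothetical High,700\n", 2007)

# ...then run the very same loader when the new file arrives.
count = load_scores(
    conn, "school,score\nHypothetical High,720\nExample Elementary,810\n", 2008
)
print(count)
```

The point is only that the loader takes the year as an argument, so nothing about it has to change on deadline except the file you feed it.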

03. Don’t Repeat Yourself, unless it saves you time.

Let me be clear. The DRY goal of elegance through efficiency is laudable. And, as a guiding principle for development, you probably can’t get any better. It is the single point of truth. It’s like natural selection, except for awesomeness. But when you’re on a tight deadline, and you’ve already got a code implementation that works, sometimes you JDFWI, Just Don’t Fuck With It. Yeah, so maybe you just copied and pasted and introduced a little redundancy. And maybe your CSS is just a hodgepodge of divs repurposed from other apps. But it works, right? And what’s more important, trimming down your code base, or getting the news out ahead of your competition?

04. Use Django’s admin to your advantage.

For anyone who’s already doing this stuff, it probably goes without saying, but Django’s admin is really great. As soon as your database models are written, you’ve instantly got a set of entry forms that are ready to deploy. This is incredibly useful when trying to turn around simple data apps on deadline. For instance, when it came to the Metrolink crash, I was able to get the models and admin up Friday night so that reporters on Metro desk could begin working on entry as I shifted to work on the views and templates.

05. Publish now, or perish.

You can have the greatest app in the world, but if you can’t push it out to the web ASAP, you’re nowhere. If you’re going the Chicago Crime route, this isn’t as big of a deal. But if you’re trying to hit the big news hook, it’s utterly essential. And treating big news like you would anything else on your “product schedule” or “iteration cycle” just isn’t going to be good enough. You can call it a waterfall, you can call it reckless, you can call it news-driven development.

Tickertube, Ben’s first stab at Amazon Web Services.

Yesterday I launched Tickertube.org, my first attempt at hosting a site using Amazon’s EC2 service. It’s a simple app, just an ever refreshing list of links from sites that write about telecommunications policy. I used to cover this stuff in DC, and I don’t really like using RSS readers, so it’s useful for me, if not anyone else.

But my objective isn’t to build a hit site. I just want to figure out Amazon’s toys. What I learned is that while they aren’t all that well documented, they can be a lot of fun once you figure out the basics. You’ll have to do more hands-on server configuration than you would with Google App Engine, but greater control does come with benefits.

I’d like to use Tickertube to woodshop a little in developing for smart phones. But since I don’t have an iPhone or Blackberry, I don’t have any way to test it out. Or a lot of motivation to get it done. But if somebody out there would like to use the site with a mobile device (and wouldn’t that be a shock!), just let me know and I’ll try to put in the extra time to adapt the HTML. Same goes if there are any feeds you’d like me to add to the pool. Just shout.

Thanks to all the great tools that made this project easy. Besides Amazon, much love to Django, YUI and Feedjack.

Ben’s hip hop Twitter bot.

Has anyone else seen @hemingway, this weird Twitter feed that just spouts Ernie quotes every once in a while? Well, tonight I decided to code up my own twist on the idea. Follow @mistadobalina to receive hourly bursts of verse from one of my favorite albums, I Wish My Brother George Were Here by Del Tha Funkee Homosapien.

The whole thing is automated and took about 30 to 45 minutes of work. So don’t expect any miracles. But all the code is over on github if anybody wants it. I had a couple problems (no matter what album I asked for, I was only getting track listings for Staind), but the LyricWiki SOAP service is a pretty sweet Web service.

California’s War Dead.

This Memorial Day weekend marked the formal launch of California’s War Dead, our database of the state’s casualties from the wars in Afghanistan and Iraq. It’s the result of a lot of hard work by many people at the paper, a large share of which had already been carried through the years by our many obituary writers.

The site is designed to let users explore the data using a variety of criteria (for example, you can quickly look up fallen troops by hometown, high school or marital status) and to learn more about individuals by reading their obituaries from our back archives. Choice quotes have been selected to “pop” out of the individual profile pages and visitors are encouraged to leave memories and thoughts as comments.

Besides all my coworkers who pitched in to make this happen on a tight deadline, thank yous should be extended to all the great developers in the Django community. They not only provided the Web programming tools that made this idea possible, but also the leadership that showed me how the tools can be used to make journalism for the Web, not just on the Web. The same goes for all the people in the NICAR community who, by leading by example, have pushed me to keep learning new things and have the courage to take chances outside of journalism’s well worn comfort zones. Personally, I just hope that first group can forgive me for ripping off their ideas and that the second group doesn’t resent my getting the opportunity to do things like this without having to put in the once requisite 5 to 10 years on the cops-and-courts beat.

If you’re stretched for time, or maybe doubting there’s anything new to be learned about the war, let me promote a couple spots that might interest you.

  • Over the course of assembling the data, I was surprised to learn how many immigrants to California have died. It’s more than fifty, from Mexico and the Philippines and South Korea and a number of other places. Check out the lists here. A fascinating story is of Sgt. Rafael Peralta of San Diego, who enlisted the same day he received his Green Card and died in Fallouja, Iraq, when he sacrificed himself to save his compatriots from a grenade attack. His profile is here and the story of his heroic death is here.
  • The most rewarding part of the project for me has been to see how quickly we’re getting great, thoughtful comments submitted by friends and family members of the deceased. One of my goals in the design was to give their writing equal footing with our previous reporting. It can be heartbreaking to read, but I’m proud to have helped make something that people think is worthy of such sensitive information. Examples I find particularly moving are the memories shared by the family of Sgt. Jason J. Buzzard of Ukiah and Corporal Christopher D. Leon of Lancaster, who I’m honored to know better now than I did before our commenters contributed.
  • It seems natural to expect that spending so much time with casualty data would have a numbing effect. But I think that’s only the case when we let the very real people we’ve lost remain numbers in a casualty count or unknown names on a page. It’s the stories that bring them to life, and my experience has been that the more stories you hear, the less numb you feel. The pain is in the details. A moving example is Teresa Watanabe’s obituary of Lt. Mark J. Daily of Irvine, who was inspired to join the war by the political writing of war advocate Christopher Hitchens. Hitchens has since gone on to write a moving response to learning of Daily’s readership, and sacrifice, that you can find here.

Python Recipe: Connect to a MySQL database, execute a query, print the results.

Today let’s take a look at how you can use Python to connect to your MySQL database, issue a SQL query and do things with the results.

It’s not that hard to write a simple SQL command by hand and muck with the results, so why bother? Like earlier recipes, our reward will be saving time and energy by automating a task that would normally require manual labor. The payoff becomes clear when you bump into something you need to do 500 times, or make part of your daily routine.

For example, one thing I try to do at work is keep documentation related to all of my datasets. I keep a series of wiki pages devoted to each database that lists its data sources, explains its origins, catalogs helpful SQL snippets and defines the fields in each table.

The wiki application I use — TWiki — accepts HTML code and the convention I’ve settled on is to format each table’s field definitions in a standard set of table tags. So every time I add a new MySQL table and want to document it, I need to build another HTML table for the wiki. That’s a pain to do by hand, especially when the table has a lot of fields. To save Ben the trouble, let’s write a simple Python script that will…

  1. Log into the MySQL database
  2. Acquire the list of columns from a MySQL table
  3. Print the columns out in an HTML table

It’s not very exciting, but it’ll introduce you to the basics. And once you start walking you’ll quickly be able to run.

1. Install the MySQLdb module.

Before you can tap into your MySQL db with Python, you need to install the “MySQL for Python” module (a.k.a. MySQLdb). The file is found here, and, while I haven’t tested it, it looks like there’s a .exe installer for you Windows kids. There’s also a nice stab at cataloging different methods here. If, like me, you’re running Ubuntu Linux, installation is as simple as opening GNOME’s package manager and selecting the “python-mysqldb” package, or running an “apt-get” from your terminal.

You can test whether it’s been properly installed by opening up your python shell and trying to import the module. So fire up the shell…

python

…and pop it off…

Python 2.5.1 (r251:54863, Mar  7 2008, 04:10:12)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import MySQLdb
>>>

If the interpreter accepts the commands and kicks down the next line without an error, you know you’re okay. If it throws an error, you know something is off.
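If you’d rather have a script tell you what’s missing than eyeball the shell, here’s a small sketch of the same check. It’s written for Python 3, not the Python 2.5 session shown above, and module_available is a helper name I made up. It only looks the module up, it doesn’t import it.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


# Report on one module from the standard library and the one we just installed.
for name in ("json", "MySQLdb"):
    status = "ok" if module_available(name) else "missing"
    print(name + ": " + status)
```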

2. Open the command line, create a working directory, move there.

Before we get going, let me just say that I’m going to assume you read the first couple recipes and won’t be working too hard to explain the stuff covered there. And keep in mind that my keystrokes are coming right off my home computer, which runs Ubuntu Linux. I’ll try to provide Mac and Windows translations as we go, but I might muck a phoneme here and there. If anything is screwed up and doesn’t work on your end, just shoot me an email or drop a comment. We’ll iron it out.

We’re going to start the same way we did in the first lessons, creating a working folder for all our files and moving in with our command line.

cd Documents/
mkdir py-connect-to-mysql
cd py-connect-to-mysql/

The commands should work just as easily in Mac as in Linux. If you’re working in Windows, you’ll be on the “C:/” file structure, rather than the Unix-style structure above. So you might “mkdir” a new working directory in your “C:/TEMP” folder or wherever else you’d like to work. Or just make a folder wherever through Windows Explorer and “cd” there after the fact through the command line.

3. Create our python script in the text editor of your choice.

vim py-connect-to-mysql.py

The line above, which again should work for Linux or Mac, will open a new file in vim, the command-line text editor that I prefer. You can follow along, or feel free to make your own file in the application you prefer. If you’re a newbie Windows user, Notepad should work great.

If you’re following along in vim, you’ll need to enter “insert mode” so you can start entering text. Do that by hitting:

i

4. Write the code!

#!/usr/bin/env python
import sys, MySQLdb
 
def PrintFields(database, table):
    """ Connects to the table specified by the user and prints out its fields in HTML format used by Ben's wiki. """
 
    host = 'localhost'
    user = 'user'
    password = 'password'
    conn = MySQLdb.Connection(db=database, host=host, user=user, passwd=password)
    mysql = conn.cursor()
 
    sql = """ SHOW COLUMNS FROM %s """ % table
    mysql.execute(sql)
 
    fields=mysql.fetchall()
 
    print '<table border="0"><tr><th>order</th><th>name</th><th>type</th><th>description</th></tr>'
    print '<tbody>'
 
    counter = 0
    for field in fields:
        counter = counter + 1
        name = field[0]
        type = field[1]
        print '<tr><td>' + str(counter) + '</td><td>' + name + '</td><td>' + type + '</td><td></td></tr>'
 
    print '</tbody>'
    print '</table>'
 
    mysql.close()
    conn.close()
 
users_database = sys.argv[1]
users_table = sys.argv[2]
print "Wikified HTML for " + users_database + "." + users_table
print "========================"
PrintFields(users_database, users_table)

Obviously, the first thing you need to do is import the modules we need. The “sys” module will allow us to accept inputs from the command line later, and our new friend, MySQLdb, will help us connect to the database.

Then you can see that the bulk of the script is taken up by a function, named PrintFields, that extends from line four to line 32. That contains all of the card tricks we’ll need to connect to our local db, run a query and print it out however we want. In this case, we’re spitting out the data in an HTML shell I’ve concocted to create a complete table when it’s all done.

After the function ends at line 32, the remainder of the script uses sys to grab the first two arguments passed in by the user and hand them over to the function. So this way, by providing the database and table names at the time of execution, I can ask the script to print out the fields from any table I’ve got. For instance, we can print out the generic time_zone table that comes with MySQL’s “mysql” settings database like so…

python py-connect-to-mysql.py mysql time_zone

And you should get something like …

Wikified HTML for mysql.time_zone
========================
<table border="0"><tr><th>order</th><th>name</th><th>type</th><th>description</th></tr>
<tbody>
<tr><td>1</td><td>Time_zone_id</td><td>int(10) unsigned</td><td></td></tr>
<tr><td>2</td><td>Use_leap_seconds</td><td>enum('Y','N')</td><td></td></tr>
</tbody>
</table>

Then what I’d normally do is just copy and paste that into my wiki. The script doesn’t have any error handling or fancy tricks, which I think makes it a good starter example on the basics. Let’s walk through the things you’ll need to know.

Lines 7 to 9:
    host = 'localhost'
    user = 'user'
    password = 'password'

This first snippet contains all the local information about your MySQL that will need to be customized to fit your rig. You’ll need to change the definitions for user and password to whatever it is you use. And if the database you want to tap isn’t on your localhost, but perhaps networked elsewhere, you’ll need to change the host definition to its IP address or alias.
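If you don’t want your password sitting in the script, one option is to read those three settings from environment variables and fall back to the defaults above. This is only a sketch in Python 3, and the variable names (MYSQL_HOST, MYSQL_USER, MYSQL_PASSWORD) are ones I picked for the example, not anything MySQLdb looks for on its own.

```python
import os


def connection_settings(environ=os.environ):
    """Build the keyword arguments for the connection from the environment."""
    return {
        "host": environ.get("MYSQL_HOST", "localhost"),
        "user": environ.get("MYSQL_USER", "user"),
        "passwd": environ.get("MYSQL_PASSWORD", "password"),
    }


# Passing a plain dict makes it easy to see what happens with and without overrides.
print(connection_settings({"MYSQL_HOST": "db.example.com"}))
```

The resulting dictionary uses the same keyword names (host, user, passwd) that the script hands to MySQLdb, so it could be unpacked straight into the connection call.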

Lines 10 and 11:
    conn = MySQLdb.Connection(db=database, host=host, user=user, passwd=password)
    mysql = conn.cursor()

Then this next step will pass all of your local specifics to MySQLdb and open up a “cursor.” You can use that to interact with the database in the same way you normally would with Microsoft Access or another piece of GUI software. A lot of people might name the variable containing the cursor as “cursor,” but it really doesn’t matter. As seen above, I like to name it mysql. It’s just personal preference. Notice that we didn’t define the database variable in the earlier snippet. That’s because it’s being passed into the function by the user. We know that because it’s up there at the top of our function…

Line 4:
def PrintFields(database, table):

…and fed in at the bottom after we capture the user input with sys…

Line 34:
users_database = sys.argv[1]

…and then passed in with our function call…

Line 38:
PrintFields(users_database, users_table)

But what the hell is in that argv variable, anyway? Let’s find out. I’m going to edit our script to include the following line…

print sys.argv

…run the script again…

python py-connect-to-mysql.py mysql time_zone

…and here’s what we get…

['py-connect-to-mysql.py', 'mysql', 'time_zone']
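Since the first item in that list is always the script’s own name, the user’s arguments start at index 1. The script as written will blow up with an IndexError if you forget one of them. Here’s a hedged sketch, in Python 3, of a friendlier check. parse_args is a name I invented for the example and it takes the list as a parameter so you can try it without a command line.

```python
def parse_args(argv):
    """Return (database, table) from an argv-style list, or raise ValueError."""
    # argv[0] is the script name, so a valid call has exactly three items.
    if len(argv) != 3:
        raise ValueError("usage: %s <database> <table>" % argv[0])
    return argv[1], argv[2]


database, table = parse_args(["py-connect-to-mysql.py", "mysql", "time_zone"])
print(database, table)
```

In a real script you would pass in sys.argv and print the error message before quitting.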

Pretty self-explanatory, right? Now back to our function.

Lines 13 and 14:
    sql = """ SHOW COLUMNS FROM %s """ % table
    mysql.execute(sql)

With our connection set, the next thing to do is to use MySQL to run a query. I do it by storing a SQL command in a variable and then passing it to the cursor’s execute() method. You’ll note that I use Python’s “%s” string-formatting placeholder to write in the contents of the table variable, which, like the database variable, has been passed into the function by the user. By doing that, we’re now able to run that same SHOW COLUMNS command on whatever database and table combination we pass in. Provided the table actually exists.

It’s simple tricks like that which will enable you to really start flying with automation. What if your job required you to kick out data reports for each or any of the 50 states on command, or if you wanted to automate a web scrape to deposit its findings in your database every time it runs, or maybe even make an RSS feed that updates every 15 minutes? A simple concept like this, which allows for some of the SQL specifics to be specified programmatically, could save you a ton of time.
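One caution about writing a table name into SQL this way: whatever the user types ends up in the query. Database drivers can safely quote values for you, but a table name is an identifier, so it has to be pasted in as text. A cheap safeguard is to check the name first. The sketch below is Python 3, show_columns_sql is a made-up helper, and the “letters, digits and underscores only” rule is my own assumption about what your table names look like.

```python
import re


def show_columns_sql(table):
    """Build a SHOW COLUMNS statement, refusing suspicious table names."""
    # Only allow plain identifiers: letters, digits and underscores.
    if not re.fullmatch(r"[A-Za-z0-9_]+", table):
        raise ValueError("unexpected table name: %r" % table)
    return "SHOW COLUMNS FROM `%s`" % table


print(show_columns_sql("time_zone"))
```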

Line 16:
    fields=mysql.fetchall()

This next line will store the results of the query in a variable called fields, which we’ll print out using one of the simple loops I covered in previous recipes, pulling out the first and second fields (name and datatype) for my table. You could take a look at what’s in the results by printing them out to your terminal before running the loop. Let’s try that by adding…

print fields

…to our script. Run it again from the top and now you’ll get this…

(('Time_zone_id', 'int(10) unsigned', 'NO', 'PRI', None, 'auto_increment'), ('Use_leap_seconds', "enum('Y','N')", 'NO', '', 'N', ''))

Since I don’t want all of the settings for my docs, just the name and field type, as we loop through each row I only pull out the first two items (field[0], field[1]). All the rest of the mess around there is designed to print out the data in my custom HTML shell, which really shouldn’t matter for your purposes, so why bother here. So, what the hell, I think we’re done. Per usual, if you spot a screw up, or I’m not being clear, just shoot me an email or drop a comment and we’ll sort it out. Hope this is helpful to somebody.
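If you want to play with the formatting step without a database handy, here’s a sketch that does the same job as the loop, but as a function you can feed the sample rows printed above. It’s Python 3, fields_to_html is a name I made up, and it builds the text and returns it instead of printing line by line.

```python
def fields_to_html(fields):
    """Turn SHOW COLUMNS rows into the wiki table, keeping only name and type."""
    lines = [
        '<table border="0"><tr><th>order</th><th>name</th>'
        '<th>type</th><th>description</th></tr>',
        '<tbody>',
    ]
    # enumerate() does the counting that the counter variable did by hand.
    for order, field in enumerate(fields, start=1):
        name, column_type = field[0], field[1]
        lines.append(
            '<tr><td>%d</td><td>%s</td><td>%s</td><td></td></tr>'
            % (order, name, column_type)
        )
    lines.append('</tbody>')
    lines.append('</table>')
    return "\n".join(lines)


# The two rows from the time_zone example above.
sample = (
    ('Time_zone_id', 'int(10) unsigned', 'NO', 'PRI', None, 'auto_increment'),
    ('Use_leap_seconds', "enum('Y','N')", 'NO', '', 'N', ''),
)
html = fields_to_html(sample)
print(html)
```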

Python Recipe: Print a future date in the format you want.

Enough with all the talky talky, here’s a simple snippet I cooked up for a friend this morning to solve his problem of the moment: how to coax Python into printing out a future date (6 weeks in the future, to be exact) in the format he wants. Hope it’s useful to somebody. Let me know if I screwed anything up.

>>> import datetime
>>> now = datetime.datetime.now()
>>> print now
2008-04-21 10:19:35.832928
>>> from datetime import timedelta
>>> diff = datetime.timedelta(days=42)
>>> print diff
42 days, 0:00:00
>>> print now + diff
2008-06-02 10:19:35.832928
>>> future = now + diff
>>> future.strftime("%m/%d/%Y")
'06/02/2008'

Documentation on how you can customize strftime to print dates in the format you need can be found here. Scroll down to the middle-ish part of the page.
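For anyone who would rather have this as a script than a shell session, here’s the same calculation as a short Python 3 sketch. I’ve pinned the starting point to the date from the session above so the result is repeatable. Swap in datetime.now() for real use.

```python
from datetime import datetime, timedelta

# Fixed start date taken from the session above, so the output never changes.
start = datetime(2008, 4, 21, 10, 19, 35)

# Six weeks is the same as the 42 days used above.
future = start + timedelta(weeks=6)

us_style = future.strftime("%m/%d/%Y")
iso_style = future.strftime("%Y-%m-%d")
print(us_style)   # 06/02/2008
print(iso_style)  # 2008-06-02
```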

Python Recipe: Grab a page, scrape a table, download a file.

Here’s a change of pace. Our first few lessons focused on how you can use Python to goof with a bunch of local files. This time we’re going to try something different: using Python to go online and screw around with the Web.

Whenever I caucus with aspiring NICARians and other data hungry reporters, it’s not long before the topic of web scraping comes up. While automated text processing and database management may sound well and good, there’s something sexy about pulling down a fatty government database that catches people’s imagination and inspires them to take on the challenge of learning a new programming language. Or at least entertain the idea until they run into a road block.

A number of fellow travelers do a noble job instructing people on the basics during NICAR’s annual seminars. But scraping seems like such a sought-after skill that it feels like a good idea to throw up a basic walkthrough here, where beginners can cut and paste code and any feedback can be memorialized.

But before we get going, let me just say that I’m going to assume you read the first couple recipes and won’t be working too hard to explain the stuff covered there. And keep in mind that my keystrokes are coming right off my home computer, which runs Ubuntu Linux. I’ll try to provide Mac and Windows translations as we go, but I might muck a phoneme here and there. If anything is screwed up and doesn’t work on your end, just shoot me an email or drop a comment. We’ll iron it out.

Formalities aside, here’s the example task I’ve selected to achieve our mission.

  1. Install the necessary Python modules, mechanize and Beautiful Soup.
  2. Train our computer to visit Ben’s list of The Greatest Albums in the History of 2007.
  3. Parse the html and scrape out Ben’s rankings.
  4. Click through to Ben’s list of The Greatest Albums in the History of 2006 and repeat the scrape.
  5. Do it all over again, but this time download the cover art.

1. Download the mechanize and Beautiful Soup modules. Install them.

There are a dozen different methods for going about our task, so you shouldn’t assume the one I’m about to show you is the only or the best. It’s just one way to do it. And doing it this way requires a couple additions to your Python installation, which might seem a little daunting but should be doable unless IT has your computer on double secret probation.

A module is a collection of functions, definitions and statements contained in a separate file that you can import into your script. Examples native to Python used in our earlier scripts included “re”, “os” and “string.”

Out there on the Web, kind and ambitious programmers are constantly drafting, updating and publishing new modules to boil down complicated tasks into simpler forms. If it wasn’t for these people, praise be upon them, I probably wouldn’t have a job.

If you want to take advantage of their contributions, you need to plug their creations into your local Python installation. It’s usually not that hard, even on Windows!

To accomplish today’s task, we’re going to rely on two third-party modules. The first is mechanize, a Python translation of the popular Perl module for calling up and walking through Web pages. The second is Beautiful Soup, a superlatively elegant means for parsing HTML and XML documents. Working hand-in-hand, they can accomplish most simple web scrapes.

If you’re working in Linux or Mac OS X, this is going to be a piece of cake. All you need is to use Python’s auto-installer Easy Install to issue the following commands:

sudo easy_install mechanize
sudo easy_install BeautifulSoup

And now you can check if the modules are available for use by cracking open your python interpreter…

python

…and attempting to import the new modules…

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

If the interpreter accepts the commands and kicks down the next line without an error, you know you’re okay. If it throws an error, you know something is off.

I don’t have a lot of Python experience working in Windows, but the method for adding modules that I’ve had success with is simply downloading the .py files to my desktop and dumping them in the “lib” folder of my Python installation. If, like me, you use Activestate’s ActivePython distribution for Windows, it should be easily found at C:/Python25/lib/. And when you browse around the directory, you should already see os.py, re.py and other modules we’re already familiar with. So just visit the mechanize and Beautiful Soup homepages and retrieve the latest download. Dump the .py files in your lib folder and now you should be able to fire up your python interpreter just the same as above and introduce yourself to our new friends.

With that out of the way, we now have all the tools we need to grip and rip. So let’s do it!

2. Open the command line, create a working directory, move there.

We’re going to start the same way we did in the first three lessons, creating a working folder for all our files and moving in with our command line.

cd Documents/
mkdir py-scrape-and-download
cd py-scrape-and-download/

The commands should work just as easily in Mac as in Linux. If you’re working in Windows, you’ll be on the “C:/” file structure, rather than the Unix-style structure above. So you might “mkdir” a new working directory in your “C:/TEMP” folder or wherever else you’d like to work. Or just make a folder wherever through Windows Explorer and “cd” there after the fact through the command line.

3. Create our python script in the text editor of your choice.

vim py-scrape-and-download.py

The line above, which again should work for Linux or Mac, will open a new file in vim, the command-line text editor that I prefer. You can follow along, or feel free to make your own file in the application you prefer. If you’re a newbie Windows user, Notepad should work great.

If you’re following along in vim, you’ll need to enter “insert mode” so you can start entering text. Do that by hitting:

i

4. Write the code!

#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
 
mech = Browser()
 
url = "http://www.palewire.com/scrape/albums/2007.html"
page = mech.open(url)
 
html = page.read()
soup = BeautifulSoup(html)
 
print soup.prettify()

Our first snippet of code, seen above, shows a basic introduction to each of our new modules.

After they’ve been imported in lines two and three, we put mechanize’s browser to use right away, storing it in a variable I’ve decided to call mech, but which you could call anything you wanted (ex. browser, br, ie, whatever). We then use its open() method to grab the location of our first scrape target, my favorite albums of 2007, and store that in another variable we’ll call page.

That’s enough to go out on the web and grab the page. Now we need to tell Python what to do with it. The page’s read() method will return all of the HTML in the page, which we store, simply, in a variable called html and then pass to the BeautifulSoup constructor so it can be prepared for processing.

The reason we need to pass the page to Beautiful Soup is that there is a ton of HTML code in the page we don’t want. Our ultimate goal isn’t to print out the complete page source. We don’t want all the junky td and img and body tags. We want to free the data from the HTML by printing it out in a machine readable format we can repurpose for our own needs. In the next step we’ll ask Beautiful Soup to step through the code and pull out only the good parts, but here in the first iteration we’ll pause with just printing out the complete page code using a fun Beautiful Soup method called prettify(). It will spit out the HTML in a well-formed format. To take a look, save and quit out of your script (ESC, SHIFT+ZZ in vim) and fire it up from the command-line…

python py-scrape-and-download.py

And you should see something like….

<html>
 <head>
  <title>
   According to Ben...
  </title>
 </head>
 <body>
  <h2>
   The 10 Greatest Albums in the History of 2007
  </h2>
  <table padding="1" width="60%" border="1" style="text-align:center;">
   <tr style="font-weight:bold">
    <td>
     Rank
    </td>
    <td>
     Artist
    </td>
...

…which means that you’ve successfully retrieved and printed out our first target. Now let’s move on to scraping the data out from the HTML.

#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
 
mech = Browser()
 
url = "http://www.palewire.com/scrape/albums/2007.html"
page = mech.open(url)
 
html = page.read()
soup = BeautifulSoup(html)
 
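# Grab the one table on the page that has a border of 1
table = soup.find("table", border=1)
 
# Loop through every row except the header
for row in table.findAll('tr')[1:]:
    col = row.findAll('td')
 
    # Pull the value we want out of each cell
    rank = col[0].string
    artist = col[1].string
    album = col[2].string
    cover_link = col[3].img['src']
 
    # Print the values as one pipe-delimited line
    record = (rank, artist, album, cover_link)
    print "|".join(record)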

The second version of our script, seen above, removes the prettify() command that concluded version one and replaces it with the Beautiful Soup code necessary to parse the rankings from the page.

When you’re scraping a real target out there on the wild Web, the mechanize part of the script is likely to remain pretty much the same, but the Beautiful Soup portion that pulls the data from the page is going to have to change each time, tailored to work with however your target HTML is structured.

So your job as the scraper is to inspect your target table and figure out how you can get Beautiful Soup to home in on the elements you want to harvest. I like to do this using the Firefox plugin Firebug, which allows you to right-click and, by choosing the “Inspect Element” option, have the browser pull up and highlight the HTML underlying any portion of the page. But all that’s really necessary is that you take a look at the page’s source code.

Since most HTML pages you’ll be targeting, including my sample site, will include more than one set of table tags, you often have to find something unique about the table you’re after. This is necessary so that Beautiful Soup knows how to zoom in on that section of the code you’re after and ignore all the flotsam around it.

If you look closely at this particular page, you’ll note that while both table tags have the same width value, an easy way to distinguish them is that they have different border values…

<table width="60%" border="1" style="text-align: center;" padding="1">
...
<table width="60%" border="0">

…and the one we want to harvest has a border value of one. That’s why the first Beautiful Soup command seen in the snippet above uses the find() method to capture the table with that characteristic.

table = soup.find("table", border=1)
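If you want to play with the idea of picking a table by its attributes without hitting the web, here is a minimal sketch. It is not Beautiful Soup: it swaps in Python 3’s standard-library ElementTree, and the HTML is a hand-written, well-formed stand-in for my page rather than the real source.

```python
import xml.etree.ElementTree as ET

# A hand-written, well-formed stand-in for the page (an assumption,
# not the real HTML). ElementTree needs well-formed markup.
html = (
    '<body>'
    '<table width="60%" border="0"><tr><td>navigation</td></tr></table>'
    '<table width="60%" border="1"><tr><td>Rank</td></tr></table>'
    '</body>'
)

root = ET.fromstring(html)

# Asking for every table returns both, so the tag name alone is not enough.
tables = root.findall(".//table")

# The border attribute is what sets apart the table we want.
table = root.find(".//table[@border='1']")

print(len(tables))
print(table.find("tr/td").text)
```

The principle is the same one at work in the find() call above: look for whatever is unique about your target and filter on that.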

Once that’s been accomplished, the new table variable is immediately put to use in a loop that is designed to step through each row and pull out the data we want.

for row in table.findAll('tr')[1:]:

It uses Beautiful Soup’s findAll() method to put all of the tr tags (the HTML equivalent of rows) into a list. The [1:] slice at the end instructs the loop to skip the first item, which, from looking at the page, we can tell is an unneeded header line.
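If the [1:] business is new to you, here is a tiny sketch of what it does on its own, using a plain Python list as a stand-in for what findAll() hands back. The row labels are made up for illustration, and the snippet is written for Python 3.

```python
# A stand-in for the list of tr tags that findAll('tr') would return.
# These strings are invented for illustration only.
rows = ["header row", "album row 10", "album row 9", "album row 8"]

# [1:] means "everything from index 1 to the end", so index 0 is skipped.
data_rows = rows[1:]

for row in data_rows:
    print(row)
```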

Then, after the loop is set up on the tr tags, we set up another list that will grab all of the td tags (the HTML equivalent of a column) from each row.

    col = row.findAll('td')

Now pulling out the data is simply a matter of figuring out which order we can expect the data to appear in each row and pulling the corresponding values from the list. Since we expect rank, artist, album and cover to appear in each row from left to right, the first element of the col variable (col[0]) can always be expected to be the rank and the last element (col[3]) can always be expected to be the cover. So we create a new variable to hold each, with some Beautiful Soup-specific attributes tacked on the end to grab only the bits we want.

    rank = col[0].string
    artist = col[1].string
    album = col[2].string
    cover_link = col[3].img['src']

The “.string” attribute will return the text within the target tag (similar to JavaScript’s innerHTML property) when that text is the only thing inside the tag. But in the case of something like the cover art, which is an image tag, not a string value, we can step down to the next tag nested within the td column (img) and access its source attribute by tacking on ['src']. This would work just the same for a hyperlink (.a['href']) or any other attribute. And if you’ve got multiple layers of nested tags, you can simply step down through them with a linked set of attributes. For example, “b.a.string” would retrieve the string within a link within a bold tag. The Beautiful Soup documentation covers these and other tricks in detail.
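The stepping-down idea is easier to see on a single row. What follows is a rough sketch, again using Python 3’s standard-library ElementTree as a stand-in rather than Beautiful Soup itself. The row is hand-written, and the bold tag, the link and the example.com address are assumptions added for illustration; they are not in my real page.

```python
import xml.etree.ElementTree as ET

# One hand-written row, loosely modeled on the sample table.
# The <b><a> nesting and the example.com link are made up.
row = ET.fromstring(
    '<tr>'
    '<td>1</td>'
    '<td><b><a href="http://example.com/caribou">Caribou</a></b></td>'
    '<td>Andorra</td>'
    '<td><img src="covers/andorra.jpg" /></td>'
    '</tr>'
)
col = row.findall("td")

# Text sitting directly inside the td, like .string
rank = col[0].text

# Step down through b, then a, like b.a.string
artist = col[1].find("b/a").text

# Attributes on nested tags, like .a['href'] and .img['src']
artist_link = col[1].find("b/a").get("href")
cover_link = col[3].find("img").get("src")

print(rank, artist, artist_link, cover_link)
```

The spelling differs from Beautiful Soup’s, but the pattern is identical: walk down to the tag you want, then ask for either its text or one of its attributes.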

After we’ve wrangled out the data we want from the HTML, the only challenge remaining is to print it out. I accomplish that above by loading the column values into a tuple called record and printing them out, separated by pipes, using the string .join() method.

    record = (rank, artist, album, cover_link)
    print "|".join(record)
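Two things about join() are worth knowing before you lean on it: it only accepts strings, and it won’t protect you if a value happens to contain a pipe. Here is a small sketch of both, written for Python 3 with only the standard library. The integer rank and the second row are made up to force the issue; they are not from my page.

```python
import csv
import io

# One record from the sample output, with the rank stored as an
# integer on purpose (an assumption, to show the string requirement).
record = (1, "Caribou", "Andorra",
          "http://www.palewire.com/scrape/albums/covers/andorra.jpg")

# join() only accepts strings, so convert each value first.
line = "|".join(str(value) for value in record)
print(line)

# The csv module writes the same delimiter and also quotes any value
# that happens to contain a pipe.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
writer.writerow(record)
writer.writerow((7, "A|B", "Example", "cover.jpg"))  # invented row to show quoting
print(buffer.getvalue())
```

For a quick job like this one, plain join() is fine. The csv route starts to pay off once your values get messy.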

Phew. That’s a lot of explaining. I hope it made sense. I’m happy to clarify or elaborate on any of it. But if you save the snippet above and run it, you should get a simple printout of the data that looks something like this:

10|LCD Soundsystem|Sound of Silver|http://www.palewire.com/scrape/albums/covers/sound%20of%20silver.jpg
9|Ulrich Schnauss|Goodbye|http://www.palewire.com/scrape/albums/covers/goodbye.jpg
8|The Clientele|God Save The Clientele|http://www.palewire.com/scrape/albums/covers/god%20save%20the%20clientele.jpg
7|The Modernist|Collectors Series Pt. 1: Popular Songs|http://www.palewire.com/scrape/albums/covers/collectors%20series.jpg
6|Bebel Gilberto|Momento|http://www.palewire.com/scrape/albums/covers/memento.jpg
5|Various Artists|Jay Deelicious: 1995-1998|http://www.palewire.com/scrape/albums/covers/jaydeelicious.jpg
4|Lindstrom and Prins Thomas|BBC Essential Mix|http://www.palewire.com/scrape/albums/covers/lindstrom%20prins%20thomas.jpg
3|Go Home Productions|This Was Pop|http://www.palewire.com/scrape/albums/covers/this%20was%20pop.jpg
2|Apparat|Walls|http://www.palewire.com/scrape/albums/covers/walls.jpg
1|Caribou|Andorra|http://www.palewire.com/scrape/albums/covers/andorra.jpg

See the difference?! Pretty cool, right?

But, really, you could have done that with copy and paste. Or, if you’re slick, maybe even Excel’s Web Query.

As with our previous recipes