
Monday, October 15, 2007

Congress Watcher

I often go to THOMAS to see what Congress is up to. I typically open the "Yesterday in Congress" page and read through the bill summaries.

I don't like the interface.

So I made a new one.

It consists of a bit of Python code that parses the Yesterday in Congress page. The output is then served from my server using PHP. If anyone thinks this is a sloppy kludge, I would be interested in hearing about a better way to do it.

The php code (congresswatcher.php)

And the python code (congresswatcher.py):

#!/usr/bin/python

#####
# Usage: ./congress_watcher.py <number of days ago> > congress_watcher.html
# Use zero days to get yesterday's Congress
#####

import urllib2   # for fetching and reading the web page
import re        # for regular expressions
import datetime  # for date arithmetic
import sys       # for the command-line arguments

daysago = sys.argv[1]

# the report date and the day before it, each formatted as YYYYMMDD
timez = datetime.datetime.today() - datetime.timedelta(int(daysago))
yestz = timez - datetime.timedelta(1)

todaydate = timez.strftime('%Y%m%d')
yestdate = yestz.strftime('%Y%m%d')

base_website = 'http://thomas.loc.gov/'
base_website += 'cgi-bin/bdquery?&Db=d110&querybd=@OR(@FIELD(FLD961+'

base_website += yestdate
base_website += ')+@FIELD(FLD010+'
base_website += yestdate
base_website += ')+@FIELD(FLD961+'

base_website += todaydate
base_website += ')+@FIELD(FLD010+'
base_website += todaydate
base_website += '))'

try:
    website = urllib2.urlopen(base_website)
except urllib2.HTTPError, e:
    print 'Cannot retrieve URL: HTTP Error Code', e.code
    sys.exit(1)  # nothing to parse without the page
except urllib2.URLError, e:
    print 'Cannot retrieve URL: ' + str(e.reason)
    sys.exit(1)

website_html = website.read()

#output the header stuff for the html
print '<html>'

print '<body>'
print '<h1>Congress Watcher</h1><br><br>'

#check if they actually had congress
lazy = re.findall(r'<hr><B>No formal action was', website_html)

if len(lazy) > 0:
    print 'No Congress yesterday'
    print '</body>'
    print '</html>'
    sys.exit()  # a bare "exit" does nothing; sys.exit() actually stops the script

dataz = re.findall(r'</A>: ["A-Z a-z,-.0-9()]+', website_html)

numz = re.findall(r'>[A-Za-z. 0-9]+</A>:', website_html)  # A-z would also match punctuation between 'Z' and 'a'

basebill = 'http://thomas.loc.gov/cgi-bin/query/z?c110:'

print '<table>'

for ii in range(len(dataz)):
    numz[ii] = numz[ii][1:]   # remove the '>'
    numz[ii] = numz[ii][:-6]  # remove the ' </A>: '
    dataz[ii] = dataz[ii][7:]

    billlink = '<a href="' + basebill + numz[ii] + ':">'
    billlink += numz[ii] + '</a>'

    print '<tr><td>' + billlink + '</td>'
    print '<td>' + dataz[ii] + '</td>' + '</tr>'
    print '<tr><td><br></td></tr>'
    print

print '</table>'
print '</body>'
print '</html>'
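The parsing step can be tried without fetching anything. Below is a minimal sketch in Python 3 (the script above is Python 2), assuming each bill appears on the page as a link followed by a colon and a title. The sample HTML is made up to match that shape and is not copied from a real page; the names `extract_bills` and `render_rows` are hypothetical. It uses capture groups instead of the fixed slices in the script, so it does not depend on exact character counts.

```python
import re

BASE_BILL = 'http://thomas.loc.gov/cgi-bin/query/z?c110:'

# Made-up sample in the shape the script expects: <A ...>NUMBER</A>: TITLE
SAMPLE_HTML = (
    '<A HREF="/cgi-bin/query/z?c110:H.R.1234:">H.R.1234</A>: '
    'To designate a sample post office.<br>\n'
    '<A HREF="/cgi-bin/query/z?c110:S.567:">S.567</A>: '
    'A bill to amend a sample act.<br>\n'
)

# Group 1 is the bill number (the link text), group 2 is the title,
# which runs up to the next tag.
BILL_PATTERN = re.compile(r'>\s*([A-Za-z.0-9 ]+?)\s*</A>:\s*([^<]+)')


def extract_bills(html):
    """Return a list of (bill_number, title) pairs found in html."""
    return [(num.strip(), title.strip())
            for num, title in BILL_PATTERN.findall(html)]


def render_rows(bills):
    """Return one HTML table row per (bill_number, title) pair."""
    rows = []
    for num, title in bills:
        link = '<a href="' + BASE_BILL + num + ':">' + num + '</a>'
        rows.append('<tr><td>' + link + '</td><td>' + title + '</td></tr>')
    return rows


if __name__ == '__main__':
    print('<table>')
    for row in render_rows(extract_bills(SAMPLE_HTML)):
        print(row)
    print('</table>')
```

Keeping extraction and rendering in separate functions means the regex can be checked against a saved copy of the page before any HTML is generated.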

2 comments:

Anonymous said...

Can you post before and after sample pages without violating anyone's copyright? In that case, screenshots then? Otherwise, this looks like a good argument for programming languages as the ultimate user interface.

In other news, mod_python would have let you do this without touching php at all. Sure, someone will make the argument 'it doesn't really matter for something so small,' but the point is webpages and applications can and ought to be built with Python.

sstc said...

I made a post with the screenshots.

I think the proper way (with the code I wrote) to do this was to use CGI, because that just redirects the stdout to the webpage if I am not mistaken.
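A minimal sketch of that CGI idea, in Python 3 and assuming the web server is configured to run the script as CGI: the server sends whatever the script writes to stdout back to the browser, as long as the output begins with a header block followed by a blank line. The function name `cgi_response` is hypothetical.

```python
import sys


def cgi_response(body_html):
    """Return the full text a CGI script would write to stdout."""
    # A CGI response is a header block, one blank line, then the body.
    return 'Content-Type: text/html\n\n' + body_html


if __name__ == '__main__':
    page = '<html><body><h1>Congress Watcher</h1></body></html>'
    sys.stdout.write(cgi_response(page))
```

Because the page is still just text on stdout, running the script from a console shows exactly what the browser would receive, headers included.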

I would have to change all of my print statements to use mod_python.

I am going to keep that in mind for the next project, or if I update this. I did like the ease in debugging because I could output to the console the html, and see what was up without a browser.

Maybe I'll clean it up, add some Google ads, submit it to Digg, and see how long my Ubuntu box lasts before it melts. (533 MHz vintage)

I'd imagine someone would be interested in reading what Congress is up to, in a headline format.