West Virginia Can’t Spell Marijuana: 3 Open Source Tools To Search Government Data


At a SXSW panel, the Sunlight Foundation and the University of Chicago’s Center for Data Science and Public Policy showcased three tools for sifting through open government data: Open States, which tracks bills and voting records for legislators in all 50 states; the Legislative Influence Detector (LID), which checks how much of a bill has been plagiarized from another source; and DSSG’s Earmarks Tracker, a set of scripts that search through congressional texts.



Kate Duffy, the Labs director at the Sunlight Foundation, presented Open States, the only one of the three tools with an interactive GUI. With Open States, users can query a topic like “Syrian refugees” and see every bill concerning it, with data current as of the previous day.

Open States gathers public information about committees, the elected officials serving on them, bills they've sponsored and the text in those bills. By taking this data and building open source tools, websites and APIs that manage it, Sunlight Foundation hopes to unlock the data so that anyone can use it, look at it and make it their own.

“You elect someone because you think they’ll represent your interests. You should be able to see if they’re continuing to further your causes, and you should be able to see if they’re working for you or someone else, like a corporate lobbyist,” said Duffy. “You should be able to hold them accountable, whether it be with a tweet, a public records request, voting them out or an indictment.”

LID, on the other hand, presented by Matthew Burgess of DSSG, compares the text of a bill with bills passed in other states and with publicly available templates that lobbyists post online. Using Lucene scores and the Smith-Waterman local-alignment algorithm, LID finds the 100 documents most similar to the one being searched and then uses a word-based point system to compare the texts. For example, even though the West Virginia medical cannabis bills spelled marijuana as “marihuana,” LID was still able to recognize that the text around the word was identical to bills from other states.
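To make the word-based point system concrete, here is a minimal sketch of Smith-Waterman local alignment over words rather than characters. It is not LID's actual code: the function name `smith_waterman`, the scoring values (2 points for a match, -1 for a mismatch or gap) and the two sample sentences are all assumptions chosen for illustration, and the Lucene step that first narrows the search to the most similar documents is left out.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment of two word lists.

    Returns (score, aligned_a, aligned_b). The scoring values are
    illustrative assumptions, not the ones LID uses.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; a cell never drops below zero, which is
    # what makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + step,  # align the two words
                score[i - 1][j] + gap,       # skip a word in a
                score[i][j - 1] + gap,       # skip a word in b
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, out_a[::-1], out_b[::-1]


# Hypothetical bill sentences that differ only in one spelling.
wv_bill = "the use of medical marihuana by a qualifying patient is permitted".split()
other_bill = "the use of medical marijuana by a qualifying patient is permitted".split()

points, aligned_wv, aligned_other = smith_waterman(wv_bill, other_bill)
print(points)
print(" ".join(aligned_wv))
print(" ".join(aligned_other))
```

Because a single mismatched word costs only one point while each matching word earns two, the alignment runs straight through the variant spelling instead of breaking into two shorter matches, which is the behavior described in the West Virginia example.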



“LID decreases the amount of time it takes for people who are concerned with how lobbyists are influencing legislation, and drastically reduced the amount of time [it took to analyze the data],” Burgess said. “It used to take watchdog groups weeks to analyze [manually]. [With LID], they can now do a much faster analysis in minutes.”

The tool takes advantage of the fact that legislators often copy and paste the text of bills from other sources. Because that text is public, the influences that shaped a bill can be traced. LID doesn't have an online interactive tool, but its dataset can be downloaded in a variety of formats. The tool’s code can be downloaded here.

In the same vein, earmarks tend to get buried in thousands of pages of bill text. DSSG’s Earmarks Tracker, presented by DSSG’s Matthew Heston, doesn’t have an interactive tool online either, but its code can be downloaded here while the datasets can be downloaded here. Unfortunately, Congress declared a moratorium on earmarks in 2011. This means that information about earmarks, and about the legislators using them to divert funds to their districts, has gone dark.
