Saturday, January 31, 2009

Represent (The Open Blog at the New York Times)

Represent (from Open: All the news that's fit to printf() )

More than eight million people live in the five boroughs of New York City. They have more than 150 elected legislative representatives, from the local level to Congress. Keeping tabs on the people who represent you is a difficult task.

That’s the idea behind Represent, an interactive feature we launched in beta last week. Using your address as a starting point, Represent figures out which political districts you live in and who represents you at different levels of government. It draws maps that show how where you live fits into the political geography of the city. And using information collected from around the Web, it presents a customized activity stream that tracks what the people who represent you are doing.


Hat tip to Paul Ramsey (The Clever Elephant) for pointing to this.

I've been teaching myself the use of Geographic Information Systems (GIS), specifically the use of spatially enabled databases. The general idea is to take a dataset that includes locations (such as addresses), then display and analyze the data spatially. The display part is actually pretty easy. Starting from a list of addresses, there are many tools to get a longitude and latitude for each one and then create an ESRI shapefile. I've been using uDig to view shapefiles.

But the real goal is to do analysis. You could link a GIS viewer with a database to view items based on some parameter. uDig, QGIS, and ArcView all have this capability. But the real way to do this is to connect the viewer to a spatial database so you can use all of your standard reporting tools and produce reports based on any set of spatial criteria (such as buffers or overlays). Commercially, the main provider of GIS tools is ESRI with its ArcGIS family of systems. And if you want to do more than just look at the data, there are very few database management systems that can handle spatial data: Oracle, Informix, DB2, SQL Server (I think the very latest version has spatial capability), and PostGIS. The commercial systems don't come fully spatially enabled out of the box; you have to buy the more expensive versions. PostGIS is the exception. It is freely distributable, which is important if you are just exploring and don't have money, or if you are looking at applications for someone who does not have money.
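To make the "buffer" idea concrete, here is a minimal sketch in plain Python of the kind of question a spatial database answers: which places fall within a given distance of a point? The function names, the sample points, and the use of the haversine formula are my own assumptions for illustration; a spatial database such as PostGIS would do this with its own spatial functions and indexes rather than a loop like this.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_buffer(center, radius_km, places):
    """Return the names of places no farther than radius_km from center.

    center is a (lon, lat) pair; places maps a name to a (lon, lat) pair.
    """
    lon0, lat0 = center
    return [name for name, (lon, lat) in places.items()
            if haversine_km(lon0, lat0, lon, lat) <= radius_km]


if __name__ == "__main__":
    # Made-up points, not real addresses.
    places = {"near": (0.0, 0.01), "far": (0.0, 1.0)}
    print(within_buffer((0.0, 0.0), 5.0, places))
```

This checks every point in turn, which is fine for a few thousand rows; the point of a spatial database is that it can answer the same question from an index.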

Anyway, it turns out the technology team at the New York Times has been doing a variety of projects with the website, using technology to provide many different visualizations of the information (not just stories) in the New York Times. Some of the good ones are the Faces of the Dead project, which provides stories about individual servicemen and women killed in Iraq, and the Election Results Map, which graphically displays the election results in a variety of different ways and tells a story much differently than you can using only words.



The Represent project is another. It takes an address, then determines what districts you live in (local, state, federal) and who represents you. Then it searches the New York Times and a variety of other data sources to let you know what your representatives have been up to, how they have voted recently, and what is coming up (in their public schedule). The choice of tools was driven by the need for an open source (freely available) spatial database, which means PostGIS. (They use the GeoDjango web framework to build the website; it includes extensions to the Django web framework that allow the Object Relational Mapper (ORM) to work with spatial data fields.)
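The core spatial question here, "which districts contain this point?", can be sketched in a few lines of plain Python. This is only an illustration of the idea using the standard ray-casting test, not how Represent is actually built (a PostGIS-backed site would leave this to the database's spatial predicates). The district names and square boundaries are made up.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is the point (lon, lat) inside the polygon?

    polygon is a list of (lon, lat) vertices in order; the last vertex
    is implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude where the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


def districts_for(lon, lat, districts):
    """Return the names of all districts whose boundary contains the point."""
    return [name for name, boundary in districts.items()
            if point_in_polygon(lon, lat, boundary)]


if __name__ == "__main__":
    # Hypothetical districts drawn as simple squares.
    districts = {
        "District A": [(0, 0), (2, 0), (2, 2), (0, 2)],
        "District B": [(2, 0), (4, 0), (4, 2), (2, 2)],
        "Citywide": [(0, 0), (4, 0), (4, 2), (0, 2)],
    }
    print(districts_for(1, 1, districts))
```

Because a point can sit in several overlapping layers at once (council district, state district, congressional district), the lookup returns a list rather than a single answer.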

Why is this cool? Well, it is what some call a mash-up: a way to re-purpose and reuse data, in this case news articles. And in the case of the New York Times, it creates something that cannot exist anywhere else, to deliver timely, local information. Just what a newspaper is supposed to do.

Friday, January 23, 2009

PSO: American composers and a new day

(also can be found at http://pittsburghsymphony.blogs.com/outside/)

This weekend's concert featured Gabriela Montero, fresh off a little Washington DC performance of Air and Simple Gifts in a small venue of 2 million people and a worldwide TV audience. It promised to be something interesting.

Bookended by Barber and Mendelssohn, Ms. Montero played Rhapsody in Blue by Gershwin and the live performance premiere of Air and Simple Gifts by John Williams. Rhapsody in Blue is probably best known for being played in commercials; Air and Simple Gifts, for its place in the recent 2009 U.S. Presidential Inauguration. For both, Ms. Montero made hearing these pieces live a treat. If your concept of Rhapsody in Blue is the driving theme of a TV commercial, the PSO performance presented many sides to the theme: from the hard driving we're used to hearing, to a light dancing on piano, to strings and horns playing off each other, and winds taking the theme exposed on their own. There is even room for a banjo. And listening to a pianist improvise her parts is a treat, especially when someone like Ms. Montero, who is a skilled improviser, is the one on stage.

Air and Simple Gifts is also transformed in a live performance. While on TV during the inauguration it sounded like a straight copy of Copland's Appalachian Spring, the live performance reveals many more layers that seemed like mere rumbling on the televised version. You can hear the counterpoint in the strings as clarinet and piano take on the main Simple Gifts theme. The piano part becomes an integral weave throughout the piece instead of a murmur under the strings and wind.

I was amused at the program's description of George Gershwin in relation to his contemporary classical composers. It describes Gershwin as a jazz composer dabbling in classical composition. Certainly he is better known for his jazz and folk-style music, along with his many musicals. But this description also fits another composer, who is better known for writing music that is played in a popular culture setting, and whose tunes also embed themselves in popular consciousness, setting the theme for a work, or even an event: John Williams, composer for film and Presidential inaugurals. The comparison grows stronger when one looks at the criticisms of the piece in various newspapers (not about the playing, but the piece itself). Like Gershwin, John Williams is best known for writing music for a popular setting, in his case movie music. His music sticks and conjures up strong images in your mind. What images do I hear in Air and Simple Gifts? There is the quiet beginning of a new day, then waking up for the work to do, with the Appalachian Spring theme in the clarinet. And the rest of the piece gives that sense that there is much work to be done by all, good work, but lots of it. And for this piece, at this time, that sounds about right.

A concert with modern American pieces just fits this week, with an inauguration just behind us. And it seemed to show the contrast between a brasher, more cocky America of the not-too-distant past, and one that has many challenges ahead in a new day. It made for a worthy performance. Of course, there was also the timely improvisational encore at the end of the first half, when Ms. Montero took requests for a theme to play off of. What followed was many spirited variations on 'Here we go Steelers, here we go!' Very timely indeed.

Thursday, January 22, 2009

Massaging text with Vim

I'm teaching a database design class this semester. I could do most of it using a database I already know (I have been doing the samples so far using SQLite, and I'm sure just about all the students are using MS Access), but I'm also taking the opportunity to learn PostgreSQL. (I'm using PostgreSQL because I have a project that will require a spatial database at some point, and the only real options are Oracle and PostGIS, an extension of PostgreSQL. Since funding is at a premium, PostGIS it is.) PostgreSQL is a bit of a step up for me, because it is a server-based database and requires administration. So there is stuff to learn.

The other thing I need to learn is how to get data into it. MS Access, SQLite and Derby all have mechanisms for pulling in your average CSV file. But the general way that databases are populated (and backups made) is via SQL files. Data does not come in SQL files, though; it comes as delimited text. So the way to do this is to take a text file, then build SQL around the data.

The procedure is straightforward. After the tables are created in the database (using CREATE TABLE statements) you just read in a text file of SQL statements (usually with a .sql extension just so you can recognize it when you see it again). And the text file is not hard either. Each line looks like
INSERT INTO tablename(fieldname1, fieldname2) VALUES ('value1', 'value2');

Now, it is not too hard to convert the delimited text into the 'value1', 'value2' format using a bunch of search and replace operations (and accounting for things like escaped "'" and the like). The trick is getting the rest of the text around it. It could be as simple as a cut and paste. Multiple pastes. But some of these files had hundreds of thousands of lines. That is a lot of [CTRL]-[v].

Like a lot of people whose fortune it is to program, I've used a number of text editors to write code. Microsoft Visual Studio, Eclipse, and jEdit are my current set. But one editor that I've always kept around is Vim. And it is for things like this that I do so.

So, I have something I need to do a few hundred thousand times. One way to do this would be to write a script. Not too hard: read a line of text, use it to build a new line of text with all the SQL in the proper place, and write it to a file. But there is a bit of overhead in having the interpreter running, managing the file handles, etc.
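For comparison, the script route might look something like the sketch below. It assumes comma-delimited input and treats every field as text; the table and field names are the same placeholders used above, and the function names are made up for the example. Embedded single quotes are doubled, which is the standard SQL way of escaping them inside a string literal.

```python
import csv
import io


def quote_sql(value):
    """Wrap a value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def rows_to_inserts(lines, table, fields):
    """Yield one INSERT statement per delimited input row.

    lines is any iterable of text lines (an open file works);
    table and fields are placeholder names supplied by the caller.
    """
    columns = ", ".join(fields)
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        values = ", ".join(quote_sql(v) for v in row)
        yield f"INSERT INTO {table}({columns}) VALUES ({values});"


if __name__ == "__main__":
    # Stand-in for a real data file; the second row has a quote and a comma.
    sample = io.StringIO("value1,value2\nO'Hara,\"Pittsburgh, PA\"\n")
    for statement in rows_to_inserts(sample, "tablename",
                                     ["fieldname1", "fieldname2"]):
        print(statement)
```

In real use the io.StringIO stand-in would be replaced with an open file, and the printed statements redirected into a .sql file.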

But another way to do this is to just write a macro or script inside the editor. This would not be fun in Visual Basic. Vim, like MS Word, has a macro recorder. Basically, it records a bunch of keystrokes and allows you to replay them however many times you need to. So I start by doing one line by hand. Then I put the INSERT INTO tablename(fieldname1, fieldname2) VALUES ( text on its own line at the beginning of the file (before the first line of data). With the cursor on that line, "ayy yanks it into register 'a' within Vim (not the system clipboard; I tried that, but it is much slower). Now the magic starts.

q1 (starts recording in register '1')
J (joins the following line with the current line)
A (moves cursor to the end of the line and switches to insert text mode)
); (finishes off the line of SQL)
[ESC] (ends insert text mode)
j (moves down one line)
^ (goes to beginning of current line)
"aP (inserts contents of register 'a' as a new line above the current line)
q (stops recording the macro)

Now, to use this, I test this a couple of times to make sure it does what I want.


@1 (replay macro '1')

Looks good.

100@1 (replay macro '1' 100 times)

Or

500000@1 (replay macro '1' 500000 times)

and go to bed.

Saturday, January 17, 2009

PSO: John Adams looking through history

I have not been to a concert in almost a month. It almost seems like going through withdrawal. And soon some reminders started coming. My friend Mimi let me know that she was playing in the pre-concert program. And my reply was, of course, I'll be there.

Erhu at Heinz Hall

AppalAsia is one of those experiments that happens when musicians from different backgrounds decide to try something different, in this case traditional Chinese instruments along with Appalachian folk instruments. Mimi played the erhu (a Chinese instrument with a bow drawn across strings), Jeff was on the lap dulcimer and Sue played banjo. What came out was something like jazz, where the players listen to each other and adapt, each hearing the others play, reacting to the quality of the music each was producing and adapting their own playing to it. It would be unfair to call the music Chinese or Appalachian; it was a hybrid of the two, done in the traditions of jazz.

AppalAsia playing at the PSO

Prior to performing each piece, composer/conductor John Adams introduced it. There were two pieces: selections from the opera Nixon in China, and the Dr. Atomic Symphony. What does it mean for a composer to introduce his own work? We get to hear why the piece was written, what he was thinking when he wrote it, and the flow of the piece. So, for Nixon in China we know the attitudes of the principal parts, the setting of the Nixons arriving and being greeted, how Nixon is not paying attention while in the greeting line, of Mao reminiscing, of Chiang's imperiousness, and Chou's introspection.

Before Dr. Atomic, John Adams's introduction gave us something more: he let us know what voices various parts of the orchestra were taking on. The aggressive General, the Tewa Indian maid, the panic of the storm, and Dr. Oppenheimer's anguish over what is being unleashed on the world.

Why the story? Most of the pieces in the classical canon are not so tied to recent history, even the recent history of when they were written. John Adams's work is, not just in these pieces, but in works on topics such as terrorist attacks on cruise ships and the 2001 World Trade Center attacks. Adams directly referred to the potential reaction to these sometimes raw topics in his introduction to "Nixon in China", describing it as an "opera for Republicans and Communists." But he also said during his introduction to the Dr. Atomic Symphony that he writes about the history he has lived, and these were the big events: the cold war with its specter of atomic death, the rise of terrorism, the interconnectedness of globalization, the rise of America's impact on the world, and the times of transition we are in now as others have grown in the world America had nurtured.

I wonder if it is an American trait to do this sort of introspection. In Nixon in China the third act has the Nixons talking about the most mundane aspects of being witness to history, while Mao and Chiang focus on how they overcame hardship in their struggle in the birth of Communist China. And our artists write works such as this, casting the mighty and powerful as dealing with the mundane, a contrast to the recent grandeur of the presentation of the last Summer Olympics in Beijing. (Not that there are not over-the-top representations of America, but the American desire to look at the greatest of figures as mundane and earthy seems different.)

My wife has remarked that living life together is like living a history book, touched by war, natural disasters, the thrill of a political campaign. And it is history from a mundane level, not the glorious stories of heroes (although there are many such), but people, responding in the positions they are in, in the good and the bad. Like John Adams, in my own little way, I appreciate the opportunities to remember and record the history around us. And maybe tell stories in years to come.

Friday, January 09, 2009

Data Analysts Captivated by R's power: New York Times


This article is about the open source statistical environment, R (http://r-project.org).  It is an implementation of the S language created by John Chambers for the purpose of data analysis.  There is another implementation, the commercial package S-Plus, generally considered to be number three among statistics packages (after SAS and SPSS).  

The fact that it is open source means one thing: people can inspect the source code, determine its correctness, and add to it to improve the implementation.  In practice, R has become a playground for statistics researchers.  Because they can inspect the inner workings of R, they know every little step being done, and they can make it do exactly what they want.  Generally, people won't change what is already there without a very good reason that is justified to the maintainers.  And numerous test suites exist that ensure the integrity of the code (at least the core parts).  Many packages exist for R, in many cases written by academics who release the package to go along with papers they have published and books they have written.  These are also heavily tested, as the authors stake their academic reputation on these packages (and the distribution of the packages makes it much easier to sell their books and encourages people to read and cite their journal articles, because it is easier to use the methods when there is already software readily accessible).

One thing that open source software also attracts is critics.  Especially when there is a commercial competitor.  The article quotes people from SAS Institute, maker of the top statistical package around.  One quote is:

SAS says it noticed R's rising popularity at universities, despite educational discounts on its own software, but it dismisses the technology as being of interest to a limited set of people working on very hard tasks. 

"I think it addresses a niche market for high-end data analysts that want free, readily available code," said Anne H. Milley, director of technology product marketing at SAS.   She adds, "We have customers who build engines for aircraft.  I am happy they are not using freeware when I get on a jet."

In this case, both paragraphs misstate the issue.  What open source software attracts are people who (1) need to know exactly what the software is doing and (2) have needs that were not apparent to the writers of the software.

In Ms. Milley's comment, the question is: do the people who build engines for aircraft know more about what the software should be doing, or does SAS know more about what the software should be doing?  If the engine designers know what the software should be doing, maybe they should be the ones writing it (and testing and validating and verifying).  The SAS code is a black box, offered to people who are very smart and do not need or want a black box.  In particular, the acceptability of numeric code should not depend on how much you paid for it, but on the testing and validation done.  And this is done through inspection of the code or a test suite.  R's code and its packages can be inspected.  SAS's code cannot.   And the test suite may very well be freely available too, because the researchers who initially developed the methodology had to prove its correctness in the open to the academic community who peer reviewed the work when it was first done.  And nowadays, that work is first done in R.

The second group are people who know more about the subject matter than the commercial software builders.  First a digression.  Most of the people who have written code for R are academics and researchers in statistics (academic or corporate).  One of the obvious contrasts is S-Plus, the commercial implementation of S.  Most of the programmers who write S-Plus have backgrounds in computer programming.  So the result is that S-Plus is generally regarded as faster and makes better use of resources.  But methodology gets developed in R first and the methods are more correct, because the subject matter experts actually wrote the code.  And this is true across the board.  In many niche areas, the methodology is written in R, because the subject matter is too small for the mainstream statistical programmers at SAS to put their time into.  And this is in addition to the "high-end" uses that SAS has.  The people at SAS don't have time to learn the nuances of every use that requires a statistical environment, or the subtleties of every application.  They have to program to the mean, and to people who only want a black box that spits back numbers.

What is the biggest obstacle to open source software acceptance in numeric uses?  The requirement for certifications.  SAS has the money to certify their product for use in regulatory purposes.  What remains is the need to certify the environment around the statistical package, which companies that are involved in regulated activities must then do.  And some of the work for those who use R has been done as well in the document R: Regulatory Compliance and Validation Issues - A Guidance Document for the Use of R in Regulated Clinical Trial Environments http://www.r-project.org/doc/R-FDA.pdf.

My own involvement?  I once wrote some code in an R project, R-GLPK.  It provided documentation and examples for the use of a linear programming package from within R.  Why?  Because it was conceivable that people performing data analysis would, in the midst of the analysis, use linear programming to produce intermediate results.  Or that an analysis may use intermediate results as inputs into a linear program.  Or that R may just happen to be the platform other work was done on, and now someone would need to solve a linear program.  Or any of a multitude of things that you don't go to SAS for.  So an open source package that connects to something that does something well just made things that much easier.

Thursday, January 01, 2009

Latodami Nature Center, North Park, PA January 1, 2009

2 Downy Woodpecker
1 Hairy Woodpecker
3 American Crow
2 Black-capped Chickadee
2 Tufted Titmouse
3 White-breasted Nuthatch
2 Song Sparrow
1 White-crowned Sparrow
3 American Goldfinch