Warning: this post will either a) show that I’m obsessive/compulsive and must be stopped, b) prove that I’m a fool/moron and should be left to my own devices, or c) all of the above.
I’ve been reading David Allen’s Getting Things Done (GTD) with the hope of actually getting something done in 2009. One of the core concepts of GTD is having a trusted system in which you place data on all your projects, next actions, calendar data, tickler entries, etc. As a geek, I’d love my trusted system to be some electronic widget that can easily sync with my home and office Outlook, Google Calendars, and so on. But should that widget be something I combine with my phone—as in a Blackberry, smart phone, iPhone, or whatever—or should it be something separate as in a Pocket PC, PDA, or the like? Would a tablet PC or one of those tiny notebooks be a better choice?
Surprisingly, I’m finding that many tech-savvy GTDers are going to the low-tech moleskines, Hipster PDAs, and such. For one thing, there’s no boot time with these low-tech organizers. (Matthew Cornell has cataloged some discussion points on high-tech versus low-tech trusted systems here.) So, for the time being, I’m going low-tech.
The wife bought me one of those Franklin-Covey planners and I began loading it into the seven-ring binder when I encountered a dilemma: the paper planner includes 17 double-sided blank address pages. A little box in the upper right corner seems to shout, “write a letter in me!” But what letter do I write? If I write “A” on the first side of page 1, do I write “B” on the backside of page 1 or do I write “A” again and write “B” on the first side of page 2? How many sides should I allocate for surname letters I perceive to be more popular, like R, S, and T? To solve this dilemma, I thought I’d take a scientific approach to the problem: namely, google around to see if anyone has posted frequency data on surnames beginning with the various letters of the alphabet. At this point, I would just be happy with data relegated to the United States.
Alas, as far as I can find, no one has published such information. However, the United States Census Bureau has published surname frequency data from the 1990 Census (more info here). This report is a sampling of around 88,000 surnames ranked by popularity and percentage frequency in the sample population. Hmm. How could I coalesce this information to show the percent frequency of surnames that begin with the various letters of the English alphabet? I know, PowerShell (if you haven’t noticed, I’ve been on a bit of a PowerShell rant of late)! Actually, the solution I came up with combined PowerShell and LogParser.
I downloaded the report and found that it was persisted in a fixed width format. Unlike Comma Separated Value (CSV) files, PowerShell doesn’t seem to deal well with fixed width data files. MOW did some intense work here, but that looks like a lot of work, so I just borrowed this idea to convert the fixed width file to a CSV file and then process the results:
With the CSV file in hand, I could then write my LogParser query to perform my calculations. I would have liked to use Linq for my queries and thus keep my solution purely within the domain of PowerShell, but I’m not quite sure how to use Linq in PowerShell—or if it’s even possible with version 1.0. That’s one more research item to add to my list.
Anyway, while constructing my LogParser query, I found something interesting: the percent frequencies of the surnames in the sample don’t add up to 100%. Rather, they add up to something like 147%. I wasn’t expecting that. So, I guess if I want to get an accurate percent frequency of a given surname (and then sum those percentages together for all surnames beginning with the same letter), I’d better divide the value by the true total, 147, and use those results. Of course, I don’t want to hard-code that total value—I want my query to calculate it for me.
To do this, normally I would nest a SELECT SUM(Percent) AS TotalValue within my larger SELECT query that produces the report I’m interested in, but LogParser seemed to have a problem with this kind of nesting—my guess is that LogParser doesn’t like two or more SELECTs together that are reading from the same file. Alright, I’ll perform two separate calls to LogParser: one to get my total percent value and the second to use this value to calculate the percentage of each letter in the sample population. One thing I learned by doing that first query is that I can actually capture the LogParser results back into a PowerShell variable:
|
$query1 = "SELECT SUM(Field2) FROM " + $newCSVFile
$totalPercent = & logparser $query1 -i:CSV -headerRow:OFF -q:ON
|
Now, with that value in hand, I can use PowerShell and LogParser to render the report I’m looking for (notice that I’ve included a calculation for recommended Address Page allotment into my query by calculating the percentage frequency by 34—the number of blank address pages):
|
$query2 = "SELECT SUBSTR(Field1, 0, 1) AS SurnameLtr, `
SUM(Field2) AS SummedFreqAmt, `
DIV(SummedFreqAmt, " + $totalPercent + ") AS PercentOfFreqAmt, `
ROUND(MUL(PercentOfFreqAmt, 34)) AS PagesToAllocate `
FROM " + $newCSVFile + " GROUP BY SUBSTR(Field1, 0, 1) ORDER BY SurnameLtr"
& logparser $query2 -i:CSV -headerRow:OFF -rtp:26
|
And my report in the PowerShell console:
And now I have a more reasoned way of allotting my blank address pages to particular letters of the alphabet without just guessing. Of course, the page allotments total 35—I only have 34 to dole out. And what happens if I meet someone with a last name starting in U or Z?
After I explained this work to my wife, she just said, “why don’t you keep the pages blank until you need them? If you end up needing more, you can always buy more.” Huh. Never thought of that. Guess that’s a more rational approach. Darn it.