Three Is It

Because two isn't enough and four is just too many

If we come to read anything in Holy Scripture that is in keeping with the faith in which we are steeped, capable of several meanings, we must not by obstinately rushing in, so commit ourselves to any one of them that, when perhaps the truth is more thoroughly investigated, it rightly falls to the ground and we with it.
St. Augustine
Home Blogs Genealogy Brad's Bookshelf Subscriptions Contact Sign in
 

About the author

Brad Butts is a .NET developer and architect. He is married with children and enjoys reading, working out, and genealogy is his five minutes of spare time.
E-mail me Send mail
National Debt Clock

Recent comments

Authors

Disclaimer

The opinions expressed herein are my own personal opinions and do not represent my employer's view in anyway.

© Copyright 2010

Of low-tech address books and surname distributions

address_pageWarning: this post will either a) show that I’m obsessive/compulsive and must be stopped, b) prove that I’m a fool/moron and should be left to my own devices, or c) all of the above.

I’ve been reading David Allen’s Getting Things Done (GTD) with the hope of actually getting something done in 2009.  One of the core concepts of GTD is having a trusted system in which you place data on all your projects, next actions, calendar data, tickler entries, etc.  As a geek, I’d love my trusted system to be some electronic widget that can easily sync with my home and office Outlook, Google Calendars, and so on.  But should that widget be something I combine with my phone—as in a Blackberry, smart phone, iPhone, or whatever—or should it be something separate as in a Pocket PC, PDA, or the like?  Would a tablet PC or one of those tiny notebooks be a better choice?

Surprisingly, I’m finding that many tech-savvy GTDers are going to the low-tech moleskines, Hipster PDAs, and such.  For one thing, there’s no boot time with these low-tech organizers.  (Matthew Cornell has cataloged some discussion points on high-tech versus low-tech trusted systems here.)  So, for the time being, I’m going low-tech.

The wife bought me one of those Franklin-Covey planners and I began loading it into the seven-ring binder when I encountered a dilemma: the paper planner includes 17 double-sided blank address pages.  A little box in the upper right corner seems to shout, “write a letter in me!”  But what letter do I write?  If I write “A” on the first side of page 1, do I write “B” on the backside of page 1 or do I write “A” again and write “B” on the first side of page 2?  How many sides should I allocate for surname letters I perceive to be more popular, like R, S, and T?  To solve this dilemma, I thought I’d take a scientific approach to the problem: namely, google around to see if anyone has posted frequency data on surnames beginning with the various letters of the alphabet.  At this point, I would just be happy with data relegated to the United States.

Alas, as far as I can find, no one has published such information.  However, the United States Census Bureau has published surname frequency data from the 1990 Census (more info here).  This report is a sampling of around 88,000 surnames ranked by popularity and percentage frequency in the sample population.  Hmm.  How could I coalesce this information to show the percent frequency of surnames that begin with the various letters of the English alphabet?  I know, PowerShell (if you haven’t noticed, I’ve been on a bit of a PowerShell rant of late)!  Actually, the solution I came up with combined PowerShell and LogParser

I downloaded the report and found that it was persisted in a fixed width format.  Unlike Comma Separated Value (CSV) files, PowerShell doesn’t seem to deal well with fixed width data files.  MOW did some intense work here, but that looks like a lot of work, so I just borrowed this idea to convert the fixed width file to a CSV file and then process the results:

 

 

 

 

 

#from http://www.eggheadcafe.com/software/aspnet/33254303/fixed-width-column-import.aspx
$fixedWidthFile = 'C:\data_files\personal\other\dist.all.last'
$newCSVFile = $fixedWidthFile + ".csv"

if(!(test-path $newCSVFile))
{
    $pat = '^(.{15})(.{7})(.{10})(.{2})$'
    $rep = '$1,$2,$3,$4'
    (gc $fixedWidthFile) -replace $pat, $rep -replace ' ' > $newCSVFile
}

With the CSV file in hand, I could then write my LogParser query to perform my calculations.  I would have liked to use Linq for my queries and thus keep my solution purely within the domain of PowerShell, but I’m not quite sure how to use Linq in PowerShell—or if it’s even possible with version 1.0.  That’s one more research item to add to my list.

Anyway, while constructing my LogParser query, I found something interesting: the percent frequencies of the surnames in the sample don’t add up to 100%.  Rather, they add up to something like 147%.  I wasn’t expecting that.  So, I guess if I want to get an accurate percent frequency of a given surname (and then sum those percentages together for all surnames beginning with the same letter), I’d better divide the value by the true total, 147, and use those results.  Of course, I don’t want to hard-code that total value—I want my query to calculate it for me.

To do this, normally I would nest a SELECT SUM(Percent) AS TotalValue within my larger SELECT query that produces the report I’m interested in, but LogParser seemed to have a problem with this kind of nesting—my guess is that LogParser doesn’t like two or more SELECTs together that are reading from the same file.  Alright, I’ll perform two separate calls to LogParser: one to get my total percent value and the second to use this value to calculate the percentage of each letter in the sample population.  One thing I learned by doing that first query is that I can actually capture the LogParser results back into a PowerShell variable:

$query1 = "SELECT SUM(Field2) FROM " + $newCSVFile
$totalPercent = & logparser $query1 -i:CSV -headerRow:OFF -q:ON

Now, with that value in hand, I can use PowerShell and LogParser to render the report I’m looking for (notice that I’ve included a calculation for recommended Address Page allotment into my query by calculating the percentage frequency by 34—the number of blank address pages):

$query2 = "SELECT SUBSTR(Field1, 0, 1) AS SurnameLtr, `
        SUM(Field2) AS SummedFreqAmt, `
        DIV(SummedFreqAmt, " + $totalPercent + ") AS PercentOfFreqAmt, `
        ROUND(MUL(PercentOfFreqAmt, 34)) AS PagesToAllocate `
    FROM " + $newCSVFile + " GROUP BY SUBSTR(Field1, 0, 1) ORDER BY SurnameLtr"

& logparser $query2 -i:CSV -headerRow:OFF -rtp:26

And my report in the PowerShell console:

surnameDist1

And now I have a more reasoned way of allotting my blank address pages to particular letters of the alphabet without just guessing.  Of course, the page allotments total 35—I only have 34 to dole out.  And what happens if I meet someone with a last name starting in U or Z?

After I explained this work to my wife, she just said, “why don’t you keep the pages blank until you need them?  If you end up needing more, you can always buy more.”  Huh.  Never thought of that.  Guess that’s a more rational approach.  Darn it.

Be the first to rate this post

  • Currently 0/5 Stars.
  • 1
  • 2
  • 3
  • 4
  • 5

Categories: Technology Blog
Posted by Brad on Sunday, January 04, 2009 9:26 PM
Permalink | Comments (2) | Post RSSRSS comment feed

Related posts

Comments

low cost franchise gb

Monday, May 11, 2009 2:42 PM

low cost franchise

hehehe good post ;) Like your blog too, considering changing over to blogengine from wordpress

Cheers

Matt

my net hubby us

Thursday, July 16, 2009 11:15 PM

my net hubby

This information I got in this site will surely make so handy to concern problem I wonder If I will tell my friends this site of yours,.,

Comments are closed