Tonight's mail brought a big envelope from the New York Review of Books, filled with stuff, asking us to subscribe at your standard concessionary rate.
There were two problems with this envelope:
It is true that I have once been seriously called "Brad Marie"--by Josephine Foehrenbach, who once said, "Who's coming? We have Jean Marie, Ann Marie, and Brad Marie..." But the name "Brad Marie" is in the New York Review of Books's database because once, long ago, my mother gave us a gift subscription and some harried data entry clerk settled on "Brad Marie". Our current NYRB subscription is under another, more normal name.
Are there no Perl hackers? Are there no databases to which the NYRB has access that would reveal that our address is a single-family house? Is there no concern with cleaning out the database? Is nobody willing to guess that a single-family house probably does not contain two households, especially if one of them is supposed to be named "Brad Marie"?
Posted by DeLong at June 8, 2004 09:12 PM
There is a market for everything (oh wait, wrong blog). In this market the cost of sending adverts to current subscribers doesn't exceed the cost of perl hackers and missed subscribers due to false drops.
My advice is to change the cost structure: use the postage-paid envelope to remind them you already subscribe.
Posted by: mawado on June 8, 2004 09:46 PM
Single family != single surname (oh here we go again).
Posted by: Andrew Cholakian on June 8, 2004 10:03 PM
I'm with mawado. To put it more simply, the cost of hiring, and dealing with, perl hackers is only justified if the project makes/saves a lot of money, or if the project is interesting. Speaking as a perl/database hacker, I must suggest that deduping the NYRB subscription database is a boring project.
Posted by: Rich Gibson on June 8, 2004 10:15 PM
Feh. It's cheaper to spam mail you. Good Perl hackers are expensive.
Posted by: Michael on June 8, 2004 10:27 PM
Okay, here's the deal. These folks buy mailing lists from mailing list vendors. Mailing list vendors buy mailing lists from all sorts of folks, from banks to magazine subscription lists. Then the customer says, "I want 500,000 names from census blocks with an average family income of $80K or above." The vendor runs a gigantic selection sort (we're talking about big iron here) and out pops a list, for a pretty penny.
Now, there are two ways this list can be used: a) the mailing list vendor retains the list, gets the actual items to be mailed from the customer (or, optionally, accepts the proofs from the customer and handles the actual printing), pops the items through their own address printer hardware in pre-sorted fashion (postage is cheaper that way) into postal flats, takes the postal flats to the post office, voila. Or b) the mailing list vendor sells the actual data to the customer as an electronic file, and the customer then has to do all of the work above (printing the mailing in pre-sorted fashion into postal flats).
In general, option a) is much cheaper than doing it in-house, due to the economies of scale of the big mailing houses and because the big mailing houses don't like letting their actual data leave their premises. Last time I did a mass mailing for a non-profit, I used option a), including letting the mailing house handle the printing (and they did a cheaper job of that than I could do too; they run so much volume through their preferred print houses that even with them tacking on a slight bit of overhead it was cheaper than what I could get on my own). So it is likely that the New York Review of Books never saw "Brad Marie" and "Brad"'s address on that envelope -- it was all done at the outsourced mail advertising house, which pulled it from some anonymous database (perhaps even the NYRB's own database, sold at some time in the past to that very same mailing list house and merged with millions of other addresses). At best the mailing list house did an elimination sort against the current NYRB database -- which does nothing about data that was merged into the mailing list house's database in the past.
In short, the New York Review of Books accepted the chance that some current subscribers could get mailings in order to reduce their costs by outsourcing mailings to potential customers. They could bring that function in-house, but the duplicate mailings most likely cost far less than bringing all that in-house. The equipment alone would cost far more than the $80K/year Perl hacker would cost... not to mention that it's unlikely that they would have the volume to get the volume discounts that the big mailing houses get from driving trucklots of mail slats up to their local postal sorting house.
All simple economics, really...
- Badtux the Once-led-a-nonprofit Penguin
This problem is called "Record Matching" by the Census Bureau. It is a notoriously hard database problem, because databases are not set up to do approximate match algorithms.
The complexity of the basic problem is quadratic, a basic no-no in DB processing. The heuristics to make it linear or near-linear are pretty primitive, and hard to use in a db application.
Yes, there are commercial products that do it. But they are mostly either very special purpose, or just one piece of a large, expensive data integration product.
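To make the quadratic-versus-near-linear point concrete, here is a minimal Python sketch, not anything a commenter posted. It assumes hypothetical records of the form (name, zip_code), made-up sample data, and an arbitrary similarity threshold. The naive matcher compares every pair; the "blocking" heuristic compares only records that share a cheap key (here the ZIP code).

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b, threshold=0.8):
    """Approximate string match; the 0.8 threshold is an arbitrary assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def naive_pairs(records):
    """Compare every pair of records: n*(n-1)/2 comparisons."""
    return [(a, b) for a, b in combinations(records, 2)
            if a[1] == b[1] and similar(a[0], b[0])]


def blocked_pairs(records):
    """Group records by ZIP code first, then compare only within each group."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[1]].append(rec)
    matches = []
    for block in blocks.values():
        matches.extend((a, b) for a, b in combinations(block, 2)
                       if similar(a[0], b[0]))
    return matches


# Hypothetical sample data: (name, zip_code)
records = [
    ("Pat Example", "00001"),
    ("Pat Exampel", "00001"),
    ("Jane Smith", "00002"),
    ("Jayne Smith", "00002"),
    ("Alex Jones", "00003"),
]

print(blocked_pairs(records))
```

The trade-off is that blocking can only find duplicates that agree on the blocking key, so a mistyped ZIP code hides a match that the all-pairs comparison could in principle be written to catch.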
Now we know what to call you when we're mad at you, though.
"Brad Marie, you get in the house this instant!"
Posted by: Chris Marcil on June 8, 2004 10:59 PM
All the Perl hackers are out there thinking up witty answers to the question of whether "==" or "=" is the correct syntax in "$1 US 2004 == $.xx US 1980" . . .
Posted by: Michael on June 8, 2004 11:32 PM
Is there any possible family structure not instantiated in the Bay Area?
Posted by: James on June 9, 2004 12:22 AM
I can attest that record matching is, indeed, a VeryHardThing™. Had to do it once for a relatively small customer DB (12,000 names). We used Soundex and all sorts of allowances and margins and stretching to pick out probable matches, and after a twice-as-long-as-expected development period and innumerable test runs, ended up making a human look at and approve/disapprove every 'approximate' duplicate in the DB. Took a week. And then the buggers decided they didn't want the clean DB anyway. Grrr...
But yes, I would be willing to bet money that it's cheaper to just send out a few thousand wrong or duplicate mailings.
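For readers who have not met Soundex: it reduces a name to its first letter plus three digits, so that similar-sounding spellings collide. The following is a minimal Python sketch of the standard American Soundex rules, offered as an illustration only; it is not the code the commenter used, and the function name is just a label chosen here.

```python
# Digit codes for consonants; vowels, y, h and w have no code.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character American Soundex code for a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (encoded + "000")[:4]


print(soundex("Smith"), soundex("Smyth"))
print(soundex("Robert"), soundex("Rupert"))
```

"Smith" and "Smyth" collide, which is the point, but so do "Robert" and "Rupert", which is one reason a matcher built on Soundex tends to end with a human approving each probable duplicate.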
Posted by: cyclopatra on June 9, 2004 01:04 AM
"This boy is Inadequate Privacy Policies. This girl is Poor CRM Tools."
Hmmm, no, just doesn't work.
Posted by: Ken C. on June 9, 2004 04:08 AM
Even if the Census Bureau records your address as a single-family house, how can the NYRB know that "Brad Marie" isn't your tenant, au pair, grown son who's living with you until he finds a job, or some such?
Posted by: Seth Gordon on June 9, 2004 04:14 AM
On a related note, my wife thinks I'm weird because I take all the subscription cards that fall out of my magazines and put them in the mail, just to make the jerks pay postage.
Am I the only one doing this?
Posted by: Oberon on June 9, 2004 05:45 AM
Michael,
Your whole statement is illegal in Perl. $1 is the first submatch returned from a complex regex and would give the interpreter a fit. And without context it is impossible to tell what your 'statement' is trying to do. But for the record, = is an assignment and == is an equality test. This is the same in every C-like language I know of (C, C++, Java, JavaScript, Perl, and C#).
HinderLands
Posted by: HinderLands on June 9, 2004 06:10 AM
My stepmother gets a lot of mail addressed to the fictional person of "Mr. Dallas D. Varga"
(Dallas being our family surname, D. being the initial for her first name, and Varga being her maiden name).
Presumably, the database got the order of her first, last, and maiden names shuffled around. Either a data entry error, or a programming error.
What really gets me though is why it's addressed to "Mister".
Anyway, I think it gets worse than "Brad Marie", Brad.
Posted by: Jim D on June 9, 2004 06:15 AM
You are on the dreaded NYRB mailing list. If you can figure out a use for the overstuffed offer, you'll be sitting pretty. (Let us know if you do.) Good luck!
Posted by: serial catowner on June 9, 2004 06:16 AM
If I were you I would devote time to tracking down the low down no-good so & so who named you Sue in the first place.
Posted by: bryan on June 9, 2004 06:21 AM
I don't think Reason magazine would have this problem, since the last issue they sent me had an aerial photo of my house on it.
Posted by: digamma on June 9, 2004 06:49 AM
"The complexity of the basic problem is quadratic, a basic no-no in DB processing. The heuristics to make it linear or near-linear are pretty primitive, and hard to use in a db application."
This statement doesn't make sense to me.
Theoretically, at least, I would see the problem as
for each record
    calculate a canonical version of the record
    hash the canonical version
    look the hash up in a CAM (either a hash table or a tree)
    if there's a hit, reject the address; if there's a miss, insert the address into the CAM
Now the only tricky part here is forming a canonical version of the record. The address part of this is easy --- the US post office defines a canonical form for addresses, so just do what they do. The name part is the tricky part. One possibility is to say "screw it", and just assume one address == one potential subscriber. This strikes me as a pretty reasonable choice, since even in the few cases where multiple potential subscribers live at the same address (eg students living together), the reality is that most of the time if one of them subscribes, the others will just borrow the subscriber's copy.
If this is not possible, one will have to generate one's own version of an algorithm for converting a name to canonical form, but this does not have to be perfect, just good enough to catch the common cases, and so again is not a horribly difficult undertaking.
I just don't see why this has to be a hard, quadratic problem.
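The single-pass scheme described in this comment can be sketched in a few lines of Python. This is an illustration under loud assumptions, not the commenter's code: the abbreviation table is a tiny made-up stand-in for real postal address standardization, the sample addresses are invented, and the key is the address alone (the "one address == one potential subscriber" option). A Python set plays the role of the CAM.

```python
import re

# Tiny hypothetical abbreviation table; real postal standardization is far larger.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd", "apartment": "apt"}


def canonical_address(address):
    """Lowercase, strip punctuation, and abbreviate a few common words."""
    words = re.sub(r"[^a-z0-9\s]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)


def dedupe(records):
    """Single pass: keep the first record seen for each canonical address."""
    seen = set()  # hash-based lookup, standing in for the CAM
    kept = []
    for name, address in records:
        key = canonical_address(address)
        if key in seen:
            continue  # hit: reject the address
        seen.add(key)  # miss: remember the address
        kept.append((name, address))
    return kept


# Hypothetical sample data: (name, address)
mailing_list = [
    ("Brad Marie", "123 Example Street, Anytown"),
    ("Brad", "123 Example St. Anytown"),
    ("Pat Doe", "45 Sample Avenue, Anytown"),
]

for name, address in dedupe(mailing_list):
    print(name, "|", address)
```

Each record costs one canonicalization and one set lookup, so the pass is linear on average. All of the difficulty is pushed into canonical_address: two spellings that it fails to normalize to the same string are simply treated as different households.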
I have a feeling that "Brad Marie" is gonna regret telling us this story.
Posted by: Paul Callahan on June 9, 2004 09:15 AM
"Brad Marie?" I remember reading about a transexual economist, but I thought the economist in question went from male to female, not the other way around.
Posted by: C.J.Colucci on June 9, 2004 09:27 AM
"Brad Marie?"
I remember reading about a transexual economist, but I thought the economist in question was someone else, who had gone from male to female, and not the other way around.
Or am I mistaken?
Well, it just so happens that I just got out of a meeting on this very topic. Now I work at what could be called a large financial institution so our data tends to be very complex and often very messy because of the fact that there's several systems feeding mailing lists, etc.
A magazine is a much less complex business; however, the problems remain the same. Garbage in, garbage out. Scrubbing bad data programmatically is very difficult to do in reality. Brad Marie is only one of a wide range of possible mutations of a name/address. Maynard's suggestion of deduping on address is a good common sense idea, but that does require a "canonical version of the record" for which out of the box solutions exist, at a pretty steep cost. The idea also makes some assumptions that most marketers don't necessarily agree with. The general opinion I find is that we don't know what the relationship is between Brad Marie and Brad DeLong and we don't want to make any assumptions and it's just possible that Brad Marie might want his own subscription so let's mail. I doubt Brad DeLong will mind so much. It's a pretty unscientific approach, really, but that's the way it works.
This example also highlights something about Garbage In-Garbage Out: once a piece of garbage data exists, it is very difficult to get rid of. In a sense it becomes like cockroaches. It's pretty much impossible to eradicate it entirely as it gets duplicated and sold and passed around. It's stuff like this that makes me shudder to think about Total Information Awareness. Completely innocent people could get a bad data point out there somewhere inadvertently and the results could be terrible and difficult to reverse. And with everything that we've learned about the lunatics running the asylum, it's not a pleasant thought. Something to consider...
Posted by: Chibi on June 9, 2004 10:59 AM
Speaking based on the results of a colleague down the hall several years ago at a large telephone company. The software involved in finding errors in the billing data base was considerably more complex than what a Perl hacker would put together quickly. Testing was a large-scale effort, in order to make sure that the "cleaned" data base was going to be "better" than the starting data. Just defining "better" was a somewhat difficult undertaking. In total, the benefits of the cleaning had to pay for several person-months of effort.
Because we were dealing with errors in delivering some 20 million monthly bills, the results justified the effort. IIRC, cleaning a data base with only 2 million entries would have been a losing proposition.
Posted by: Michael Cain on June 9, 2004 11:43 AM
print <<"BAH_HUMBUG";
I don't send the subscriber cards back, since I don't subscribe to any magazine I dislike.
However, I do return the business-reply envelopes they send with credit card offers. I make sure not to add anything that might identify me, such as their original offer.
BAH_HUMBUG
Posted by: M. on June 9, 2004 06:57 PM