Monday, September 15, 2008

stackoverflow, OR: designing a database for search engines

As we know, Google, MS, Yahoo and a number of search startups are places that CS, math and physics majors from MIT, Stanford, Berkeley, CMU, UIUC want to work at. It used to be that Goldman and Lehman were *the* places. In the late 80's, i programmed ginormous cashflow models for Merrill Lynch's mortgage and asset-backed securities, tens of thousands of lines of C, Fortran and APL. I loved the job, tho i remember being called "lazy" and "minimal bonus" a few times because i was exhausted and wanted to go home at 11 PM on a Friday. When i got tired, i used to think of the associates at Skadden and Cravath, who only seem to get home to sleep 5 or 6 days a week. They're like Michael Phelps with redlined reps and warranties.

Anyway, Google and startups have hired small armies (divisions, brigades?) of supersmart people to build math models to run Markov chains, strip HTML tags, identify terms to receive greater weight, pair acronyms with full terms, tally common misspellings, gazetteer proper nouns, and bust apart compound words so you don't have to. If you want a thin-crust pizza in Guayaquil, or help putting both feet behind your head in yoga, the first page of Google search results will probably get you something.

If you have problems getting a web app to work efficiently with Apache, or MySQL, or Django, or whatever, things are a little different. I'll tell you why. Programming is accessible to most people. My 11-year-old niece can look at VB macros and intuit what they do. Natural Language Processing of plain-text docs is hard. Hal Daume's blog, one of the most accessible blogs, is, uh, really not very accessible unless you understand linear algebra, statistics, and diff eq's pretty well. So doing "semantic" indexing on enormous codebases in dozens of computer languages, and on mailing lists / blogs in several natural languages (English, Chinese, Russian, Japanese, various European languages), that is really, really difficult.

That's where stackoverflow comes in. Joel and Atwood are 2 of the leading chroniclers of the practice of software dev, and they know all the web 2.0 hooks: not too dense a layout, tags, easy drill-down with tabbed navigation, fast/functional search box, ... Geez, that sure looks like reddit's stylesheet, huh? But those are not breakthroughs. The breakthrough is this: they realized that heaps of well-paid, well-nourished PhDs making thousands of incremental improvements in crawlers and indexing techniques haven't made finding answers to PHP problems substantially easier. So they're designing a database of software dev info to be indexed.

Joel/Atwood set guidelines for how to ask questions: Don’t combine multiple answers. Then they open the question and each of the answers to be edited / improved / added onto, wiki style. There's no discussion / replies. There are answers. You can take existing answers and edit them, or roll up partial answers into new, better answers. And ultimately, like when wikipedia works, you should have a comprehensive / authoritative answer for your carefully articulated question.

This is pure genius. Now if i could find where i registered for OpenID logins, ...
