Monday, September 15, 2008

Programming and search engines

Yes, Joel, search is hard. Despite the efforts of Guido, Hejlsberg, Armstrong, Gosling, Odersky, Wall, Matz, DHH, McCarthy and the Haskell guys to make development faster, and Larry and Sergey to plop all the world's data online, there are a lotta blogs talking about digging through Google search results on one issue for 20 minutes, an hour, an afternoon, ...

Once in a while, somebody has an epiphany / lightbulb / out-of-the-box moment: use a smaller funnel. Yahoo search, delicious tags, a Google custom search engine, a subreddit, artima.com, searchyc.com. A few folks *are* blessed with a central repository of info that's comprehensive, peer-reviewed, and well-indexed, like the PHP docs. And Joel and Atwood's stackoverflow, out of beta today, looks like a really beautiful approach to the problem.

Why is it so hard? There are several reasons. Basically it's the sheer volume of stuff, and problems in defining terms.

1. The sheer volume of mailing lists/forums/groups. How many comp.lang.__ lists were there 10 years ago? How many Google groups now? And how many mirrors of those groups (nabble, GMane, activestate, ...)? How many Ruby user groups' mailing lists are worth reading? I'd say at least 7: Utah, Seattle, Denver, Orlando, Columbus, San Diego, Wellington (NZ)...

These days, the first thing a dev starting an open source project does is choose a name, the second is start a Google group, the third is open TextMate and type "#!/usr/bin/env perl". After a while people fork or submit patches or test suites, so the dev opens a second Google group for committers on the project.

2. Another source of mirrors: open source / FOSS / available source code, whatever. Each Linux distro and BSD spawns a few mirrors for its repository of RPMs, tarballs, BSD ports, other formats. FreeBSD: I'd say at least a hundred mirrors. Each repo may contain hundreds of thousands of lines of code (millions? Note to self: do some fact checking), and popular libs have lots of inbound links, i.e. PageRank. So they're dutifully picked up by the Google and Yahoo bots. (And then there's SourceForge, Google Code, GitHub, ...)

This is a reverse stopword problem: instead of a short list of words to ignore, you want a short list of words to keep. It's useful to index the module, method and class names of the builtin and standard library functions in, say, Java project repositories. Everything else (programmer-defined class, method and variable names) is basically throwaway, or a decoy. Google's search engine seems to do this separation pretty well for their 4 core languages.
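A rough sketch of that reverse stopword idea, in Python rather than Java just to keep it short. The keep-list here is an assumption: Python's builtin names plus a handful of standard library module names I picked by hand. A real indexer would need a much fuller list per language.

```python
import builtins
import re

# Assumption: the "keep" list is the builtin names plus a few stdlib
# modules chosen by hand for illustration. Everything else is dropped.
KNOWN_NAMES = set(dir(builtins)) | {"os", "sys", "re", "json", "urllib"}

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def index_terms(source):
    """Return the identifiers in source that are worth indexing."""
    found = set(IDENTIFIER.findall(source))
    return sorted(found & KNOWN_NAMES)


sample = "import os\nmy_widget_count = len(os.listdir(path))\nprint(my_widget_count)"
print(index_terms(sample))  # ['len', 'os', 'print']
```

The programmer-defined names (my_widget_count, path) fall out, and so does listdir, which shows how incomplete a hand-built keep-list is.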

2.5 Good blogs are hard to maintain. Over the years a lot of good blogs have been brought down by spam and fly-by-night web hosts and domain registrars. A number of WordPress blogs were de-indexed by Google this past year because they were injected with malware. A non-trivial number of delicious tags for e.g. Java point to Google's archives because the blog isn't there anymore.

The converse/obverse: lots of valuable programming information resides in blogs that are web design / SEO horror shows: typos/grammos, fucking curse words, URLs that are mostly a 9-digit random number with a 15-char hex session id, mangled / poorly highlighted source code.

2.6 So how do you estimate the universe of blogs, wiki pages and mailing list posts that are worth bookmarking / cataloging for, say, C# devs? Delicious lets you look at 20 pages of stuff for one tag. For JavaScript, that's less than one day's worth of tags. So, say, 2k-3k tags a day. Technorati has similar volumes; other sites (reddit, magnolia, furl, spurl, ...) have less. Can you apply a metric to pick the best blog posts / tagged pages? Sure, just look at # tags / month and PageRank. Well, actually, that lets you pick out the pages that have the best SEO and are most easily understood. This has, again, no easy answer, or I would've put together a database of the 25k best blogs, wikis, and mailing list posts on Rails programming, and let you subscribe to it for, say, $45/hour. And then I'd be as rich as a Lehman partner.
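For what it's worth, the naive metric is about five lines. This is a sketch with made-up pages and numbers (the example.com URLs, the tag counts and the 0-10 PageRank values are all hypothetical), and multiplying the two signals is just one arbitrary way to combine them.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    tags_per_month: int  # hypothetical count of new bookmarks per month
    pagerank: int        # hypothetical 0-10 score


def rank_pages(pages, top_n=2):
    """Sort pages by tags/month times PageRank, highest first."""
    scored = sorted(pages, key=lambda p: p.tags_per_month * p.pagerank, reverse=True)
    return scored[:top_n]


pages = [
    Page("http://example.com/seo-polished-tutorial", 400, 7),
    Page("http://example.com/ugly-but-correct-post", 15, 2),
    Page("http://example.com/popular-list-post", 300, 5),
]
best = rank_pages(pages)
print([p.url for p in best])
```

The ugly-but-correct post never makes the cut, which is exactly the complaint: the score rewards polish and popularity, not accuracy.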

3. Unique terms: Software devs have some, but not quite like medicine or law, say, where absolute precision and recall drive the process. Metabolic processes and pathologies have Latin names for causes / symptoms. The U.S. legal system has procedures to determine which jurisdiction and body of law (criminal / civil, federal / state, case / common or legislated) is applicable, so you know where to point WestLaw or Lexis for your precedents. Python and Ruby devs, in contrast, are a mixed group. Some are CS-trained devs doing shrink-wrap apps for sale, some are business analysts, sysadmins, or DBAs, some are physicists or mortgage-backed securities analysts. So some call it "AOP", others call them decorators, or method hooks, or callbacks, or method intercept, ...
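One way a search tool could paper over the vocabulary problem is a synonym table that maps every variant onto one concept label. A minimal sketch: the table below just copies the list of names from the paragraph above, and the label "method interception" is my own placeholder, not an agreed-upon term.

```python
import re

# Assumption: a hand-built table. The variants come from the list above;
# the concept label is a made-up placeholder.
CONCEPTS = {
    "method interception": [
        "aop", "decorator", "decorators", "method hook", "method hooks",
        "callback", "callbacks", "method intercept",
    ],
}


def canonical_terms(query):
    """Return the concept labels whose variants appear as whole words in query."""
    q = query.lower()
    hits = []
    for concept, variants in CONCEPTS.items():
        if any(re.search(r"\b" + re.escape(v) + r"\b", q) for v in variants):
            hits.append(concept)
    return sorted(hits)


print(canonical_terms("Python decorators for logging"))  # ['method interception']
print(canonical_terms("ruby string formatting"))         # []
```

Somebody still has to build and maintain that table, which is the job the Latin names do for free in medicine.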

There are now large communities of devs in China, India, Russia, South America, everywhere really, blogging and mail listing in their native tongues. Lots of times I've seen Delicious tags of stuff that has interesting code and narrative that I can't understand. This is on the upswing, naturally.

The hardest thing to index on the web: blogs in mixed natural languages (i.e. not side by side translations) or mixed source code and natural language writing. How do you tokenize and stem these things?
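To make the tokenizing question concrete, here's a minimal sketch for the code-plus-prose half of it. It treats every word as a possible identifier, splits on underscores and camelCase humps, and emits both the whole token and its parts. The regexes are my own and only cover ASCII; stemming and non-Latin scripts are left alone, which is of course the hard part.

```python
import re

WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+")
# Splits camelCase: "parseHTMLForm" -> parse, HTML, Form
HUMP = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def tokenize(text):
    """Lowercased tokens: each whole word, plus its parts if it splits."""
    tokens = []
    for word in WORD.findall(text):
        parts = [p for chunk in word.split("_") for p in HUMP.findall(chunk)]
        tokens.append(word.lower())
        if len(parts) > 1:
            tokens.extend(p.lower() for p in parts)
    return tokens


print(tokenize("Call parseHTMLForm or get_user_name"))
# ['call', 'parsehtmlform', 'parse', 'html', 'form', 'or',
#  'get_user_name', 'get', 'user', 'name']
```

Emitting both forms means a search for "html form" and a search for "parseHTMLForm" can hit the same post.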

A web dev has to Google for "input form", "web form", "html form". Then ask for: "Never sent", "Doesn't work", "no response". Google for "php html form submit doesn't work", and you get nearly 400K hits. Then you attack the issue for 2 of the 5 major browsers it doesn't work in... This is why good web devs and designers are hard to find. This is where they'll IM their network of problem solvers, search on specific individuals' posts to comp.lang.php, or hit a carefully built Google custom search engine. Or put up a pastie with the minimum code to replicate the problem and the failing test case, jump on IRC, problem solved. This is where the good dev sticks to a disciplined, secure, maintainable process flow, when they could just find an Ajax library that works and stick it in.
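The query juggling is easy to spell out, using the phrases from the paragraph above. A small sketch; the function name and the "php" prefix argument are just for illustration.

```python
from itertools import product


def query_variants(prefix, nouns, symptoms):
    """Build every noun/symptom combination as a search string."""
    return [f"{prefix} {noun} {symptom}" for noun, symptom in product(nouns, symptoms)]


nouns = ["input form", "web form", "html form"]
symptoms = ["never sent", "doesn't work", "no response"]
queries = query_variants("php", nouns, symptoms)
print(len(queries))  # 9
print(queries[0])    # php input form never sent
```

Three names for the thing and three names for the symptom already make nine searches, before browser names or library names get added.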
