Sunday, June 21, 2009

don't be Slicehost's low hanging fruit

Well, i've been reading all the blogs about setting up rails, django, WordPress on Slicehost. Two phrases you commonly see:
*I've never done linux admin OR: I'm not a hardcore system administrator*

*Now that we've finished securing our slice*

Whoa nelly, you're not *quite* done securing that doggy! I'm not accusing you of being one of them CC-types, you know, "cargo ..." but here are some things you need to reasonably secure that thing:

- domain registrar: conceal your home address, email, phone number in _whois_. There's at least one registrar that does this for free.

- SSH brute force attacks: set iptables to drop more than a few connections a minute (ifconfig: check NIC is "eth0")

- update the ruby interpreter from p-111:

- DENYHOSTS: read this guy's account of how many IP's he's blocked: hint: not dozens, no, not hundreds, either

- a few people can use TCP wrappers, having fixed IP addresses, but that's a relative rarity, I think, in the age of Comcast (yes that's what our era is)

- encrypt home directory. Excellent blog, Mr. Kirkland's

- strong password, not dictionary crackable: take the first letters of a sentence, mix upper and lowercase letters, append punctuation, numbers; OR do the mixed case thing on a dictionary word, split in the middle, insert punctuation, numbers;

- login ID not "demo" certainly, and not prone to dictionary attacks. Ideally different user names for mysql and linux/SSH, but capistrano doesn't like this by default.

- which brings us to auditing capistrano, fabric, vlad, or whatever you're using to deploy from SVN/git. Make sure no login ID's and passwords are creeping in (rails' database.yml) to the slice in plaintext. You probably want to check for personal info: name, address, phone, IP addresses, whatever.

- heidi SQL: this is a common addon

- slice Manager: this is the big invitation for the crackers: "Hit Me, Kick Me!" I haven't seen any satisfactory answers, but all the more reason your login id and password shouldn't be dictionary-crackable.

- SELinux, AppArmor, grsecurity (don't know how easily the first two go with ubuntu)

- some good books. You need to read something about security. I spent a few hours picking some good ones for you;

- blogs/wikis, too: Mr. Ellis has a good series

(truthfully, it was only a few minutes. OK, 45 seconds)
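
For the SSH brute-force item in the list above, here's roughly what the iptables rate limit can look like. Treat it as a sketch, not gospel: it assumes the NIC is eth0 (check with ifconfig), that sshd listens on port 22, and that the "recent" match is available in your kernel. The list name SSH_BRUTE and the limit of 4 new connections per 60 seconds are illustrative values, not anything Slicehost prescribes.

```shell
# Sketch only. Assumptions: NIC is eth0, sshd is on port 22, and the
# "recent" match is available. SSH_BRUTE is just a label for the tracking list.

# Remember the source address of every new SSH connection.
sudo iptables -A INPUT -i eth0 -p tcp --dport 22 -m state --state NEW \
  -m recent --set --name SSH_BRUTE

# Drop a source once it opens a 4th new connection inside 60 seconds.
sudo iptables -A INPUT -i eth0 -p tcp --dport 22 -m state --state NEW \
  -m recent --update --seconds 60 --hitcount 4 --name SSH_BRUTE -j DROP
```

These rules are appended with -A, so they only bite if no earlier rule has already accepted port 22 traffic, and they vanish on reboot unless you save and restore them.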

Monday, September 15, 2008

stackoverflow, OR: designing a database for search engines

As we know, Google, MS, Yahoo and a number of search startups are places that CS, math and physics majors from MIT, Stanford, Berkeley, CMU, UIUC want to work at. It used to be Goldman and Lehman were *the* places. In the late 80's, i programmed ginormous cashflow models for Merrill Lynch's mortgage and asset-backed securities, tens of thousands of lines of C, fortran and APL. I loved the job, tho i remember being called "lazy" and "minimal bonus" a few times because i was exhausted and wanted to go home at 11 PM on a Friday. When i got tired, i used to think of the associates at Skadden and Cravath, who only seem to get home to sleep 5 or 6 days a week. They're like Michael Phelps with redlined reps and warranties.

Anyway, Google and startups have hired small armies (divisions, brigades?) of supersmart people to build math models to run markov chains, strip HTML tags, identify terms to receive greater weight, pair acronyms with full terms, tally common misspellings, gazetteer proper nouns, and bust apart compound words so you don't have to. You want a thin-crust pizza in Guayaquil, or help putting both feet behind your head in yoga, the first page of google search results will probably get you something.

If you have problems getting a web app to work efficiently with Apache, or mySQL, or django, or whatever, things are a little different. I'll tell you why. Programming is accessible to most people. My 11-year old niece can look at VB macros and intuit what they do. Natural Language Processing of plain-text docs is hard: Hal Daume's blog, one of the most accessible blogs, is, uh, really not very accessible unless you understand linear algebra, statistics, diff eq's pretty well. So doing "semantic" indexing on enormous codebases in dozens of computer languages, and mailing lists / blogs in several natural languages (English, Chinese, Russian, Japanese, various European languages), that is really, really difficult.

That's where stackoverflow comes in. Joel and Atwood are 2 of the leading chroniclers of the practice of software dev, and they know all the web 2.0 hooks: not too dense a layout, tags, easy drill down with tabbed navigation, fast/functional search box, ... Geez, that sure looks like reddit's stylesheet, huh? But those are not breakthroughs, the breakthrough is this. They realized heaps of well-paid, well-nourished PhDs making thousands of incremental improvements in crawlers and indexing techniques haven't made finding answers to PHP problems substantially easier. So they're designing a database of software dev info to be indexed.

Joel/Atwood set guidelines for how to ask questions: Don’t combine multiple answers. Then they open the question and each of the answers to be edited / improved / added onto, wiki style. There's no discussion / replies. There are answers. You can take existing answers and edit them, or roll up partial answers into new, better answers. And ultimately, like when wikipedia works, you should have a comprehensive / authoritative answer for your carefully articulated question.

This is pure genius. Now if i could find where i registered for openID logins, ...

Programming and search engines

Yes, Joel, search is hard. Despite the efforts of Guido, Hejlsberg, Armstrong, Gosling, Odersky, Wall, Matz, DHH, McCarthy and the Haskell guys to make development faster, and Larry and Sergey to plop all the world's data, there's a lotta blogs talking about digging through Google search results on one issue for 20 minutes, an hour, an afternoon, ...

Once in a while, somebody has an epiphany/ lightbulb/ out of the box moment: Use a smaller funnel: Yahoo search, delicious tags, Google custom search engine, a subreddit, artima.com, searchyc.com. A few folks *are* blessed with a central repository of info that's comprehensive, peer-reviewed, and well-indexed, like the PHP docs. And Joel and Atwood's stackoverflow, out of beta today, looks like a really beautiful approach to the problem.

Why is it so hard? There are several reasons. Basically it's the sheer volume of stuff, and problems in defining terms.

1. The sheer volume of mailing lists/forums/groups. How many comp.lang.__ lists were there 10 years ago? How many Google groups now? And how many mirrors of those groups (nabble, GMane, activestate, ...). How many ruby users groups' mailing lists are worth reading? I'd say at least 6: Utah, Seattle, Denver, Orlando, Columbus, San Diego, Wellington (NZ)...

These days, first thing a dev starting an open source project does is choose a name, second is start a Google group, third is open Textmate, type "#!/usr/bin/env perl". After a while people fork or submit patches or test suites, so open a second Google group for committers on the project.

2. Another source of mirrors: open source /FOSS / available source code, whatever. Each linux distro and BSD spawns a few mirrors for its repository of RPMs, tarballs, BSD ports, other formats. FreeBSD: I'd say at least a hundred mirrors. Each repo may contain hundreds of thousands of lines of code (millions? Note to self: do some fact checking), popular libs have lots of inbound links, i.e. PageRank. So they're dutifully picked up by the google and yahoo bots. (and then there's sourceforge, Google code, github, ...)

This is a reverse stopword problem. It's useful to index the module, method and class names of the builtin and standard library functions in, say, java project repositories. Everything else, programmer-defined class, method, variable names, are basically throwaways, or decoys. Google search engine seems to do this separation pretty well for their 4 core languages.
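
To make the "reverse stopword" idea concrete, here's a toy version in shell: instead of throwing away a short list of common words, you keep only a short list of known names and throw away everything else. This is a sketch under loud assumptions: the allowlist STDLIB_NAMES and the function name index_terms are made up for illustration, and a real indexer would need a proper parser and the full standard library, not five class names.

```shell
# Hypothetical allowlist: a handful of Java standard-library class names.
# A real index would load thousands of these from the JDK docs.
STDLIB_NAMES='ArrayList HashMap Iterator String StringBuilder'

# index_terms: read source code on stdin, print only the identifiers that
# are in the allowlist, one per line, sorted and de-duplicated.
index_terms() {
  # Turn every run of non-identifier characters into a newline,
  # giving one candidate token per line.
  tr -cs 'A-Za-z0-9_' '\n' | sort -u | while read -r ident; do
    [ -n "$ident" ] || continue
    case " $STDLIB_NAMES " in
      *" $ident "*) printf '%s\n' "$ident" ;;   # standard name: keep it
    esac                                        # anything else: throwaway
  done
}

printf 'HashMap<String, Foo> userCache = new HashMap<String, Foo>();\n' | index_terms
```

The sample line prints HashMap and String, and drops Foo, userCache and new, which is the separation described above: library names are index terms, programmer-defined names are decoys.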

2.5 Good blogs are hard to maintain. Over the years a lot of good blogs have been brought down by spam and fly-by-night web hosts and domain registrars. A number of WordPress blogs were de-indexed by Google this past year because they were injected with malware. A non-trivial number of delicious tags for e.g. java point to Google's archives 'cause the blog isn't there anymore.

The converse/obverse: lots of valuable programming information resides in blogs that are web design / SEO horror shows: typos/grammos, fucking curse words, URL is mostly a 9 digit random number with a 15-char hex session id, mangled / poorly highlighted source code.

2.6 So how do you estimate the universe of blogs, wiki pages and mailing posts that are worth bookmarking / cataloging for, say, C# devs? Delicious lets you look at 20 pages of stuff for one tag. For javascript, that's less than one day's worth of tags. So, say, 2k-3k tags a day. Technorati has similar volumes, other sites, reddit, magnolia, furl, spurl, ... have less. Can you apply a metric to pick the best blog posts/tagged pages? Sure, just look at # tags/month and PageRank. Well, actually, that lets you pick out the pages that have the best SEO and are most easily understood. This has, again, no easy answer, or i would've put together a database of the 25k best blogs, wikis, mailing list posts on rails programming, and let you subscribe to it for, say, $45/hour. And then I'd be as rich as a Lehman partner.

3. unique terms: Software devs have some, but not quite like medicine or law, say, where absolute precision and recall drive the process. Metabolic processes and pathologies have Latin names for causes/symptoms. The U.S. legal system has procedures to determine which jurisdiction and body of law, criminal/civil, federal/state, case/common or legislated, is applicable, so you know where to point WestLaw or Lexis for your precedents. Python and ruby devs, in contrast, are a mixed group. Some are CS-trained devs doing shrink-wrap apps for sale, some are business analysts, sysadmins, or DBAs, some are physicists or mortgage-backed analysts. So some call it "AOP", others call them decorators, or method hooks, or callbacks, or method intercept, ...

There are now large communities of devs in China, India, Russia, South America, everywhere really, blogging and mail listing in their native tongues. Lots of times I've seen Delicious tags of stuff that has interesting code and narrative that i can't understand. This is on the upswing, naturally.

The hardest thing to index on the web: blogs in mixed natural languages (i.e. not side by side translations) or mixed source code and natural language writing. How do you tokenize and stem these things?

A web dev has to google for "input form", "web form", "html form". Then ask for: "Never sent", "Doesn't work", "no response". Google for "php html form submit doesn't work", you get nearly 400K hits. Then you attack the issue for 2 of the 5 major browsers it doesn't work in... This is why good web devs and designers are hard to find. This is where they'll IM their network of problem solvers, search on specific individuals' posts to comp.lang.php, hit a carefully built google custom search engine. Or put up a pastie with the minimum code to replicate the problem and the failing test case, jump on IRC, problem solved. This is where the good dev sticks to disciplined, secure, maintainable process flow. They could just find an ajax library that works and stick it in.

Saturday, December 29, 2007

Rails 2, Ubuntu 7.10 Gutsy, ATI radeon, fglrx, dual-head

According to me, some topics are so over-blogged as to be a nuisance rather than a help: Ruby DSL's, ImageMagick on OS X, install Rails on debian/Ubuntu. There must be hundreds of blogs and threads on how to install ruby and rails on sarge, etch, badger, dapper, edgy, feisty, and now lenny and hardy.. what's next, molly and itchy? And think, hundreds of delicious tags translates to thousands of Google hits.

Anyway this blog is different. Notwithstanding the foregoing, I claim that installing Gutsy, getting it to run twin-head and running rails 2 is #fast# and *easy*. Nobody's ever said that before. That's why macbooks are selling like, um, twitter subscriptions: buy one, fix screwed up factory ruby install, plug in $200 19" monitor from Costco, buy textmate license, install Rails per hivelogic/macPorts, you're productive. Contrast Debian etch/ubuntu Feisty: install/upgrade O/S, spend up to a week trying to get dual-head config, vim/emacs highlighting /tab completion/project view, SVN client and server etc. to run cleanly, (optional) ponder career switch or suicide, ... buy macbook, you're productive :} This isn't funny, it's happened.

Step 0: start with a fresh Gutsy install if you have more than a dozen backups of xorg.conf in /etc/X11. (I think this procedure would work with Feisty automatically upgraded to Gutsy, but I'm not going to install Feisty and upgrade to find out.) Open a spreadsheet, or OpenOffice doc to record what you're doing, 'cause something will be different for each install. Those are my 2 big takeaways: backup xorg.conf and write down every step of the install. Probably not the 1st time you've been told...

Step 1: run security updates: System / Administration / Update Manager... backup /etc/X11/xorg.conf. Enable fglrx (Sys / Admin / Restricted Drivers menu), reboot. Now, per this thread,
sudo aticonfig --initial=dual-head --screen-layout=left
CTRL-ALT-Backspace to restart X. Note this isn't perfect. In particular, if you edit the conf in Sys / Admin / Screens and Graphics, it'll wreck xorg.conf, you'll boot to a black void. I said fast and easy, not perfect.
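
Since "backup xorg.conf" is half the advice here, a small helper for it doesn't hurt. This is only a sketch: the function name backup_conf and the timestamp-plus-.bak naming are illustrative choices, not anything Ubuntu or aticonfig does for you.

```shell
# backup_conf: copy a config file to a timestamped sibling and print the
# name of the copy. The function name and the .bak suffix are illustrative.
backup_conf() {
  src=$1
  dest="$src.$(date +%Y%m%d-%H%M%S).bak"
  # -p keeps the original's mode and modification time on the copy.
  cp -p "$src" "$dest" && printf '%s\n' "$dest"
}

# Example (needs write access to /etc/X11, so run it from a root shell):
# backup_conf /etc/X11/xorg.conf
```

Timestamped names mean you can always tell which backup came before which aticonfig run, which beats a pile of xorg.conf.old, xorg.conf.old2, xorg.conf.really-old.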

If you're on nvidia/intel graphics card, or have to use open source driver, or whatever, examine this thread for your options: xinerama, TwinView, MergedFB, and note that MergedFB is superseded by RandR 1.2. Also, BigDesktop, which is for ATI's open-source driver. See how you end up with dozens of xorg.conf backups, eh? I think a lot of people hand-edited their Feisty xorg.conf, put the xinerama stanzas in per the otherwise excellent Ubuntu hacks book, and drove themselves insane trying to fix the missing cursor or missing piece of the desktop or myriad other problems. This guy has tried to make RandR straightforward with GTK, worth a try.

Step 2: do the "sudo apt-get update" and "sudo apt-get upgrade" dance. Next the ginormous command that gives you a working gcc, make and a bunch of other stuff. (stop to complain about Debian's package process, if you like)

sudo apt-get install libglib2.0-dev libusb-dev build-essential autoconf automake1.9 libtool libgnet-dev libhal-dev libhal-storage-dev libdbus-glib-1-dev subversion linux-headers-`uname -r` python-dbus

Don't ask me what it does. I'm a cargo cult. I got it from here. It works.

Step 3: install ruby from source. Download the 1.8.6 tarball from ruby-lang.org, tar xvzf, cd to the directory, ./configure, make, sudo make install. You know, should be familiar. Currently it's the 1.8.6-p111.tar.gz we're looking at.
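
Spelled out, Step 3 looks something like the function below. It's a sketch with assumptions: the tarball is already downloaded, it's named like ruby-1.8.6-p111.tar.gz, and it unpacks into a directory of the same name minus .tar.gz. The function name build_ruby_from_tarball is made up for illustration.

```shell
# build_ruby_from_tarball: unpack, configure, build and install ruby from
# a source tarball that is already on disk. Assumes the tarball unpacks
# into a directory named after the file (minus .tar.gz).
build_ruby_from_tarball() {
  tarball=$1
  if [ ! -f "$tarball" ]; then
    echo "tarball not found: $tarball" >&2
    return 1
  fi
  srcdir=$(basename "$tarball" .tar.gz)
  # Run the build in a subshell so the caller's working directory is
  # left alone; each step runs only if the previous one succeeded.
  (
    tar xzf "$tarball" &&
    cd "$srcdir" &&
    ./configure &&
    make &&
    sudo make install
  )
}

# Example:
# build_ruby_from_tarball ruby-1.8.6-p111.tar.gz
```

Afterwards, "ruby -v" should report the version you just built. If it reports something else, an older ruby is probably ahead of the new one in your PATH.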

Step 4: "sudo apt-get install libopenssl-ruby",
"sudo apt-get install rubygems"
"sudo gem update --system"
This is gems 1.0.1, read this if any issues...

Open irb and "require 'rubygems'". It should return "=> true". If not, you have readline issues, which i think are more of a FreeBSD, OSX, Suse or CentOS/Fedora problem than debian/ubuntu.

Step 5: sudo gem install rails --source http://gems.rubyonrails.org

Step 6: you're not done, you have to set up vim, emacs, komodo, MySQL or Postgres. Or edit with aptana, Jedit or kate or ... ImageMagick, attachment_fu, SWFupload, acts_as_ferret or solr. restful_authentication, OpenID, facebook plugin, backgroundrb, rspec, autotest, firebug, selenium/watir, HAML / SASS/ LiquidView, opera, webkit, VMWare to do IE testing, caching and expiration, sessions with/without cookies, 3 or 4 ajax libs (unobtrusive, LowPro, Jquery), beast forum, SMS/twitter/email_notification, adobe flex ...

Did i forget anything? Oh yeah, Painless PNG plugin, lightbox, Hobo or Streamlined, captcha, Calendar Date Select. CSS for browser reset and tabbed nav: YUI Grids, blueprint, tabnav. Better_nested_set, taggable on more_growth_hormones, state_machine, geokit, sparklines, fckeditor or redcloth or white_list or... Globalize or gettext or gibberish, query_trace. It's entirely possible that rubyworks or Bitnami Rubystack could save you some time...

While you're at it, experiment a little: darcs, git, mercurial, camping, merb, sinatra, SQLite, couchDB, seaside, erlyweb. And don't forget about S3, Ec2, monit, god, mint, AWstats, Google sitemap... Oops, premature opt.

But i'm done blogging. I claim each of these topics is adequately and accurately blogged.

Update Jan 1, 2008: the latest install with ATI radeon card (1 SVGA, 2 DVI) doesn't boot past putting up a funny cursor on 2 black screens.