Wednesday, November 3, 2010

...passing it along...

Every once in a while, you'll run across a serious problem (like downloading several thousand bacterial genomes from NCBI) and think to yourself, "There has got to be an easier way than doing this all by hand."  As you try to figure out a snippet of code to piece it together, you think to yourself, "This can't be that obscure of a problem.  I'm positive someone else has done this before."  And then, with a few clicks of the mouse, you find it.  Such was the case today.  Many thanks to Peter Cock (?) at http://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/python/ftp/ for publishing the code.

So, if any of you are trying to figure out how to download all 10k+ Bacterial genomes (or anything in batch) and are working on a remote UNIX system, here's how to do it:

Step 1:
Download ftputil from http://ftputil.sschwarzer.net/trac
If you're unfamiliar with Python (as I am), here's how to install it:
tar -xzf ftputil-*.tar.gz
cd ftputil-*
./setup.py install --root ~ # This will install it in ~/usr/lib if you don't have install permissions for /usr/lib

Step 2:
Create a file called get_gens.py (or whatever you want), and put this in it (I've changed the original code to download all the .fna files. It's easy to change to download something different.)

#! /usr/bin/python

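# Add the directory where ftputil was installed to the module search path.
# Python does not expand "~" by itself, so expand it explicitly.
import os
import sys
sys.path.append(os.path.expanduser("~/usr/lib/python2.4/site-packages/"))

import ftputil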

#Where to put the genomes (the script will create subdirectories
#using the same names that NCBI uses).  This directory must
#exist already.  "~" is expanded here because os.path does not do it:
base_path = os.path.expanduser("~/genomes/Bacteria/")

host = ftputil.FTPHost('ftp.ncbi.nlm.nih.gov', 'anonymous', 'password')
host.chdir('/genomes/Bacteria/')

dir_list = host.listdir(host.curdir)
for dir_name in dir_list :
    host.chdir('/genomes/Bacteria/')
    if host.path.isdir(dir_name):
        print dir_name
        host.chdir('/genomes/Bacteria/' + dir_name + '/')
        file_list = host.listdir(host.curdir)
        for file_name in file_list :
            #if file_name[-4:]==".gbk" :
            if file_name[-4:]==".fna" :
                print "File " + file_name
                if not os.path.isdir(os.path.join(base_path,dir_name)) :
                    print "Making directory " + os.path.join(base_path,dir_name)
                    os.chdir(base_path)
                    os.mkdir(os.path.join(base_path,dir_name))
                if os.path.isfile(os.path.join(base_path,dir_name,file_name)) :
                    print "Skipping file " \
                          + os.path.join(base_path,dir_name,file_name)
                elif host.path.isfile(file_name) :
                    print "Downloading file " \
                          + os.path.join(base_path,dir_name,file_name)
                    host.download(file_name, \
                          os.path.join(base_path,dir_name,file_name), 't')
                    #Download arguments: remote filename, local filename, mode
                else :
                    print "ERROR - Not a file " + dir_name + "/" + file_name
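
To download something other than the .fna files, the only part that really changes is the extension test. Here is a minimal sketch of that idea, separated from the FTP calls so it can be tried without a connection. The helper names (wanted, local_target, plan_downloads) and the sample directory and file names are made up for illustration; they are not part of the original script or of ftputil.

```python
import os

def wanted(file_name, extensions=(".fna",)):
    """Return True if file_name ends with one of the given extensions."""
    return file_name.endswith(tuple(extensions))

def local_target(base_path, dir_name, file_name):
    """Build the local path for a remote file, expanding a leading '~'."""
    return os.path.join(os.path.expanduser(base_path), dir_name, file_name)

def plan_downloads(listing, base_path, extensions=(".fna",),
                   remote_root="/genomes/Bacteria/"):
    """Turn {remote_dir: [file names]} into (remote_path, local_path) pairs.

    listing stands in for what host.listdir() would return per directory.
    """
    plan = []
    for dir_name in sorted(listing):
        for file_name in listing[dir_name]:
            if wanted(file_name, extensions):
                remote = remote_root + dir_name + "/" + file_name
                local = local_target(base_path, dir_name, file_name)
                plan.append((remote, local))
    return plan

# Hypothetical listing, only to show the shape of the data.
example_listing = {
    "Some_organism": ["NC_000001.fna", "NC_000001.gbk", "README"],
}
example_plan = plan_downloads(example_listing, "/tmp/genomes",
                              extensions=(".fna", ".gbk"))
```

Passing extensions=(".gbk",) would pick up the GenBank files instead, which is the same switch the commented-out line in the script makes by hand.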

Wednesday, October 27, 2010

Dreaded Words

There are certain phrases that are bound to produce fear in the heart of any mortal human being.  For instance, no parent wants to hear the words "Do you remember how Kevin's arm used to bend like this?"  It's a bad sign.  It's also never good when a distant acquaintance calls you up, pretending to have a sudden, profound interest in your life.  You know they really just want you to spam all your email contacts with a "great internship opportunity" or (even worse) set you up with a "tall friend."

I'm beginning to learn that there's another phrase that incites just as much fear and dread in my heart:  "the reader may enjoy proving this as an exercise."  This is the sign of an extremely accomplished (but also somewhat sick and perverted) theoretician.


Friday, October 22, 2010

Steely-Eyes

Although I am a computer scientist, I wouldn't consider myself to be entirely introverted.  I can work alone on a project for 20 hours straight until 5am, but I also have been guilty of postponing studying in favor of socializing.  This inclination to treat my associates as human beings has, unfortunately, left me with a poor understanding of some of the more complicated aspects of a computer scientist.

Let me give you an example:  The Computer Scientist's Gaze.

When individual A matches the searching gaze of individual B (given that A and B are elements of the set of all Computer Scientists), this process is a mere formality for what must come next.  A must always submit themselves to an endless pontification from B about B's work, its complex theory, and its possibility to solve the most complex of problems (WLOG, we can assume these problems include—but are not limited to—world peace, starving children in Ethiopia, and any religious debate).  Through the course of this dialogue, A will also be required to understand several sloppily-written proofs on the back of a greasy napkin, providing some level of feedback, if only to suggest that "It would be easier to understand for a reader less knowledgeable than I if you changed the variable r here to be the Greek symbol ρ."  By the end of the mind-numbing conversation, A will also have promised B (upon penalty of another painfully boring lecture) that he or she will attend the next group meeting, lecture series, and sit in at least half of the remaining periods of their graduate course.

It's easy to see why you don't want to catch someone's gaze when you're into computer science.  It's not that we hate each other's work, it's just that we've spent so much time trying to convince ourselves that we're actually doing something worthwhile (instead of just tweaking parameters and hoping for marginal success stories to publish).

Monday, September 13, 2010

Parallel Algorithms part XXX

Disclaimer:  If you have not taken a course in complexity theory of computation (or if you have a life), you might not find this post all that entertaining.  Just be warned.


I'm in a Parallel Algorithms Theory class right now, and since I've already taken a few classes in parallel programming and had lots of experience, I thought it might be insightful.  Unfortunately, I learned the truth about the class today and am still trying to find the missing link to reality.

It seems that there are three general steps to creating a "work-time optimal" algorithm.  These are, in order:
  1. Assume an extreme number of processors, such that p>>n. This works best for doing something trivial, such as searching a sorted array for a given element. n can be any arbitrary number between one and the size of an integer (around 4 billion).
  2. Create a convoluted algorithm such that:
    • The number of parallel steps is very small (Try getting close to O(log log n) )
    • The number of steps required to orchestrate parallelization and setup is significantly large (preferably close to n^2)
  3. Hand-wave and use several mathematical approximations with Big-Oh to show that your new algorithm is actually constant time O(1).
Once this has been done, publish your results and teach a Parallel Algorithms course.

Friday, August 27, 2010

Everything's Bigger in Texas

Well, I've been silent on the blogging for some time, but since I'll be away from home for a while, I figured I'd start it back up.

This past summer, I decided I wasn't living life "on the edge" enough, so within a ten-day time period, I bought my first car, graduated from BYU (a two-day affair), drove 22+ hours to an entirely new state (and almost country), and started graduate school at the University of Texas at Austin.  Oh.  And within 14 days, I had caused $1400 worth of damage to an otherwise steal-of-a-deal car.

I rather like it here.  I didn't really have many expectations, other than the oft-repeated statement of pride, "Everything's Bigger in Texas," so I was pleasantly surprised.

First, everything is bigger in Texas.  I ordered a glass of water at a local Tex-Mex restaurant ("it's authentic," Alyse's family said, which calls into question the authenticity of a food type whose title contains both the state it's cooked in and the country that influenced it), and it really was bigger.  You can see the straw barely poking up over the top on the back side:  I only needed refills once instead of the usual 6 or 7 times.  I'm glad I'm not in Germany.

I also like the openness of Austin.  Since Alyse served a mission here a few months ago, she drove down with me and stayed with her family.  Taking her home after we'd finished our day's festivities, we drove through the countryside.  I would have parked my car, got out and enjoyed the full moon and big farms (or at least rolled down my window to take in the smells) had it not been for the 80 degrees and 95% humidity at 11:30 at night.

Mostly, I love everybody's inner yee-haw. Walking from the institute building to school today, I saw a group of workers including a Hispanic man wearing cowboy boots.  He's real Tex-Mex.

Maybe I'll grow to love this place.  And maybe, during the cooler months this "winter," I'll even wear my cowboy boots once or twice.  Just for kicks.