I teach an introductory bioinformatics course. For yesterday's lecture I wanted to demonstrate to students just how much you can get done by properly combining all these awesome tools with the unix command line.
And that got me thinking ... so how much can you get done in a day ... how about an hour? ... then .... well, how about a minute ... a minute you say??? ... yeah right that's just crazy talk, sounds like ... mission impossible. Or is it really?
So I googled the Mission Impossible theme song, I found a version that is about 1 minute long and I came up with a challenge with the following rules:
- You may use any tool or background information that can be reasonably expected to be on a bioinformaticians' computer.
- You have to start with an empty folder
- Start the music and your script. Your script needs to finish before the theme song.
- At the end of the run your folder needs to contain a piece of information that on its own is noteworthy and publication quality information (say an essential part of a prior publication)
All right then - and here is my entry. It produces all major single nucleotide polymorphisms of the 2014 Ebola genome as published in Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak Science 2014. It requires parallel, efetch, bwa, samtools and freebayes.
The script follows. Let me tell you running it while the theme song is on makes it surprisingly exciting!
It even wastes 16 seconds for dramatic flair. Still finishes in time on a MacBook Air while running a presentation.
#!/bin/bash | |
# | |
# Cue the music: https://www.youtube.com/watch?v=JtyByefOvgQ | |
# | |
# | |
# Edit (Sept 2015): due to an unreasonable slowness of ncbi the script now requires | |
# that you fetch the runinfo fetch once before starting the script | |
# It should not really be slow but it takes almost 30 secons. | |
# | |
# mkdir -p ~/refs | |
# esearch -db sra -query PRJNA257197 | efetch -format runinfo > ~/refs/runinfo.txt | |
# | |
echo | |
echo '**********************************************************************' | |
echo '** Your mission - should you choose to accept it - is to reproduce **' | |
echo '** the SNP results of the Ebola Science paper in less than a minute **' | |
echo '**********************************************************************' | |
sleep 2 | |
echo | |
echo '*******************************************' | |
echo '** I guess better get to it! Tick! Tock! **' | |
echo '*******************************************' | |
sleep 2 | |
# Set up directories | |
mkdir -p sra bam refs | |
echo | |
echo "*** Accessing the NCBI remote firewall ... " | |
echo | |
# Create the reference genome | |
REF=refs/KJ660346.fa | |
efetch -db nucleotide -id KJ660346 -format fasta > $REF | |
# Fix ids to match annotator. | |
sed -i '' 's/gi|674810549|gb|KJ660346.2|/KJ660346/' $REF | |
bwa index $REF 2> /dev/null | |
samtools faidx $REF | |
echo | |
echo "*** Accessing mainframe for run information ... " | |
echo | |
# | |
#esearch -db sra -query PRJNA257197 | efetch -format runinfo > ~/refs/runinfo.txt | |
# | |
cat ~/refs/runinfo.txt | grep "Apr 14" | cut -f 1 -d ',' | grep SRR | head -15 > sraids.txt | |
echo | |
echo "*** Downloading secrets ... " | |
echo | |
cat sraids.txt | parallel fastq-dump -X 15000 -O sra --split-files {} 1> /dev/null | |
echo | |
echo "*** Decoding secrets ..." | |
echo | |
cat sraids.txt | parallel "bwa mem -R '@RG\tID:{}\tSM:{}\tLB:{}' $REF sra/{}_1.fastq sra/{}_2.fastq 2> /dev/null | samtools view -b - | samtools sort -o - tmp > bam/{}.bam" | |
echo | |
echo -e '*** Uploading Visual Basic to open socket ...' | |
echo | |
ls bam/*.bam | parallel samtools index {} | |
echo | |
echo "*** QUICK! DISARM THE EXPLOSIVES ..." | |
sleep 2 | |
echo | |
printf "*** CUT THE \e[1mRED\e[m WIRE " | |
sleep 2 | |
echo | |
printf "*** THE \e[1mOTHER RED\e[m WIRE" | |
sleep 2 | |
echo | |
echo "*** SNP ***" | |
sleep 2 | |
echo | |
freebayes -p 1 -f $REF bam/*.bam > results.vcf | |
echo | |
echo "*** Success ..." | |
echo | |
java -jar ~/lib/snpEff/snpEff.jar -v ebola_zaire results.vcf > ANNOTATED_VARIANTS.vcf | |
echo | |
echo "*** WARNING! The data will self destruct in one minute! ***" | |
echo | |
sleep 2 |
In one minute Chuck Norris sequenced his genome. And found zero mutations.
That would mean that Chuck Norris and Craig Venter are the same person? (Might explain a few things...)
Took you longer than a minute to write the script though..
yeah but Tom Cruise also has time to prepare for each mission - that's still fair comparison
What's really impressive about this is the power behind such basic and standard tools.
btw, I found a bug. the last few lines should read:
ha, looks like I was caught bluffing - funny!