Question

Forum:Bioinformatics "Cheat Sheet"

110

13.9 years ago

Chris Miller 22k

Inspired by Keith Robison's post on 'cheat sheets', what would you put on a cheat sheet for bioinformatics? This might include one-line scripts, conversion factors, handy rules of thumb, etc.

Some of Keith's suggestions, which have a biology slant:

IUPAC ambiguity codes for nucleotides.

Amino acid single letter codes.

SI prefixes in order.

Powers of 2.

Tm estimation using G+C and A+T counts.

1 human genome ~= 7 pg of DNA

1 bp = 660 daltons
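
The Tm rule of thumb from G+C and A+T counts is often written as the Wallace rule, Tm ≈ 2 °C × (A+T) + 4 °C × (G+C), and is usually quoted only for short oligos. A minimal sketch, assuming that formula is the one meant; the function name `tm_wallace` is made up for illustration:

```shell
# Wallace-rule melting temperature estimate (in degrees C) for a short oligo.
# tm_wallace is a hypothetical helper name, not part of any toolkit.
tm_wallace() {
    local seq at gc
    # Upper-case the input so lower-case sequences are counted too.
    seq=$(printf '%s' "$1" | tr 'acgt' 'ACGT')
    # Count A/T and G/C by deleting every other character.
    at=$(printf '%s' "$seq" | tr -cd 'AT' | wc -c)
    gc=$(printf '%s' "$seq" | tr -cd 'GC' | wc -c)
    echo $(( 2 * at + 4 * gc ))
}

tm_wallace ACGTACGTAC   # prints 30
```

Ambiguity codes and anything other than A, C, G, T are simply ignored by this sketch.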

cheat sheet • 23k views

ADD COMMENT • link updated 2.4 years ago by Kevin Blighe 89k • written 13.9 years ago by Chris Miller 22k

3

Could you please collect the answers and put them on a cheat sheet blog somewhere?

ADD REPLY • link 13.9 years ago by Chris Evelo 10k

1

Instead of a blog post, maybe a GitHub repo with Markdown/LaTeX would be better?

ADD REPLY • link 13.9 years ago by Piotr Byzia ▴ 10

0

And/or incorporating answers here would be nice.

ADD REPLY • link 13.9 years ago by Michael Schubert ★ 7.1k

0

Hey! Brilliant idea to have a cheat sheet. But this list will grow endlessly unless you add subcategories, such as cheat sheets for researchers working in bioalgorithm development, genomics, data analysis, etc. That would make it more organised.

ADD REPLY • link 13.9 years ago by Dataminer ★ 2.8k

Istvan Albert · Answer 1 · 2011-03-22

42

13.9 years ago

Pierre Lindenbaum 165k

5' : left
3' : right

;-)

ADD COMMENT • link updated 12.4 years ago by Istvan Albert 102k • written 13.9 years ago by Pierre Lindenbaum 165k

10

-1 my apologies to Pierre as my objection is rather pedantic; if you are looking at coordinates relative to the forward strand (e.g. Refgene), then a gene on the reverse strand would be 5' right and 3' left.

ADD REPLY • link 13.9 years ago by Ian 6.1k
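
Ian's caveat can be made concrete: a feature on the reverse strand reads 5' to 3' as the reverse complement of the forward-strand sequence. A small sketch using `rev` and `tr`; the function name `revcomp` is only illustrative, and ambiguity codes are left untouched:

```shell
# Reverse-complement DNA read from stdin, one sequence per line.
# revcomp is an illustrative name; rev and tr are standard Unix tools.
revcomp() {
    rev | tr 'ACGTacgt' 'TGCAtgca'
}

echo ATGC | revcomp   # prints GCAT
```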

8

+1 for the smile while reading ;)

ADD REPLY • link 13.9 years ago by Michael Schubert ★ 7.1k

1

I actually have a Post-it note on my cubicle wall with a little picture of genes on each strand and 5' and 3' with little arrows.

ADD REPLY • link 13.8 years ago by Madelaine Gogol 5.3k

0

@Ian fair enough :-)

ADD REPLY • link 13.9 years ago by Pierre Lindenbaum 165k

Ram · Answer 2 · 2011-04-25

26

13.8 years ago

Madelaine Gogol 5.3k

Not completely bioinformatics oriented, but some things I've found handy.

#subtract a small file from a bigger file (lines of filesmall are treated as regexes; add -Fx to match whole lines literally)
grep -vf filesmall filebig

#use awk to rearrange columns
awk '{print $2 " " $1}' file.txt

#sort a bed file by chrom, position
sort -k1,1 -k2,2n file.bed > file.sort.bed

#strip header
tail -n +2 file > file.nh

#find and replace over multiple files
perl -pi -w -e 's/255,165,0/255,69,0/g' *.wig

#print line 83 from a file
sed -n '83p' file.txt

#insert a header line (GNU sed syntax)
sed -i -e '1itrack name=test type=bedGraph' file.bed

#sum column one from a file
awk '{s+=$1} END {print s}' mydatafile

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 13.8 years ago by Madelaine Gogol 5.3k

2

One of my awk aliases is the mean and sd of column 1:

awk '{s+=$1;s2+=($1*$1)} END {print s/NR,sqrt((NR*s2-s*s)/(NR*(NR-1)))}'

ADD REPLY • link updated 5.4 years ago by Ram 44k • written 13.3 years ago by Chris Penkett ▴ 490
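
The one-liner above relies on the computational form of the sample variance, s² = (n·Σx² − (Σx)²) / (n·(n − 1)), so it needs only one pass over the data. A sketch of the same thing wrapped in a function, with a guard for fewer than two values; the name `meansd` is made up here:

```shell
# Mean and sample standard deviation of column 1, read from stdin.
# meansd is an illustrative function name.
meansd() {
    awk '{ s += $1; s2 += $1 * $1 }
         END {
             # The sample SD is undefined for fewer than two values.
             if (NR > 1) print s / NR, sqrt((NR * s2 - s * s) / (NR * (NR - 1)))
         }'
}

printf '2\n4\n6\n' | meansd   # prints: 4 2
```

For very large values this formula can lose precision to cancellation, so treat it as a quick check, not a replacement for a statistics package.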

1

Oh, that could be useful too, thanks. awk is still a dark interesting rabbit hole to me.

ADD REPLY • link 13.3 years ago by Madelaine Gogol 5.3k

Ram · Answer 3 · 2011-03-22

18

13.9 years ago

Aaronquinlan 12k

I have a vision of this cheat sheet being an extensive, very convenient set of environment variables and man pages. It should be versioned and should be on something like GitHub.

For example:

####################
# HG19
####################
export CHR1_SIZE=249250621
export CHR2_SIZE=243199373
...

####################
# Shortcuts
####################
export SUMCOL='awk '\''{ SUM += $1} END { print SUM}'\'

Other informational stats should be rolled into "man" entries. For example,

man dna
man iupac
man 2_powers
man log_examples

This may be utterly harebrained, but it seems useful to me. A community-based, focused wikipedia and shortcut library on the command line.

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 13.9 years ago by Aaronquinlan 12k

3

Hi Aaron, I started this today :-) https://github.com/lindenb/bioman

ADD REPLY • link 11.8 years ago by Pierre Lindenbaum 165k

2

+1 for very clever idea. I like this a lot.

ADD REPLY • link 13.9 years ago by Casey Bergman 18k

0

Dotfiles can be intensely personal things. That said, I'd love to have a big central repository of useful stuff to pick and choose from.

ADD REPLY • link 13.9 years ago by Chris Miller 22k

0

Fair point, yeah a repo that is organized by type would be more useful.

ADD REPLY • link 13.9 years ago by Aaronquinlan 12k

0

Eh, shortcuts aren't a cheat sheet, they belong in a .bashrc file. So:

sumcol(){
   awk '{SUM += $1} END { print SUM }'
}

But +1 for the manual pages suggestion, one of the man pages I constantly return to is man ascii. Can we create a b section?

ADD REPLY • link updated 5.4 years ago by Ram 44k • written 13.9 years ago by Ketil 4.2k

Ram · Answer 4 · 2011-03-22

8

13.9 years ago

Alastair Kerr 5.3k

Missing from the list:

1 nucleosome = 147 bp
Crude amino acid to kilodalton conversion: number of amino acids × 0.11 ≈ kDa
Perl one-liners for line-ending conversion:
s/\015\012/\012/ # Windows -> Unix
s/\012/\015\012/ # Unix -> Windows

ADD COMMENT • link 13.9 years ago by Alastair Kerr 5.3k

1

or

perl -pi -e 's/\r\n/\n/g' input.file

ADD REPLY • link updated 5.4 years ago by Ram 44k • written 13.5 years ago by Ying W ★ 4.3k
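
When perl is not at hand, the same line-ending conversions can be sketched with awk as stream filters. The function names `crlf_to_lf` and `lf_to_crlf` are invented for this example:

```shell
# Windows (CRLF) -> Unix (LF): strip one trailing carriage return per line.
crlf_to_lf() {
    awk '{ sub(/\r$/, ""); print }'
}

# Unix (LF) -> Windows (CRLF): normalise first so existing CRLF is not doubled.
lf_to_crlf() {
    awk '{ sub(/\r$/, ""); printf "%s\r\n", $0 }'
}
```

Usage would look like `crlf_to_lf < input.file > output.file`. Unlike `perl -pi`, these do not edit the file in place.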

Ram · Answer 5 · 2014-05-02

8

10.8 years ago

amolkolte1989 ▴ 80

I have a collection of handpicked reference cards. It helps every now and then.

I like to call it a Bioinformatician's Pocket Reference!

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 10.8 years ago by amolkolte1989 ▴ 80

Istvan Albert · Answer 6 · 2011-03-23

7

13.9 years ago

Fred Fleche 4.3k

Correspondence between genome version nomenclatures: hg19 (UCSC) = GRCh37 (NCBI)

ADD COMMENT • link updated 12.4 years ago by Istvan Albert 102k • written 13.9 years ago by Fred Fleche 4.3k

0

The UCSC Assembly Releases and Versions FAQ does a great job of summarizing a lot of these. Each genome build in the table lists: species, UCSC version, release date, release name/id, and status.

ADD REPLY • link updated 3.3 years ago by Ram 44k • written 10.7 years ago by Malachi Griffith 20k

0

...except for chrM/MT where UCSC have a different sequence than the accepted correct one.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Danielk ▴ 640

0

Warning: there is a small difference between hg19 and GRCh37 that can significantly affect downstream analysis:

in GRCh37, the chromosome names are 1, 2, 3, 4, 5, 6, 7, 8, 9, ..., X, Y

in hg19, the chromosome names are chr1, chr2, chr3, chr4, ..., chrX, chrY

So results mapped to hg19 cannot be used directly with GRCh37.

Hope others can avoid the trap I fell into.

ADD REPLY • link 7.4 years ago by Chen Sun ★ 1.1k
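
For the naming difference alone, a sketch of converting the first column of a tab-separated, BED-like file between the two styles. The function names `strip_chr` and `add_chr` are hypothetical. This only renames the primary chromosomes: it does nothing about the chrM/MT sequence difference mentioned above, and unplaced contigs, which are named differently in the two builds, are not handled.

```shell
# chr1 -> 1 in column 1; comment, track and browser lines pass through unchanged.
strip_chr() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^(#|track|browser)/ { print; next }
         { sub(/^chr/, "", $1); print }'
}

# 1 -> chr1 in column 1; names that already start with chr are left alone.
add_chr() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^(#|track|browser)/ { print; next }
         { if ($1 !~ /^chr/) $1 = "chr" $1; print }'
}
```

Usage would be along the lines of `strip_chr < hg19.bed > grch37_style.bed`.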

0

and some degenerate bases have been replaced by 'N' for chr3 and chrY. see: http://plindenbaum.blogspot.fr/2013/07/g1kv37-vs-hg19.html

ADD REPLY • link 7.4 years ago by Pierre Lindenbaum 165k

score 6 · Answer 7 · 2011-03-22

6

13.9 years ago

Mary 11k

This reminds me a little bit of BioNumbers: http://bionumbers.hms.harvard.edu

ADD COMMENT • link 13.9 years ago by Mary 11k

score 6 · Answer 8 · 2011-03-22

6

13.9 years ago

Thaman ★ 3.3k

AUG = Initiation
UAA, UGA, UAG = Termination
A-T = 2 hydrogen bonds, G-C = 3 hydrogen bonds, and adjacent bases are separated by 3.4 Å
Purines = Adenine & Guanine AND Pyrimidines = Cytosine, Uracil & Thymine
DNA replication is semi-conservative

More coming... :D

ADD COMMENT • link 13.9 years ago by Thaman ★ 3.3k

score 5 · Answer 9 · 2011-03-22

5

13.9 years ago

Michael Schubert ★ 7.1k

Amino acid weights, IEPs
some FASTA statistics one-liners
quick overview of possible cli BLAST inputs/outputs (reading -help takes so long as they are all over the place)
BLAST tabular output column names
Karlin-Altschul formula
definition of PAM and BLOSUM
order of AAs in a substitution matrix/PSSM

ADD COMMENT • link 13.9 years ago by Michael Schubert ★ 7.1k
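
As one possible "FASTA statistics one-liner" of the kind listed above, here is a sketch that reports the number of sequences, total residues and mean length. The function name `fasta_stats` and the tab-separated output layout are choices made for this example only:

```shell
# Sequence count, total residues and mean length for FASTA on stdin.
# fasta_stats is an illustrative name; output is tab-separated.
fasta_stats() {
    awk '/^>/ { n++; next }                          # header line: count a record
         { gsub(/[ \t\r]/, ""); total += length($0) }  # sequence line: add its length
         END { if (n) printf "%d\t%d\t%.1f\n", n, total, total / n }'
}

printf '>a\nACGT\nAC\n>b\nGGGG\n' | fasta_stats   # prints: 2  10  5.0 (tab-separated)
```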

Ram · Answer 10 · 2011-03-22

5

13.9 years ago

Aleksandr Levchuk 3.2k

The cheat sheet for programming in R would be what you are looking for.

Here are good manuals that my advisor, Thomas Girke, wrote:

The HT Sequence Analysis manual was recommended in Recommend Your Favorite Introductory "R In Bioinformatics" Books And Resources

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 13.9 years ago by Aleksandr Levchuk 3.2k

Tim · Answer 11 · 2011-03-22

4

13.9 years ago

Chris Miller 22k

I'll start off with a few of my own:

An alpha-helix has 3.6 residues per turn.
A haploid human genome has a little over 3 billion bases and contains around 20,000 protein-coding genes.
A handy alias for summing up a column of numbers from the command line:

alias sumcol='awk '\''{ SUM += $1} END { print SUM}'\'

ADD COMMENT • link updated 13.9 years ago by Tim ▴ 350 • written 13.9 years ago by Chris Miller 22k

Ram · Answer 12 · 2011-03-22

4

13.9 years ago

Jeremy Leipzig 23k

I would like a cheat sheet of arguments for common bioinformatics executables (e.g. blast, clustal, bowtie, bwa, fastx-toolkit), the popular bioperl scripts (like bp_seqfeature_load.pl), as well as the most common bioinformatics things in bash (e.g. mass renaming: for f in *fasta; do mv "$f" "$(echo "$f" | sed -e 's/\.fasta$/.fa/')"; done)

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 13.9 years ago by Jeremy Leipzig 23k

4

Check out the 'rename' program that comes with perl; it is so much better. In this case:

rename 's/fasta$/fa/' *fasta

(I assume it comes with perl as it was written by Larry Wall; it is standard on all the latest Ubuntu systems.)

ADD REPLY • link updated 5.4 years ago by Ram 44k • written 13.9 years ago by Alastair Kerr 5.3k

1

With bash, you can use pattern substitution: for f in *.fasta ; do mv "$f" "${f/fasta/fa}" ; done. It is more than twice as fast as calling sed (on a set of 1000 files).

ADD REPLY • link 13.1 years ago by Frédéric Mahé ★ 3.2k

0

I find running the executable without arguments usually reminds me what they are ;-)

ADD REPLY • link 13.9 years ago by Neilfws 49k

0

Mass renaming is fun until you have to do it on someone else's directory and the file names are full of spaces and accents...

ADD REPLY • link 13.9 years ago by Eric Normandeau 11k
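
Spaces and accents are mostly a quoting problem. A sketch of a rename loop that should survive them, using only shell suffix removal; the function name `rename_fasta_to_fa` is made up for illustration:

```shell
# Rename every *.fasta file in a directory (default: current) to *.fa.
# Quoting every expansion keeps names with spaces or accents intact.
rename_fasta_to_fa() {
    local dir f
    dir=${1:-.}
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue          # the glob matched nothing
        mv -- "$f" "${f%.fasta}.fa"      # strip the suffix, append .fa
    done
}
```

Calling `rename_fasta_to_fa "some dir"` would rename the files in that directory. Existing `.fa` files with the same name would be overwritten, so try it on a copy first.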

score 4 · Answer 13 · 2011-03-22

4

13.9 years ago

Pierre Lindenbaum 165k

My cheat sheet would contain the length of the human chromosomes.

ADD COMMENT • link 13.9 years ago by Pierre Lindenbaum 165k

8

For which assembly? :-P

ADD REPLY • link 13.9 years ago by Chris Miller 22k

Istvan Albert · Answer 14 · 2011-03-22

4

13.9 years ago

Kevin ▴ 640

Building up my list here. A blog post will be a good record for myself when I change computers or move offices, which is when I usually lose my printed copies.

http://kevin-gattaca.blogspot.com/2011/03/cheat-sheets-galore-bioinformatics.html

ADD COMMENT • link updated 12.4 years ago by Istvan Albert 102k • written 13.9 years ago by Kevin ▴ 640

score 3 · Answer 15 · 2011-03-24

Very cool question, here's mine, which probably isn't all that relevant to most biostar members but is popular in our lab:

A table of nucleotide substitution models, and how to set them in the most commonly used programs

Still working on it (you can implement the exotic models in most of the software, but not easily)

score 3 · Answer 16 · 2011-03-27

3

13.9 years ago

Pals ★ 1.3k

My cheat sheet would be

Amino acid structures with their properties

And I would also consider Biostar because it is in fact more than google.

ADD COMMENT • link 13.9 years ago by Pals ★ 1.3k

score 3 · Answer 17 · 2011-09-08

3

13.5 years ago

ALchEmiXt ★ 1.9k

A useful addition would be a landscape or flowchart of how to get from one file format to the next... bioinformatics is about parsing it right... :)

ADD COMMENT • link 13.5 years ago by ALchEmiXt ★ 1.9k

0

Interesting idea, but off the top of my head I can only think of fastq (1) -> bam (2) -> vcf (3) -> annotated SNP list, where the path taken depends on the software used to (1) map/align, (2) call SNPs, etc. Are there particular file formats you are thinking about?

ADD REPLY • link 13.0 years ago by Kevin ▴ 640

0

Maybe some simple ones are the interconversion of fastq and fasta+qual; fastq (Solexa qualities) to fastq (Sanger and so forth); conversion of annotation files like EMBL and GBK into each other and/or GFF; conversion of all sorts of IDs (but there are some good tools for that)... and maybe some more.

ADD REPLY • link 13.0 years ago by ALchEmiXt ★ 1.9k
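
The simplest of those conversions, FASTQ to FASTA, can be sketched in awk under the assumption that every record is exactly four lines (no wrapped sequence or quality lines). The function name `fastq_to_fasta` is illustrative:

```shell
# Convert 4-line-per-record FASTQ on stdin to FASTA on stdout.
# Assumes sequence and quality are each on a single line.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header: swap leading @ for >
         NR % 4 == 2 { print }'                    # sequence line, unchanged
}
```

Usage would be `fastq_to_fasta < reads.fastq > reads.fasta`. Quality values are discarded, so the conversion cannot be reversed.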

Ram · Answer 18 · 2011-03-22

2

13.9 years ago

Neilfws 49k

I like this question, so at the risk of sounding trite: my cheat sheet = a Google search. I store very little information these days; it's as quick and easy to search for it as and when required.

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 13.9 years ago by Neilfws 49k

3

That has an uneven success rate. Which Google query will lead you directly to the answer to, for example, what percentage of the human genome is contained in transcription units?

ADD REPLY • link 13.9 years ago by Jeremy Leipzig 23k

2

"At present, about one-third of the human genome appears to be transcribed" http://bit.ly/ga2YFU just the amount of surfing I had to do and still not find that number is evidence enough that a genomics cheat sheet would be handy thing

ADD REPLY • link 13.9 years ago by Jeremy Leipzig 23k

1

Did you try this one? "human genome percentage transcribed" It gives this as the first hit: http://bionumbers.hms.harvard.edu/bionumber.aspx?s=y&id=103746&ver=2 That is a nice bonus, but the second hit: http://www.genome.gov/25521554 tells you what you asked for (1.5-2%)

ADD REPLY • link 13.9 years ago by Chris Evelo 10k

score 2 · Answer 19 · 2011-03-22

2

13.9 years ago

Paige ▴ 40

Great ideas! I'd add a Blosum62 substitution matrix to the list.

ADD COMMENT • link 13.9 years ago by Paige ▴ 40

score 2 · Answer 20 · 2011-03-24

2

13.9 years ago

Samuel Lampa ★ 1.3k

List of most used file formats (.pdb, .bam, .fastq, etc etc), what information they contain, and what they can be used for? (and possibly the most well-known software(s) that reads them)

ADD COMMENT • link 13.9 years ago by Samuel Lampa ★ 1.3k

Ram · Answer 21 · 2011-03-25

2

13.9 years ago

Michi ▴ 990

great idea!

a bit of biology:

the citric acid cycle! http://student.ccbcmd.edu/~gkaiser/biotutorials/cellresp/images/u4fg35.jpg

or here you can also find it, along with other must-knows

For R and regex I already have separate cheat sheets on my desk. One thing I am still missing, though, is a regex cheat sheet showing which characters have to be escaped in which environment, and which back-reference syntax (\ or $) each one uses.

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 13.9 years ago by Michi ▴ 990

0

I really like this website for regex: http://www.sarand.com/td/ref_perl_pattern.html

ADD REPLY • link 13.5 years ago by Ying W ★ 4.3k

Istvan Albert · Answer 22 · 2011-04-25

2

13.8 years ago

Goldbear ▴ 130

Biology by the numbers

http://www.rpgroup.caltech.edu/publications/SnapShot2010.pdf

ADD COMMENT • link updated 12.4 years ago by Istvan Albert 102k • written 13.8 years ago by Goldbear ▴ 130

Ram · Answer 23 · 2011-03-24

1

13.9 years ago

hadasa ★ 1.0k

Number of sequences in a FASTA file:
```
grep \> file_name | wc -l
```

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 13.9 years ago by hadasa ★ 1.0k

2

there's even a shorter solution : grep -c ">" filename :-)

ADD REPLY • link 13.9 years ago by Pierre Lindenbaum 165k

1

@Pierre, I always liked the piped version more. It's only 3 or 5 symbols longer, but it is easy to swap wc with less or another grep, à la LEGO.

ADD REPLY • link 13.9 years ago by Aleksandr Levchuk 3.2k

0

grep > file_name will just truncate your file. needs quotes around the ">" as per @Pierre.

ADD REPLY • link updated 5.4 years ago by Ram 44k • written 13.9 years ago by brentp 24k

0

the editor seems to escape my > so had to write \>

ADD REPLY • link updated 5.4 years ago by Ram 44k • written 13.9 years ago by hadasa ★ 1.0k

0

grep -c "^>" your_fasta

The > sign has to be the first on the line

ADD REPLY • link updated 5.4 years ago by Ram 44k • written 13.9 years ago by Darked89 4.7k

Ram · Answer 24 · 2014-04-16

##Tabulated BLAST header
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

## go to the end of file in Vi editor
G (shift + g)

##substitute in Vi editor

:%s/Soxgene/Foxgene/g

##remove exact duplicate sequences from fasta file

sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -t "$(printf '\t')" -f -k 2,2 | sed '/^$/d' | sed -e 's/^/>/' -e 's/\t/\n/'

##remove blank lines
sed '/^$/d' file.fasta >Noblanks_file.fasta.out
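
The duplicate-removal pipeline above joins each record onto one line and keeps one record per distinct sequence. The same idea can be sketched in awk, which avoids the reliance on # and @ not appearing in headers and keeps the input order. The function name `dedup_fasta` is invented here. As with the sort -f version, comparison is case-insensitive and sequences come out unwrapped.

```shell
# Keep the first record for each distinct sequence in FASTA read from stdin.
# dedup_fasta is an illustrative name; sequences are printed on a single line.
dedup_fasta() {
    awk 'function emit() {
             key = toupper(seq)                 # case-insensitive comparison
             if (hdr != "" && !(key in seen)) {
                 seen[key] = 1
                 print hdr
                 print seq
             }
         }
         /^>/ { emit(); hdr = $0; seq = ""; next }   # new record: flush the previous one
         { seq = seq $0 }                            # join wrapped sequence lines
         END { emit() }'
}
```

Usage would be `dedup_fasta < file.fasta > file.dedup.fasta`. All distinct sequences are held in memory, so very large files may be better served by the sort-based approach.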