Question

Best bioinfo one-liners?

25

Entering edit mode

9.6 years ago

Manu Prestat 4.1k

Whereas an infinity of efficient tools exists out there, it is sometimes still quicker for achieving simple tasks to execute a one linux command. I'm starting by sharing 3 I use quite often.

## 1 get the sequences length distribution form a fastq file using awk
zcat file.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'

##2 Reverse complement a sequence (I use that a lot when I need to design primers)
echo 'ATTGCTATGCTNNNT' | rev | tr 'ACTG' 'TGAC'

##3 split a multifasta file into single ones with csplit:
csplit -z -q -n 4 -f sequence_ sequences.fasta /\>/ {*}

I may be wrong, but I've not found such a list in Biostars.

So, what comes to your mind? I hope this post will yield some gold nuggets ;-)

linux • 8.7k views

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.6 years ago by Manu Prestat 4.1k

0

Entering edit mode

Paging umer.zeeshan.ijaz

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.6 years ago by Dan D 7.4k

0

Entering edit mode

I remember I saw this: Useful Bash Commands To Handle Fasta Files

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.5 years ago by sentausa ▴ 650

Ram · Answer 1 · 2015-05-15

6

Entering edit mode

9.6 years ago

Pierre Lindenbaum 164k

linearize fasta:

cat file.fasta | awk '/^>/{if(N>0) printf("\n"); ++N; printf("%s\t",$0);next;} {printf("%s",$0);}END{printf("\n");}'

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.6 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

Just a bit shorter? :)

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file.fa

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.6 years ago by iraun 6.2k

0

Entering edit mode

Wow, a very compact way to handle multiline entries. Excellent!

ADD REPLY • link 9.6 years ago by Manu Prestat 4.1k

0

Entering edit mode

Even shorter! :)

awk 'NR==1 {print ; next} {printf (/^>/) ? "\n"$0"\n" : $1}' file.fas

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.6 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

@Frederic: doesn't work.

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

@Pierre: hum, it works well on my config (GNU Awk 4.1.1). Maybe a problem with older Awk versions (Awk < 4.0)?

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.5 years ago by Frédéric Mahé ★ 3.2k

Ram · Answer 2 · 2015-05-15

5

Entering edit mode

9.6 years ago

David Langenberger 11k

## fastq2fasta
zcat file.fastq.gz | paste - - - - | perl -ane 'print ">$F[0]\n$F[2]\n";' | gzip -c > file.fasta.gz

## bam2bed
samtools view file.bam | perl -F'\t' -ane '$strand=($F[1]&16)?"-":"+";$length=1;$tmp=$F[5];$tmp =~ s/(\d+)[MD]/$length+=$1/eg;print "$F[2]\t$F[3]\t".($F[3]+$length)."\t$F[0]\t0\t$strand\n";' > file.bed

##bam2wig
samtools mpileup -BQ0 file.sorted.bam | perl -pe '($c, $start, undef, $depth) = split;if ($c ne $lastC || $start != $lastStart+1) {print "fixedStep chrom=$c start=$start step=1 span=1\n";}$_ = $depth."\n";($lastC, $lastStart) = ($c, $start);' | gzip -c > file.wig.gz

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.6 years ago by David Langenberger 11k

0

Entering edit mode

+1 Especially for the 1st one which avoids the installation of a specific tool (e.g. fastx-toolkit).

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.6 years ago by Manu Prestat 4.1k

Ram · Answer 3 · 2015-05-15

5

Entering edit mode

9.6 years ago

venu 7.1k

Here is a list I often use during my work.

My work always starts with a list of ids for which I need to extract fasta sequences..samtools - always a better choice

cut -c 2- ids.txt | xargs -n 1 samtools faidx file.fasta > out.fasta

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.6 years ago by venu 7.1k

1

Entering edit mode

Nice! And thanks for the link with your whole list.

ADD REPLY • link 9.6 years ago by Manu Prestat 4.1k

1

Entering edit mode

I still cannot help but find the existence of a list of 'favourite one liners' a bit paradoxical, in my opinion.

If they are used routinely for your day-to-day tasks, maybe it is time to 'graduate' them to an utility script, that may be easier to maintain and/or might benefit the others?

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.6 years ago by Saulius Lukauskas ▴ 540

1

Entering edit mode

One-liners are like old-time Lego(tm) bricks, you can build lots of fun and different stuff with'em.

If you pack them into tools, they will become like current-day Legos: you attach a head, two arms and two legs to a body and voilá, "constructed" a robot.

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.6 years ago by h.mon 35k

Ram · Answer 4 · 2015-05-15

Reproducible subsampling of a FASTQ file:

cat file.fq | paste - - - - | awk 'BEGIN{srand(1234)}{if(rand() < 0.01) print $0}' | tr '\t' '\n' > out.fq

srand() is the seed for the random number generator - keeps the subsampling the same when the script is run multiple times. 0.01 is the % of reads to output.

Deinterleaving a FASTQ:

cat file.fq | paste - - - - - - - - | tee >(cut -f1-4 | tr '\t' '\n' > out1.fq) | cut -f5-8 | tr '\t' '\n' > out2.fq

Archiving / unarchiving files:

Not exactly a one-liner, but I find this a very useful way to make files immutable, and most people probably won't have come across it. An immutable file cannot be modified or written to, deleted, renamed, or moved - even by the superuser. When I receive data or download it, this is the first thing I do, and it's saved a lot of heartache over the years.

sudo chattr +i file.fq #To archive a file
sudo chattr -i file.fq #To unarchive it again

Ram · Answer 5 · 2015-05-15

4

Entering edit mode

9.6 years ago

iraun 6.2k

There are quite a lot useful one liners. This is just a drop in the ocean :).

## Number of reads in a fastq file
cat file.fq | echo $((`wc -l`/4))

## Single line fasta file to multi-line fasta of 60 characteres each line
awk -v FS= '`/^>/{print;next}`{for (i=0;i<=NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}}' file

## Sequence length of every entry in a multifasta file
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' file.fa

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.6 years ago by iraun 6.2k

1

Entering edit mode

Single line fasta file to multi-line fasta of 60 characteres each line

How about using fold -w 60 file?

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Yes, that one also works :). I didn't remember fold command!

ADD REPLY • link 9.6 years ago by iraun 6.2k

0

Entering edit mode

wouldn't this one wrap long header lines as well?

ADD REPLY • link 9.6 years ago by h.mon 35k

0

Entering edit mode

Yes, good point h.mon, fold command will also split the sequence names if their size is bigger than 60.

ADD REPLY • link 9.6 years ago by iraun 6.2k

0

Entering edit mode

Why don't you just use word count for "Sequence length of fasta file"?

grep -v ">" file.fa | wc -c

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.6 years ago by David Langenberger 11k

2

Entering edit mode

Ah, got it... for every entry

ADD REPLY • link 9.6 years ago by David Langenberger 11k

0

Entering edit mode

Yep :). Of course if you want the total length the most straightforward command is your command.

ADD REPLY • link 9.6 years ago by iraun 6.2k

0

Entering edit mode

As long as you don't forget the "

cf What Are The Most Common Stupid Mistakes In Bioinformatics? ;-)

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.6 years ago by Manu Prestat 4.1k

Ram · Answer 6 · 2015-05-15

2

Entering edit mode

9.6 years ago

h.mon 35k

I've used several "one-liners" from Scriptome.

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.6 years ago by h.mon 35k

0

Entering edit mode

Although the idea of this post was to gather one-liners of GNU/Linux tools mainly, this is good to know. One other project of this type is Biopieces.

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.6 years ago by Manu Prestat 4.1k

Ram · Answer 7 · 2015-05-15

1

Entering edit mode

9.6 years ago

Harshal ▴ 60

Running fastqc for all the fastq files in multiple sample folders in parallel mode.

for i in Sample_*/*.fastq.gz ; do echo echo fastqc $i\|qsub -cwd; done # will create commands
for i in Sample_*/*.fastq.gz ; do echo echo fastqc $i\|qsub -cwd; done|sh #launches jobs on cluster

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.6 years ago by Harshal ▴ 60

1

Entering edit mode

Most of these are better done with xargs:

ls Sample_*/*.fastq.gz | xargs -i echo fastqc {} \| qsub -cwd

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.6 years ago by lh3 33k

Ram · Answer 8 · 2015-05-15

Parallelize time-consuming processes on Unix systems

Using mpileup for a whole genome can take forever. So, handling each chromosome separately and parallely running them on several cores will speed up your pipeline. Using xargs you can easily realize it.

Example usage of xargs (-P is the number of parallel processes started - don't use more than the number of cores you have available):

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 24 sh -c "samtools mpileup -BQ0 -d 100000 -uf yourGenome.fa -r {} yourFile.bam | bcftools view -vcg - > tmp.{}.vcf"

To merge the results afterwards, you might want to do something like this:

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].bcf >> yourFile.vcf");'