In this tutorial I will introduce some concepts related to unix piping. Piping is a very useful feature to avoid creation of intermediate use once files. It is assumed that bedtools, samtools, and bwa are installed.
Lets begin with a typical command to do paired end mapping with bwa: (./
means look in current directory only)
# -t 4 is for using 4 threads/cores
bwa aln -t 4 ./hg19.fasta ./s1_1.fastq > ./s1_1.sai
bwa aln -t 4 ./hg19.fasta ./s1_2.fastq > ./s1_2.sai
bwa sampe ./hg19.fasta ./s1_1.sai ./s1_2.sai ./s1_1.fastq ./s1_2.fastq > s1.sam
Supposed we wish to compress sam to bam, sort, remove duplicates, and create a bed file.
samtools view -Shu s1.sam > s1.bam
samtools sort s1.bam s1_sorted
samtools rmdup -s s1_sorted.bam s1_sorted_nodup.bam
bamToBed -i s1_sorted_nodup.bam > s1_sorted_nodup.bed
This workflow above creates many files that are only used once (such as s1.bam) and we can use the unix pipe utility to reduce the number intermediate files created. The pipe function is the character | and what it does is take the output from one command and sets it as input for next command (assuming next command knows how to deal with this input).
For example, when you type in head myfile.txt
the command 'head' will read first 10 lines of myfile.txt and output it (by default) to the display (stdout) so you will see the first 10 lines printed on your screen. However, if you were to do something like this:
head myfile.txt | wc -l
you will instead get 10
printed out on your screen. What the pipe command did was take the output of head command and used it as input for the wordcount (wc
) command. wc -l
command will count the number of lines passed in (which is 10 since head by default prints the first 10 lines). An equivalent command would be
head myfile.txt > myfile_top10_lines.txt
wc -l myfile_top10_lines.txt
By using the pipe command we are able to eliminate the intermediate file that was generated (myfile_top10_lines.txt
).
Below is an example of how pipe command can be used in samtools
bwa aln -t 4 ./hg19.fasta ./s1_1.fastq > ./s1_1.sai
bwa aln -t 4 ./hg19.fasta ./s1_2.fastq > ./s1_2.sai
bwa sampe ./hg19.fasta ./s1_1.sai ./s1_2.sai ./s1_1.fastq ./s1_2.fastq | \
samtools view -Shu - | \
samtools sort - - | \
samtools rmdup -s - - | \
tee s1_sorted_nodup.bam | \
bamToBed > s1_sorted_nodup.bed
A couple of things I wanted to point out here:
- The
\
symbol after the pipe symbol tells the terminal that I want to keep writing command on the next line, that way you can have multi-line commands instead of super-long command on one line. - The
tee
command takes the information being piped and creates a file while passing the data along the pipe, this command is useful if you want to save an intermediate file but still need to do additional processing. (Think of a T shaped drain pipe) - The
-
symbol in samtools it to tell the samtools program to take the input from pipe - It is not recommended you run this command all at once like I showed above if you have a giant bam file because
sort
will take forever. Running things in a pipe can be more space efficient since you are not storing intermediate files but can create issues/bottlenecks elsewhere. If problems arise, run the pieces without piping to troubleshoot. - The two
bwa aln
commands can be run in parallel (for example: on another machine) before thesampe
command
Other common uses of pipes that I have not shown above it to pipe output into snp calling, you can find some examples here: (NOTE: The pileup command is depreciated, the new way to call SNPs using samtools using mpileup command)
samtools merge - *.bam | tee merged.bam | samtools rmdup - - | tee rmdup.bam | samtools mpileup - uf ./hg19.fasta - | bcftools view -bvcg - | gzip > var.raw.bcf.gz
Compressing files with gzip can save a lot of space, but makes these files non-human readable (if you use the head
or less
command it will give you gibberish). To view these files you can use zless
and zcat
command.
zless var.raw.bcf.gz #similar to gzip -cd var.raw.bcf.gz | less
zcat var.raw.bcf.gz | head
In case you are wondering if there is a way to run bwa sampe
without storing intermediate sai files the command below will do this:
bwa sampe ./hg19.fasta <(bwa aln -t 4 ./hg19.fasta ./s1_1.fastq) <(bwa aln -t 4 ./hg19.fasta ./s1_2.fastq) ./s1_1.fastq ./s1_2.fastq | samtools view -Shb /dev/stdin > s1.bam
Notice here that I used the -b
command instead of -u
command in samtools view
. -u
is useful for piping since it spits out an uncompressed bam so you don't have to waste CPU cycles compressing/decompressing at every step, however, if you have a file you want to store, it would be better to compress it with -b
. Additionally, I used /dev/stdin
instead of using the -
symbol, they both mean the same thing (afaik) and I wanted to point it out that /dev/stdin
might be a more explicit way of showing things. (i.e. samtools view -Shu /dev/stdin
)
Update: newer versions of bwa use the bwa mem command to map:
# 4 cores, -M is for `Picard compatibility
bwa mem -M -t 4 ./hg19.fasta ./s1_1.fastq ./s1_2.fastq > s1.sam
This command would replace the following lines of code in the tutorial above:
bwa aln -t 4 ./hg19.fasta ./s1_1.fastq > ./s1_1.sai
bwa aln -t 4 ./hg19.fasta ./s1_2.fastq > ./s1_2.sai
bwa sampe ./hg19.fasta ./s1_1.sai ./s1_2.sai ./s1_1.fastq ./s1_2.fastq > s1.sam
While piping can be useful in some cases, it can also be bottlenecked by slow steps. Resource utilization can be monitored by programs like nmon and ole.tange's guide on using parallel in the comments below will be a better solution than using pipes when there are slow steps and you are processing many files.
I think there may be a typo in the piping example, where
samtools sort - - | \
should besamtools sort - | \
. Thanks for the great write-up. Cheers!Hi
I tried piping samtools. It is not working properly. Only first command is doing correctly.
Anyway thank you for sharing the concept of piping.
Deeps
very often when piping into a samtools command you will need a
-
to tell the program you are piping something in, you can also use/dev/stdin
instead of-
in the place of the file and it should also work, if you post an example I could help you outHi Ying,
I already posted the issue in the Biostar. I am sending you the link.
A: How To Combine Two Samtools Commands.
Thanks,
Deepthi
Thanks for the great tutorial! However, I think the
> ./s1_2.sai
part of the above bwa sampe command without intermediate sai files shouldn't be included.Also, do you know of a way to pipe into and out of the
samtools index
command?You're right, thx for catching that, I've changed it. You do not need to pipe anything into samtools index because its creating a .bai file. you could append a
samtools index s1.bam;
to the end of your commandHi Ying,
I was trying to follow the workflow described above but could not get it to work using bwa-mem.. Have you found a way to do this? samtools view complains about truncated input.
This would probably be best as a new question with more details, from the sounds of it though, maybe bwa did not finish or create a proper file
The
\
symbol is not needed if put after|
in Bash. Bash automatically knows the line continues if the last char is a|
.Thanks for the tip
This is great, however nowadays I would use bwa mem which can align paired ends data, and substitute the work of bwa aln and bwa sampe.
Hi All
I have a question. I wish to run bowtie over 3 cores and get an output of aligned sorted and indexed bam files.
The above command just foes till sorted how can I also include a index via the pipe such that I get a sorted aligned bam and also the index in the same pipe?
I look forward for your help.
Thanks,
Yaseen
Remove the second dash in the
sort
command