Edit 24 Oct 2023: I use Nextflow in my projects. This post is just a curiosity and should be treated as pure experimentation. Take a look at the great advice from experienced users below.
This is a tutorial about creating a sequence analysis pipeline in a single line.
It was made with capture/amplicon short-read sequencing of human DNA in mind and tested with the reference exome sequencing data described here.
I share the process and the debugging steps I went through while putting it together.
The purpose of this exercise is to investigate shell pipes; it is purely educational.
Source is available at: https://github.com/barslmn/ngsoneliner/
I couldn't make a longer post here; the complete version of this post is at: https://omics.sbs/blog/NGSoneliner/NGSoneliner.html
Pipeline
# fastp --in1 "$R1" --in2 "$R2" --stdout --html $OUTPUT.fastp.html --json $OUTPUT.fastp.json 2>$OUTPUT.$$.fastp.log |
bwa mem -t "$THREADS" -R "@RG\tID:$SAMPLE\tSM:$SAMPLE\tPL:illumina\tLB:lib1\tPU:foo" "$REFERENCE" "$R1" "$R2" 2>"$OUTPUT.$$.bwa.log" |
# pv -cN bwa -s "$(gzip -l "$R1" "$R2" | awk '{print $2}' | tail -n1)" |
samtools collate -@ "$THREADS" -O - |
samtools fixmate -@ "$THREADS" -m - - |
samtools sort -@ "$THREADS" - 2>"$OUTPUT.$$.samtools.log" |
# pv -cN samtools_sort |
([ "$AMPLICON" = "NO" ] && samtools markdup -@ "$THREADS" - - || cat) |
# pv -cN samtools_markdup |
samtools view -C -T "$REFERENCE" - |
tee "$OUTPUT.cram" |
bcftools mpileup --threads "$THREADS" -Ou -A -T "$TARGET" -d 10000 -L 10000 -a "FORMAT/AD,FORMAT/DP" -f "$REFERENCE" - 2>>"$OUTPUT.$$.bcftools.log" |
# pv -cN bcftools_mpileup |
bcftools call --threads "$THREADS" -Ou --ploidy "$ASSEMBLY" -mv |
# pv -cN bcftools_call |
bcftools view -i 'FORMAT/DP>5&&QUAL>5' |
bcftools norm --threads "$THREADS" -Ou -m-any --check-ref w -f "$REFERENCE" 2>>"$OUTPUT.$$.bcftools.log" |
bcftools +fill-tags -Ou -- -t all 2>>"$OUTPUT.$$.bcftools.log" |
bcftools +setGT -Ou -- -t q -n c:'0/1' -i 'VAF>=.1' 2>>"$OUTPUT.$$.bcftools.log" |
bcftools +setGT -Ov -- -t q -n c:'1/1' -i 'VAF>=.75' 2>>"$OUTPUT.$$.bcftools.log" |
/home/bar/ensembl-vep/vep --everything --force_overwrite --vcf --pick --format vcf \
--fork $THREADS \
--stats_file "$OUTPUT"_summary.html \
--warning_file "$OUTPUT"_warnings.txt \
--output_file STDOUT --cache 2>"$OUTPUT.$$.vep.log" |
# pv -cN vep |
bcftools +split-vep -c SYMBOL,gnomADg_AF:Float,IMPACT,Existing_variation 2>>"$OUTPUT.$$.bcftools.log" |
bcftools filter --threads "$THREADS" -Ou -m+ -s 'onTarget' -M "$TARGET" |
bcftools filter --threads "$THREADS" -Ou -m+ -s 'lowQual' -g3 -G10 -e 'FORMAT/DP<=15 || QUAL<=20' |
bcftools filter --threads "$THREADS" -Ou -m+ -s 'highFreq' -e 'gnomADg_AF>0.001' |
bcftools filter --threads "$THREADS" -Ou -m+ -s 'lowIMPACT' -i 'IMPACT~"HIGH" || IMPACT~"MODERATE"' |
bcftools filter --threads "$THREADS" -Ou -m+ -s 'HOMrare' -e 'GT="1/1" && (gnomADg_AF <= 0.001 || (Existing_variation="." && gnomADg_AF="." && ID="."))' |
bcftools filter --threads "$THREADS" -Ob -m+ -s 'HETnovel' -e 'GT="0/1" && Existing_variation="." && gnomADg_AF="." && ID="."' |
tee "$OUTPUT.bcf" |
bcftools +split-vep -f "$columns" -d -i 'CANONICAL~"YES"' 2>>"$OUTPUT.$$.bcftools.log" |
awk -v header="$header" 'BEGIN {printf header} 1' |
gzip -c >"$OUTPUT.tsv.gz"
Breakdown
Preprocessing fastq
fastp with the --stdout parameter writes interleaved output. We are just using the default operations. Because it produced different results, it is commented out and not used.
Alignment
bwa can take interleaved input from stdin with the `-p` option; this was used to read fastp's stdout. Instead, we start from bwa, giving both fastq files as inputs.
Processing aligned reads
The samtools collate, fixmate, sort, and markdup subcommands are used. We use samtools to mark duplicate reads (http://www.htslib.org/doc/samtools-markdup.html). markdup is guarded by a conditional check on a variable named AMPLICON, which, if true, skips markdup and just cats the alignments through.
Writing out the alignment file
samtools view is used to convert the BAM stream to CRAM. We use tee to write out the CRAM file while also passing the stream down the pipeline.
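This tee-in-the-middle trick can be sketched on a toy stream (the file name `copy.txt` is just for illustration):

```shell
# tee writes a copy of the stream to disk while the pipe keeps flowing,
# the same way the pipeline persists "$OUTPUT.cram" mid-pipe.
printf 'A\nB\nC\n' |
    tee copy.txt |
    wc -l
# prints 3; copy.txt now holds the same three lines wc counted
```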
Variant calling
mpileup has the -d 10000 -L 10000 parameters in case of high-depth amplicon sequencing. Beware the 1.3.4 commands before the call.
Post processing variants
Check out this section to understand the rationale behind this view command. Variant normalization is performed by bcftools norm. In the fill-tags command, the specific tag I need is the variant allele fraction (VAF). We later use the VAF in the setGT heterozygous and setGT homozygous commands.
We are using -Ov before the VEP command because I had a problem with VEP reading compressed input from stdin: https://github.com/Ensembl/ensembl-vep/issues/1502
Annotating variants
A bare-bones vep command. It brings in a lot of annotation with the --everything flag.
Filtering variants
Here we use split-vep and add some of the columns we're interested in to the INFO column before filtering. The filtering expressions depend on what we're trying to achieve; I just put in the ones I like.
Here no variant is excluded, but the tags we give with the -s parameters are appended to the FILTER column. After the filtering we write out the BCF with the tee command.
Writing out a tsv file
We define the columns we want in the final table in the columns variable and use split-vep to format the output into a table. We add these column names as the first line with awk after formatting them as a header. We can then use gzip to keep the output compressed.
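A minimal, self-contained sketch of this header-plus-gzip step (the header and data row below are stand-ins for the real `$header` and split-vep output):

```shell
# awk -v expands \t and \n in the assignment, so `printf header` emits a
# ready-made header line; the bare '1' then passes the data rows through.
header="CHROM\tPOS\tREF\tALT\n"
printf '1\t100\tA\tT\n' |
    awk -v header="$header" 'BEGIN {printf header} 1' |
    gzip -c >table.tsv.gz
gzip -dc table.tsv.gz   # header line followed by the data row
```

Note that awk's `printf` treats the header as a format string, so a literal `%` in a real header would need escaping as `%%`.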
Results and highlights
As a result, this report is created with multiqc. It would be better to make a hard-filtered VCF and create the report from that, but since this test already has a small number of variants, it wasn't required.
- Test the command by running it multiple times.
- Count the data in intermediary steps (md5sum, mapped and paired reads, number of variants etc.).
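The first check can be sketched with a stand-in pipeline (`run_pipeline` here is hypothetical; for the real thing, compare md5sums of the CRAM/BCF plus counts from `samtools flagstat` and `bcftools stats` between runs):

```shell
# Run the stand-in pipeline twice and compare checksums of the output
# stream; identical sums mean the run is reproducible.
run_pipeline() { printf 'C\nA\nB\n' | sort; }
sum1=$(run_pipeline | md5sum | cut -d' ' -f1)
sum2=$(run_pipeline | md5sum | cut -d' ' -f1)
[ "$sum1" = "$sum2" ] && echo "outputs match"
```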
I would be wary of this for a few reasons.
If you actually wanted to call variants quickly in one line, you would get better results faster doing something like this:
bbmap.sh in=preprocessed_reads.fq ref=ref.fa out=mapped.sam nodisk; callvariants.sh in=mapped.sam ref=ref.fa out=vars.vcf
Of course you can pipe those too if you want, but it just wastes memory without actually running substantially faster.
Thanks a lot for your feedback. The main idea is to see how far I can push shell pipes and to show some approaches to problems as they arise.
Yes, it can be fragile, and I spent more time debugging this than I'd like to admit. Here is one such case.
I didn't compare the memory usage to running the commands separately, but I keep track of CPU and memory usage and create a plot (example here). It didn't stand out as using too much memory. Would it be using memory that isn't tracked by the processes' memory usage?
Yes, there are a lot of specific filters and commands, setGT for instance. This is certainly not something one can copy, paste, and use as is.
I only looked into running time when something went wrong and it took far longer than it was supposed to. The main concern was not making it slower than running the tools in distinct steps. Multiple stages are one of the first concerns and need extra attention, which is discussed here.
We can write out any intermediate results with tee commands, which is also useful for debugging purposes. If there is any speed gain, I guess it would come here, since the variant calling starts while the alignment file is still being written to disk.
I have seen bbmap a lot but unfortunately haven't had the chance to try it out. I would like to use it in a future project.
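One hedged sketch of that debugging pattern: tee into a process substitution (bash-specific) to count records passing a stage without breaking the stream. The file name is illustrative.

```shell
# Count lines flowing past this point while the pipe continues downstream.
printf 'A\nB\nC\n' |
    tee >(wc -l >stage1.count) |
    tr 'A-Z' 'a-z'
```

In the real pipeline the same idea works with `samtools view -c` or `bcftools stats` inside the process substitution.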
That's all you need. You're caught up in sunken costs. Abandon ship and learn workflow stuff like snakemake or Nextflow. This script is bad and it is wasting your time.
As we can with custom file descriptors. That is not the point. Unix has a simple philosophy of each tool doing one thing really well, and your "pipe" ruins that philosophy.
I just had an idea that seemed interesting and saw it through to the end. I use and recommend workflow managers in projects.
While I applaud your contribution, this line does not tell the user what kind of "sequence analysis" is being done. "NGS one-liner to call variants" may be a more apt title? Certainly a strong code golf candidate (after looking at the longer post linked above). That observation about fastp is also strange: in what way are the three outputs different? Do you get different read numbers? The order of reads would probably differ if you use more than one thread.
Thanks a lot for your feedback.
The purpose of the exercise is to see how much I can build using just shell pipes. I did show this to someone who has some familiarity with Linux and bioinformatics but not with building pipelines. However, our study was not about calling variants, and we only use the alignment part, not the rest. So even though I don't suggest using this as a whole, parts of it can be of use.
Absolutely, it should have been made clear; that was an oversight. It was made with capture/amplicon short-read sequencing of human DNA in mind and tested with the reference exome sequencing data described here.
I had gone through some different titles and went with a catchy one, but being a bit more descriptive would be more helpful, and I will change the title.
Yes, threading was the first thing I checked, but I was running it single-threaded. fastp fixed its earlier randomness problems with threading. I haven't investigated further, since the scope of the post was getting the pipeline running.
I think people should start chaining commands together in shell scripts, then progress to something like snakemake. This approach is a giant step backwards - Brian has raised a number of valuable points; I'll give you my gut reaction:
No, please. Noooooooooooo! You're a respected user on the forum; please don't guide people down a bad path. Beginners will copy-paste your code with no idea of `set -o pipefail` and say "didn't work" when the command waits for completion on a long step. Things you can pipe are like `samtools sort - | samtools view -C -T ... -o xyz.cram` to go from unsorted BAM to CRAM. The pipe you've written is too intense to be a single command.
It certainly had a surprising effect and started a discussion I learned from. You're correct about misguiding beginners; I have taken your advice, changed the post type to blog, and added a disclaimer at the top.
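The pipefail point above is easy to demonstrate on a toy pipe: by default, a pipeline's exit status is that of its last command, so an early failing stage (a crashing bwa, say) is silently masked.

```shell
false | true
echo "default:  $?"   # 0 -- the failure is masked
set -o pipefail
false | true
echo "pipefail: $?"   # 1 -- the failure propagates
```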
That's a nice disclaimer - this script is a nice plaything for people who know how to get stuff done, but it def is a quagmire for beginners.