Working at the Centre for Virus Research where researchers often do re-sequencing of viruses, I often get asked how to do reference assemblies and generate a consensus. This is a guide for those wanting to learn a few freely available command-line tools. I have used this tutorial to introduce several non-bioinformaticians to the command-line and have then gone on from there to teach them how to create bash scripts to automate their analyses for large numbers of datasets.
Here, I will assume that you are starting from paired-end reads in two fastq files (readset_R1.fastq
readset_R2.fastq
) and a reference in fasta format (ref.fa). I will show you how to use two different aligner (BWA and Stampy). This is not intended to be an extensive demonstration of all the different programs for cleaning reads and aligning them to a reference but an illustration of the tools that I have found most useful.
Quality Control
One of the first things you will want to do when you get a set of reads is check the quality of those reads. FastQC is a very useful tool for this with an intuitive graphical user interface. The following point are important to note when looking at the different tabs in FastQC as they may have implication for the downstream analyses:
- In Basic Statistics check the Encoding. This is important as some software works with different versions of Illumina Encoding and require conversion.
- The per base sequence quality is probably the most important tab to check as it tells you how good your reads are overall. It will provide you with information on how many bases to trim from the 5'- or 3'-end and will give you an idea of the overall quality and quality threshold to use for trimming.
- The per base sequence content tells you whether you might still have primers or adaptors based on the non-random nucleotide content (zigzag at the beginning).
- The latter in combination with the overrepresented sequences tab, will help you to determine whether you need any particular trimming. In this table, you usually find the sequence of your adaptors.
Trimming
Using trim_galore will help clean the sequence but you may need to change some of the default settings depending on the particular library prep you have used. You can check in FastQC afterwards to determine whether the patterns have improved. trim_galore produces sequences of heterogenous lengths. It keeps by default reads above 20bp, I usually find this a bit to short so increase it to 80. It also uses a default quality Phred score of 20.
trim_galore --paired --length 80 readset_R1.fastq readset_R2.fastq
The output names will have changed with _val_1.fq
and _val_2.fq
I rarely find that the sequences need any further trimming but sometimes it might be useful to use PRINSEQ to trim the 5' end. PRINSEQ deals with the pairs separately, so there is a risk that the order of the reads may go out of sync and this could cause trouble downstream for some aligners that expect the reads in the two files to be identical and in the same order.
prinseq -fastq readset_R1_val_1.fq -trim_left 20
prinseq -fastq readset_R2_val_2.fq -trim_left 20
The output files will have a _prinseq_good_xxxx.fastq
added.
The names of files start getting very long so sometimes it is useful to rename files to shorten and simplify them:
mv readset_R1_val_1_prinseq_good_yW3o.fastq readset_clean_R1.fq
mv readset_R2_val_2_prinseq_good_BkEF.fastq readset_clean_R2.fq
BWA assembly
Now that you have good quality reads, you can prepare the reference files for indexing with BWA. At this stage, you might want to check that the names of your sequences in the fasta reference file make sense, as this will be the name given to your consensus sequences later. Creating an index/database of your reference (this speeds up the mapping).
bwa index -a is ref.fa
Paired-end reads are aligned separately and then combined:
bwa aln ref readset_clean_R1.fq > readset_R1.sai
bwa aln ref readset_clean_R2.fq > readset_R2.sai
Then you need to combine the information from the two files:
bwa sampe ref readset_R1.sai readset_R2.sai readset_clean_R1.fq readset_clean_R2.fq > readset_ref_bwa.sam
The next steps involve creating a bam files which is a binary sam file (takes less space). Samtools is a critical tool in the arsenal for dealing with the alignment file (sam file).
samtools view -S readset_ref_bwa.sam -b -o readset_ref_bwa.bam
or to remove unmapped reads at this stage
samtools view -F 4 -Sbh readset_ref_bwa.sam > readset_ref_bwa.bam
You can open up the bam file mapped onto the reference in tablet and visualize the alignment but first you need to sort it and create and index of the bam file
samtools sort readset_ref_bwa.bam readset_ref_bwa
samtools index readset_ref_bwa.bam
If you want to export the unmapped reads, I use bam2fastq. The # is needed when dealing with paired-end to export the unmapped reads from the bam assembly
bam2fastq --no-aligned --force --strict -o readset_ref_unmapped#.fq readset_ref_bwa.bam
Create the consensus. BEWARE samtools has changed a lot and these commands no longer work. The "bcftools call" command that replaces it produces unexpected results.
samtools mpileup -uf ref.fa readset_ref_bwa.bam | bcftools view -cg - | vcfutils.pl vcf2fq > readset_ref_bwa_cons.fq
Increase sensitivity of samtools using -E
especially when you have multiple nucleotide polymorphisms, for example if you have virus intra-host diversity
samtools mpileup -E -uf ref.fa readset_ref_bwa.bam | bcftools view -cg - | vcfutils.pl vcf2fq > readset_ref_bwa_E_cons.fq
Convert the fastq format to fasta using seqret from emboss
seqret -osformat fasta readset_ref_bwa_E_cons.fq -out2 readset_ref_bwa_E_cons.fa
Stampy assembly
Stampy is another alignment tool which allows variation (particularly insertion/deletions) and thus can be used if you do not have a closely related reference. Create stampy reference and index:
stampy -G ref ref.fa
Hash table:
stampy -g ref -H ref
Stampy does mapping of paired-end reads together and you can set the divergence for mapping to a foreign reference with --substitutionrate option (default 0.001):
stampy -g ref -h ref -M readset_clean_R1.fq,readset_clean_R2.fq -o readset_stampy.sam -f sam --substitutionrate=0.05
Create bam file and consensus as before:
samtools view -S readset_stampy.sam -b -o readset_stampy.bam
samtools sort readset_stampy.bam readset_stampy
samtools index readset_stampy.bam
bam2fastq --no-aligned --force --strict -o readset_stampy_unmapped#.fq readset_stampy.bam
Be careful with the following commands as samtools and bcftools has changed a lot since I wrote this tutorial:
samtools mpileup -E -uf ref.fa readset_stampy.bam | bcftools view -cg - | vcfutils.pl vcf2fq > readset_ref_stampy_E_cons.fq
seqret -osformat fasta readset_ref_stampy_E_cons.fq -out2 readset_ref_stampy_E_cons.fa
Once my students are familiar with using these command lines, I start teaching them how to put the commands together in a bash script to create an automated pipeline. Examples of these will come in a later tutorial.
Thank you for the useful information. I would like to ask how do we do a quality control analysis of the mapped reads to the reference genome. Once the reads are mapped I would like to know if we can check the number of reads that mapped to the reference genome both the individual bam file that got mapped also the combined bam file that got mapped to the ref genome. Is there any specific tool to do that? I am aware of FastQC but I will be working in my cluster so any suggestions which I can use to find the number of mapped reads with command line script that can be used on the mapped bam files to assess the number of reads that mapped. Please let me know if you can suggest anything.
Thanks. Very helpful for me. I am now going from SRA to variant calling, I have to do quality control for each short reads data file from SRA. If you have more of your expertise, please share here. +1
Thank you for the tutorial. However, I am unable to follow after samtools sort possibly due to different version. I am relatively new to bioinformatics, previously done de novo genome assembly but now would like to assemble based on a reference to compare, is it possible for you to update your tutorial? Or maybe point me to a better one for reference-based assemblies? My data is paired-end reads of a bacteria genome. Thanks in advance!