This is somewhat of a follow-up from a previous post (https://www.biostars.org/p/395813/). I was recently given 4 paired-end .fastq files (where each read has about 150 bases), each from a different strain of bacteria. My assignment is to 1) assemble each strain into its full circle and 2) perform a pairwise comparison between the four strains to discover insertions, deletions, and duplications between them.
Based on suggestions from the previous post, I tentatively have the following pipeline in mind: 1) Run fastqc, 2) Create contigs using SPAdes, 3) Fully align contigs into their "full circle or full linear alignment" using MUMmer and examine for insertions, deletions, and duplications between the files.
The fastq output looks good. So, I ran spades.py -k 21,33,55,77 --careful -1 read1.fastq -2 read2.fastq
on each sample. I then used R to examine the contigs.fasta files in the main output directory (I think this means the output from using a kmer of 77). Now, I am really lost trying to determine if the output looks decent (i.e. ready for input to MUMmer for comparison between the sequences). Here were my findings:
1) Across the four samples, the number of contigs ranged between 40 and 100.
2) The 5-number-summary of the contig lengths were as follows:
## Sample Min Q1 Med Q3 Max
## 1 Sample 1 78 155.0 520.5 14542.5 199284
## 2 Sample 2 78 237.5 277.0 8684.5 498771
## 3 Sample 3 78 247.0 363.0 34922.0 364001
## 4 Sample 4 78 240.5 268.0 1889.0 364018
3) The 5-number summary of the contig coverage were as follows:
## Sample Min Q1 Med Q3 Max
## 1 Sample 1 0.253165 15.477241 19.47476 33.14796 847
## 2 Sample 2 0.415254 0.805012 20.91779 35.29753 771
## 3 Sample 3 0.770833 0.907975 28.25000 33.77679 1120
## 4 Sample 4 0.544141 0.782492 0.96732 46.50450 1262
My first question is: Do any of these values seem worrisome? (i.e. the large range in the number of contigs? or, certain contig coverage being less than 1? or, discrepancies in the contig coverage between samples, especially Sample 4 having a Q3 for contig coverage less than 1? etc.?)
My second question is: related to the MUMmer process (where I hope to identify differences between the sequences). I have read the MUMmer vignette a few times but am still unclear about:
A) Whether a contig.fasta file is appropriate as input (as each contig.fasta file contains 40-100 contigs in my case, rather than an "fully assembled genome" - assuming there is a difference)?
B) Whether it may be better simply to do the entire assembly process in MUMmer? Rather than align the contigs in SPAdes and then import into MUMmer for an additional alignment?
C) How I can determine how long my sequences are? For instance, MUMmer suggests only using run-mummer1 on small sequences (which they define as <10Mbp). Right now, all I have are contigs. Is it possible for me to know the sequence lengths for the 4 samples without them being fully aligned (assuming a set of contigs is not "full alignment")?
D) How I can determine how similar the sequences are? This can help me decide whether to use NUCmer, PROmer, run-mummer1, or run-mummer3. For now, as a default, I plan to simply run all.
Suggestions to any of these topics would be greatly appreciated. I am having difficult moving forward possibly due to preconceived misunderstandings about the difference between a set of contigs and a fully-aligned genome, a lack of me finding resources demonstrating the transition between SPAdes to MUMmer, and a lack of my understanding of determining the degree of similarity between these contig files.
Thank you for your helpful response @Joe. My advisor is looking into finding out more about the samples but believes they are likely from the same strain. As a result, I blasted some of the contigs from the sample and downloaded a possible reference genome (I'll call it h.fasta here).
I tried/researched several of your advice here.
1) I ran Quast on the contigs output from SPAdes and indeed, as you said, it looks like contig reordering is necessary. So, I wanted to use Mauve software to reorder the contigs against the h.fasta reference file. I tried to do that following these instructions (http://darlinglab.org/mauve/user-guide/reordering.html) by inputing the h.fasta reference file and the SPAdes output (contigs.fasta) file. Unfortunately, I received an error:
2) I guess at this point, I just want to see whether this reference h.fasta file seems legitimate for each of the four samples. I decided to then use BWA as follows:
Do you know of a quick way for me to check from these mapping results, whether the reference h.fasta may be appropriate for all four of these samples? It seems that QUAST also accepts .bed files and that I can continue BWA --> .SAM pathway above to also create a .bed file (http://seqanswers.com/forums/showthread.php?t=76018). So, should I create .bed files (BWA --> .sam --> .bam --> .bed) and then use that as input to QUAST? If not, what is a simple way for me to check the alignment quality of these four samples with the reference?
Hopefully, once I have that figured out, I can also try VarScan as you suggested or the viewer in Muave (which you also referred to). I apologize again if my questions are basic. I feel that, even after reading vignettes, I still stumble upon how to connect different software and pipelines. Thank you for any suggestions.
For point 1), I’m not sure what’s happening with that error. I’m not that great with Java, so it could be anything AFAIK. Are you using a recent version? (Mauve isn’t my favourite tool, but I don’t know a better contig mover personally).
For 2), I would simply check all the mapping stats with Qualimap. If you have high numbers of mapped reads (95%+ or similar), then its likely a good reference. Quast is more for assembly analysis, Qualimap is for mapping quality (which you just need an indexed & sorted SAM/BAM for).
You may need to permit BWA to have some relaxed mapping settings if you’re trying to identify variants (allow one or more mismatches etc). This is really a trial and error process though, so I would just experiment and see what the numbers look like.