we did a 16S RNA analysis using MiSeq kit version 3. This kit makes paired-end reads with a size of 300pb. By merging both reads, we can cover 535 pb of region V3-V5.
As you can see in the following plot, qualities of reads are really bad after the nucleotid 250. Is it normal ? Perhaps the strategy prefers reads lenghs to the quality ? Did you get something similiar with the same kit ?
After merging with Flash
This is a "normal" (par for the course) observation with v.3 2x300 MiSeq kits. You will find many threads about this on SeqAnswers: http://seqanswers.com/forums/showthread.php?t=59558 , http://seqanswers.com/forums/showthread.php?t=73926
Thanks for posting the nice diagram of ribosomal variable regions!
If the q-scores can be trusted, this is actually not too bad. I mean, it's bad, but not unusable; I've seen far worse from 2x300 kits. You will get better quality from 2x250bp kits, but then you can't get the whole region. I'm not an expert on 16S so I can't offer advice on the best strategy or region, but overall this data looks promising.
I recommend that people sequence a known mock community comprised of known organisms with complete references to evaluate true positive/false positive/true negative/false negative rates of a methodology before employing it for real research. This would be less necessary if the quality scores were accurate, but they are not, and Illumina shows little interest in changing that (they've generally become less accurate over time).
You might be able to ascertain the true quality scores by assembling with a kmer longer than the length of the conserved regions, and mapping to the assembly. I;m not sure how well it will work, but it's worth a try. Something like this, using BBMap:
If assembly works, that will tell how accurate your bases actually are. I have not tried it before, though, with amplicon data. Also, "k=93" might need to be increased to cover the longest conserved region, presumably the gap between V4 and V5 based on the diagram. If everything is identical there, it would need to be increased; if they're just mostly similar, it won't. You can try assembling with different kmer lengths and see which one gives the best results; ideally, you want lots of >500bp contigs.
You could also try mapping to Silva, if these are mostly already-sequenced organisms...
I think Illumina sequencing is not famous for long read lengths. I am afraid 300 bp is too much for MiSeq protocol. You will be better off with smaller reads (100-150?)
The kit was purchased from illumina ! https://www.illumina.com/products/by-type/sequencing-kits/cluster-gen-sequencing-reagents/miseq-reagent-kit-v3.html
Yes they sell it, but I hear from users that it doesn't mean that the quality stays good at the end. Check out the posts @genomax2 pointed out.
Typical values for PE-300 runs. But the lowest median Q scores for the merged reads are still >20 (99% accurate). That's sufficient for most applications - is it not for yours?
But I lost 40% of my reads by merging and cleaning with Q>10 .
How did you lose reads by merging? Long as you have a good reference to align to having sub-optimal Q scores may not be a big deal.
I merged my paired-end data with Flash . https://ccb.jhu.edu/software/FLASH/
merged.fastq contains 500pb read length . But only 60% reads are preserved.
Losing 40% is surprising. The first graph indicates that the vast majority of Q<10 bases are in the overlap region (236-300). Merging those should produce Q>10 values for most of those bases. I haven't used Flash; how does it calculate Q scores of overlapping bases?
Flash try to align the overloap between the forward and the reverse. If the alignement score is too low, the read is ignored. Here is the output of flash trying to merge a pairs of fastq. Do you have any suggestion for another tools ? I already tested vsearch and illumina-utils. But both had worst results.
I've gotten good results using BBMerge from the BBMap package (guide available here).
I updated my merged plot. It was after low qualities filtering. Now the plot shows merged just after Flash