When sequencing on Illumina instruments, we see sequencing errors in a very large number of reads. Most often they are located at the ends of reads, less often at the beginning; typically they affect the last or the first nucleotide of a read. Why do we consider these errors and not genetic variants? Because they are absent at the same positions in overlapping reads. It is not clear where they come from: whether the sequencer itself creates them, or whether they are introduced somewhere in the preparation of the next-generation sequencing data. BWA is used as the aligner and Strelka2 as the variant caller. The quality metrics of the erroneous bases are very good, so they cannot be filtered by quality. Help me understand what this is and how to deal with it.
The structure of the reads implies it's not shotgun sequencing. Maybe it's the amplicon primer sequence?
This should not be the amplicon primer sequence, as these sequences are trimmed during data processing. To clarify, the panels are indeed obtained using amplification technology.
Besides, only some of the samples contain these errors at the same place. Also, if these were the remains of primer sequences, the errors would be located exclusively at the beginning of the reads, but they are predominantly at the ends.
Can you share the diagram of the primers? Are you removing the sequencing primers or your amplicon primers? The amplicon primers should be nested within the sequencing primers.
If I am reading the image above correctly, the amplicons do not seem to cover the expected length (if the blue amplicon bar is what the product is supposed to be). Perhaps there is some other location in the genome that is causing the short products?
I think I understood it now. They remove the amplicon primers from the 3' end so the reads don't reach the end of the amplicon. And yes, it is weird that the errors are only on the plus strand reads.
The reads indeed do not cover the full length of the amplicon. This is due to the way the Illumina NextSeq sequencer is run here: it sequences reads up to 150 nucleotides in length, which is the length of the reads shown.
See if, by any chance, the erroneous nucleotide is a G, as G is the no-signal base on NextSeq.
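A quick way to check this on the raw reads is to tally the last base of each read and measure the run of G at the 3' end. The following is only a minimal sketch; the function names are made up for illustration, and it assumes the read sequences have already been pulled out of the FASTQ as plain strings.

```python
from collections import Counter


def trailing_g_run(seq: str) -> int:
    """Length of the uninterrupted run of G at the 3' end of a read."""
    return len(seq) - len(seq.rstrip("G"))


def last_base_counts(reads):
    """Tally the final base of every non-empty read."""
    return Counter(seq[-1] for seq in reads if seq)


if __name__ == "__main__":
    # Toy reads, not real data.
    reads = ["ACGTACGTGGG", "ACGTACGTAC", "TTGACCA"]
    for r in reads:
        print(r, trailing_g_run(r))
    print(last_base_counts(reads))
```

If the last-base tally is dominated by G, or many reads end in long G runs, the no-signal explanation becomes more plausible; if the errors are mostly another base, it probably is not the cause.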
If I understand your request correctly, the erroneous genetic variant in this case is A > C. We also use MiSeq and have the same problems =)
Since you have overlapping reads, you should probably merge them before doing alignments. It looks like you have random errors in individual reads.
Do you mean in one fastq file?
Create a full-length representation of the amplicon from each pair of reads using a program like bbmerge.sh, FLASH, etc. Guide for bbmerge: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmerge-guide/
Thank you, I'll try it! If it's not too much trouble, can you explain how a nucleotide is selected in this case if the right read has one nucleotide and the left read has a different one at the same position?
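To make the question concrete: read mergers generally resolve a disagreement in the overlap using the base qualities. The sketch below shows one common, generic rule (keep the higher-quality base and lower its quality); it is an illustration only and is not claimed to be what bbmerge or FLASH actually implement. The function name and the quality cap of 41 are assumptions.

```python
def consensus_base(base1, qual1, base2, qual2, cap=41):
    """Resolve one overlapping position from R1 and R2.

    Illustrative rule only (not taken from any specific merger):
    - bases agree: keep the base, add the Phred qualities (capped);
    - bases differ: keep the higher-quality base, quality = difference;
    - bases differ with equal quality: call N with quality 0.
    """
    if base1 == base2:
        return base1, min(qual1 + qual2, cap)
    if qual1 == qual2:
        return "N", 0
    if qual1 > qual2:
        return base1, qual1 - qual2
    return base2, qual2 - qual1


if __name__ == "__main__":
    print(consensus_base("A", 35, "C", 10))  # higher-quality A is kept
    print(consensus_base("A", 30, "C", 30))  # tie, ambiguous call
```

Note that with a rule like this, an error that has a high quality score in one read and disagrees with an equally confident mate ends up as a low-quality or ambiguous base rather than a confident call, which is why merging can help here.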
Please see the guide linked above. That should explain everything. You can follow the section at the end titled "Recommended command for optimal accuracy", since you probably need/want that. I assume your MiSeq reads are longer; can you verify that the paired reads have a different nucleotide at this position?
Do you mean we should sequence this specific sample on MiSeq and see whether the variant is there? I didn't quite understand. Or should we check whether this genetic variant occurs at the same location in other samples that were sequenced on MiSeq?
What I suggested is to look for the reads that have the error on R1, find their corresponding R2, and check whether the nucleotide there is different. If this is indeed the case then it's very weird, especially if it's MiSeq and the quality is good. If it's not the case then maybe it'll help figure out the source of the error.
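Something along these lines could do that check on the aligned reads. This is a rough sketch working on plain SAM text lines (e.g. the output of samtools view for the region); the function names are made up, it uses only the standard library, and it assumes standard SAM columns and flags.

```python
import re
from collections import defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def base_at(ref_pos, start, cigar, seq):
    """Read base aligned to 1-based ref_pos, or None if not covered."""
    ref, query = start, 0
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":  # consumes reference and read
            if ref <= ref_pos < ref + length:
                return seq[query + (ref_pos - ref)]
            ref += length
            query += length
        elif op in "IS":  # consumes read only
            query += length
        elif op in "DN":  # consumes reference only
            if ref <= ref_pos < ref + length:
                return None
            ref += length
        # H and P consume neither
    return None


def mate_bases(sam_lines, chrom, ref_pos):
    """Map read name -> {'R1': base, 'R2': base} at chrom:ref_pos."""
    pairs = defaultdict(dict)
    for line in sam_lines:
        if line.startswith("@"):
            continue
        f = line.rstrip("\n").split("\t")
        name, flag, rname = f[0], int(f[1]), f[2]
        pos, cigar, seq = int(f[3]), f[5], f[9]
        # Skip unmapped (0x4), secondary (0x100), supplementary (0x800).
        if flag & 0x904 or rname != chrom:
            continue
        base = base_at(ref_pos, pos, cigar, seq)
        if base is not None:
            # Reads without the first-in-pair flag (0x40) are treated as R2.
            pairs[name]["R1" if flag & 0x40 else "R2"] = base
    return pairs


def r1_errors_with_r2(sam_lines, chrom, ref_pos, error_base):
    """For pairs whose R1 shows error_base, return name -> R2 base."""
    return {
        name: b["R2"]
        for name, b in mate_bases(sam_lines, chrom, ref_pos).items()
        if b.get("R1") == error_base and "R2" in b
    }
```

If most of the returned R2 bases match the reference rather than the suspect base, the mismatch is confined to one read of each pair, which points to an error rather than a real variant.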
This data is NextSeq as noted above.
Yes, but he mentioned they have the same issue with MiSeq reads as well.