Question

Systematic errors at the end and beginning of reads in NGS panels

0

21 months ago

captainlabman ▴ 20

In data from Illumina sequencers, we see sequencing errors in a very large number of reads. They are most often located at the ends of reads and less often at the beginning, typically in the last or first nucleotide of a read. Why do we consider these errors and not genetic variants? Because they are absent at the same positions in overlapping reads. It is not clear where they come from: does the sequencer itself create them, or do they arise during preparation of the next-generation sequencing data? We use BWA as the aligner and Strelka2 as the variant caller. The quality metrics of the erroneous bases are very good, so they cannot be filtered out by quality. Please help me understand what this is and how to deal with it.
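
To put a number on "mostly at the ends", one option is to tabulate mismatches by their distance to the nearest read end. The following is a minimal sketch, not part of the original pipeline: it assumes ungapped alignments supplied as (read, reference) string pairs of equal length, and the function names and toy data are made up for illustration.

```python
from collections import Counter


def mismatch_offsets(read_seq, ref_seq):
    """0-based offsets where an ungapped aligned read differs from the reference."""
    return [
        i
        for i, (r, g) in enumerate(zip(read_seq, ref_seq))
        if r != g and r != "N"
    ]


def end_distance_profile(alignments):
    """Count mismatches by distance to the nearest read end (0 = first or last base).

    `alignments` is an iterable of (read_seq, ref_seq) pairs of equal length.
    Indels and soft clips are ignored in this sketch.
    """
    profile = Counter()
    for read_seq, ref_seq in alignments:
        n = len(read_seq)
        for i in mismatch_offsets(read_seq, ref_seq):
            profile[min(i, n - 1 - i)] += 1
    return profile


# Toy data (hypothetical): two reads differ only in the last base,
# the third differs in the middle.
toy = [
    ("ACGTACGTAC", "ACGTACGTAA"),
    ("TTGACCGTAG", "TTGACCGTAT"),
    ("GGCATTCGAA", "GGCAATCGAA"),
]
profile = end_distance_profile(toy)
for distance in sorted(profile):
    print(distance, profile[distance])
```

A profile that is dominated by distance 0 would support the impression that the mismatches are tied to read ends rather than to genomic positions.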

[Image not preserved]

sequencing NGS-panels • 2.9k views

ADD COMMENT • link updated 9 months ago by Ram 45k • written 21 months ago by captainlabman ▴ 20

1

The structure of the reads implies it's not shotgun sequencing. Maybe it's the amplicon primer sequence?

ADD REPLY • link 21 months ago by Asaf 10k

0

This should not be the amplicon primer sequence, since primer sequences are trimmed during data processing. To clarify, the panels are indeed amplicon-based.

ADD REPLY • link 21 months ago by captainlabman ▴ 20

0

Besides, only some of the samples contain these errors at the same position. Also, if these were remnants of primer sequences, the errors would be located exclusively at the beginning of the reads, but they are predominantly at the ends.

ADD REPLY • link 21 months ago by captainlabman ▴ 20

0

Can you share a diagram of the primers? Are you removing the sequencing primers or your amplicon primers? The amplicon primers should be nested within the sequencing primers.

ADD REPLY • link 21 months ago by Asaf 10k

0

If I am reading the image above correctly, the amplicons do not seem to cover the expected length (assuming the blue amplicon bar shows what the product is supposed to be). Perhaps some other location in the genome is producing the short products?

ADD REPLY • link 21 months ago by GenoMax 150k

0

I think I understand it now. They remove the amplicon primers from the 3' end, so the reads don't reach the end of the amplicon. And yes, it is weird that the errors are only on the plus-strand reads.

ADD REPLY • link 21 months ago by Asaf 10k

0

The reads indeed do not cover the full length of the amplicon. This is a limitation of the Illumina NextSeq sequencer, which can sequence reads up to 150 nucleotides long; that is the length of the reads shown.

ADD REPLY • link 21 months ago by captainlabman ▴ 20

0

Check whether, by any chance, the erroneous nucleotide is a G, since G is the no-signal base on the NextSeq.
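
One way to run that check is to tally the substitution types at the suspicious positions. This is a minimal sketch with hypothetical function names and toy input; it assumes you have already collected (reference base, read base) pairs, expressed on the strand the read was actually sequenced from (reverse-strand reads are stored reverse-complemented in SAM/BAM, so a sequenced G shows up there as C).

```python
from collections import Counter


def substitution_spectrum(mismatches):
    """Count (ref_base, read_base) substitutions.

    `mismatches` is an iterable of (ref_base, read_base) tuples taken from
    the suspicious positions, both in the orientation the read was sequenced in.
    """
    return Counter((ref.upper(), alt.upper()) for ref, alt in mismatches)


def fraction_called_as(spectrum, base):
    """Fraction of all counted substitutions whose read base equals `base`."""
    total = sum(spectrum.values())
    if total == 0:
        return 0.0
    hits = sum(count for (_, alt), count in spectrum.items() if alt == base)
    return hits / total


# Toy input (hypothetical), not taken from the data in this thread.
toy = [("A", "C"), ("A", "C"), ("T", "G"), ("A", "C")]
spectrum = substitution_spectrum(toy)
print(spectrum.most_common())
print(fraction_called_as(spectrum, "G"))
```

If nearly all erroneous calls were G, a no-signal artifact would be a plausible suspect; a spectrum dominated by another base points elsewhere.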

ADD REPLY • link 21 months ago by Asaf 10k

0

If I understand your question correctly, the erroneous variant in this case is A > C. We also use a MiSeq and have the same problems =)

ADD REPLY • link 21 months ago by captainlabman ▴ 20

1

Since you have overlapping reads, you should probably merge them before aligning. It looks like you have random errors in individual reads.

ADD REPLY • link 21 months ago by GenoMax 150k

0

Do you mean merging them into one FASTQ file?

ADD REPLY • link 21 months ago by captainlabman ▴ 20

0

Create a full-length representation of the amplicon from each pair of reads using a program like bbmerge.sh, FLASH, etc.

Guide for bbmerge: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmerge-guide/
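
To illustrate what "merging" means conceptually, here is a deliberately naive sketch. It is not how bbmerge.sh or FLASH work internally: it reverse-complements R2, looks for the longest suffix/prefix overlap within a mismatch tolerance, and simply keeps R1's bases in the overlap. The function names, `min_overlap`, `max_mismatch_frac`, and the toy amplicon are all assumptions made for this example.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T, N."""
    return seq.translate(COMPLEMENT)[::-1]


def find_overlap(r1, r2_rc, min_overlap=8, max_mismatch_frac=0.1):
    """Longest overlap where the end of r1 matches the start of r2_rc, or 0."""
    for length in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        tail, head = r1[-length:], r2_rc[:length]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * length:
            return length
    return 0


def merge_pair(r1, r2):
    """Merge a read pair into one sequence, or return None if no overlap is found.

    In the overlap this sketch keeps R1's bases; real tools use base qualities.
    """
    r2_rc = reverse_complement(r2)
    length = find_overlap(r1, r2_rc)
    if length == 0:
        return None
    return r1 + r2_rc[length:]


# Toy 30-nt amplicon (hypothetical) read as 2 x 20 nt with a 10-nt overlap.
amplicon = "ACGTTGCAAGCTTGACCGTAGGCTAACGAT"
r1 = amplicon[:20]
r2 = reverse_complement(amplicon[10:])
print(merge_pair(r1, r2) == amplicon)
```

For real data, use one of the dedicated tools named above, since they also handle base qualities, adapters, and ambiguous overlaps.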

ADD REPLY • link 21 months ago by GenoMax 150k

0

Thank you, I'll try it! If it's not too much trouble, could you explain how the nucleotide is chosen in this case, when the left read has one nucleotide at a position and the right read has a different one?

ADD REPLY • link 21 months ago by captainlabman ▴ 20

1

Please see the guide linked above. That should explain everything. You can follow the section at the end titled "Recommended command for optimal accuracy", since that is probably what you need.
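
As a rough illustration of the general idea only: the rule below is a generic quality-based scheme, not necessarily what BBMerge does internally, so the guide remains the authority on its actual behavior. In this sketch, agreeing bases reinforce each other, and on a disagreement the higher-quality call wins with a reduced quality. The function names and the cap of 41 are assumptions made for the example.

```python
MAX_QUAL = 41  # assumed cap for merged Phred scores in this sketch


def resolve_base(base1, qual1, base2, qual2):
    """Pick a consensus base and Phred quality for one overlapping position."""
    if base1 == base2:
        # Two independent agreeing observations: add qualities, up to the cap.
        return base1, min(qual1 + qual2, MAX_QUAL)
    if qual1 == qual2:
        # No basis for choosing either call.
        return "N", 0
    if qual1 > qual2:
        return base1, qual1 - qual2
    return base2, qual2 - qual1


def merge_overlap(seq1, quals1, seq2, quals2):
    """Resolve two equal-length overlapping segments given in the same orientation."""
    merged = [
        resolve_base(b1, q1, b2, q2)
        for b1, q1, b2, q2 in zip(seq1, quals1, seq2, quals2)
    ]
    return "".join(b for b, _ in merged), [q for _, q in merged]


# Toy overlap (hypothetical): the last position disagrees, A (Q36) vs C (Q12).
seq, quals = merge_overlap("ACGA", [30, 30, 30, 36], "ACGC", [30, 30, 30, 12])
print(seq, quals)
```

Note that when both calls have similarly high quality, a rule like this cannot pick a winner with any confidence, which matters if the erroneous bases really do have good quality scores.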

ADD REPLY • link 21 months ago by GenoMax 150k

0

I assume your MiSeq reads are longer. Can you verify that the paired reads have a different nucleotide at this position?

ADD REPLY • link 21 months ago by Asaf 10k

0

Do you mean we should sequence this particular sample on the MiSeq and see whether the variant is there? I didn't quite understand. Or should we check whether that variant occurs at the same location in other samples that have been sequenced on the MiSeq?

ADD REPLY • link 21 months ago by captainlabman ▴ 20

0

What I suggested is to find the reads that have the error in R1, look up their corresponding R2 reads, and confirm that the nucleotide there is different. If this is indeed the case, then it's very weird, especially if it's MiSeq and the quality is good. If it's not the case, then maybe it will help figure out the source of the error.
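
A minimal sketch of that check, under simplifying assumptions: each mate is given as an (alignment start, sequence) tuple, alignments are ungapped, and sequences are in reference orientation as they appear in the SAM/BAM SEQ field. The function names and toy pairs are hypothetical.

```python
def base_at(ref_pos, aln_start, seq):
    """Read base covering `ref_pos`, or None if the read does not cover it.

    Assumes an ungapped alignment whose first base sits at `aln_start`,
    in the same coordinate system as `ref_pos`.
    """
    offset = ref_pos - aln_start
    if 0 <= offset < len(seq):
        return seq[offset]
    return None


def discordant_pairs(pairs, ref_pos, error_base):
    """Names of pairs where R1 shows `error_base` at `ref_pos` and R2 covers
    the same position with a different base."""
    names = []
    for name, (start1, seq1), (start2, seq2) in pairs:
        b1 = base_at(ref_pos, start1, seq1)
        b2 = base_at(ref_pos, start2, seq2)
        if b1 == error_base and b2 is not None and b2 != error_base:
            names.append(name)
    return names


# Toy pairs (hypothetical): R1 ends at position 105 with a C in all three.
toy_pairs = [
    ("pair1", (100, "ACGTAC"), (103, "TAAGGT")),  # R2 has A at 105
    ("pair2", (100, "ACGTAC"), (103, "TACGGT")),  # R2 agrees with R1
    ("pair3", (100, "ACGTAC"), (110, "GGTTAA")),  # R2 does not cover 105
]
print(discordant_pairs(toy_pairs, 105, "C"))
```

Pairs where R2 does not cover the position are skipped rather than counted as agreeing, so the result only reflects positions that both mates actually observed.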

ADD REPLY • link 21 months ago by Asaf 10k

0

This data is from a NextSeq, as noted above.

ADD REPLY • link 21 months ago by GenoMax 150k

0

Yes, but he mentioned they have the same issue with MiSeq reads as well.

ADD REPLY • link 21 months ago by Asaf 10k