Question

How To Handle Ns In The Middle Of Reads

10.9 years ago

kautilya ▴ 430

For my Illumina data, FastQC shows N's at positions 13, 14 and 15 of 101 bp reads. Cropping the first 15 bases with Trimmomatic solves the problem, but I lose a lot of data. If I keep the N's, what problems would they cause during alignment (BWA + Stampy) and variant calling (UnifiedGenotyper), and how can I handle those problems?

If anybody has faced a similar problem, how did you handle it?

Similar questions have been asked on other forums, but none has been answered.

I could not find a resource on how variant calling programs handle N's. Do they ignore them, or treat them as a variant with a low confidence score?

The per-base N content plot from FastQC is here: http://i43.tinypic.com/sfyz5z.jpg

fastqc qc • 3.6k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 10.9 years ago by kautilya ▴ 430

Shouldn't you first investigate why you got those weird Ns at these positions?

ADD REPLY • link 10.9 years ago by Manu Prestat 4.1k

They are possibly due to machine read errors during sequencing, and they appear in only 1 of my 3 runs. I am looking for a way to handle them without losing a lot of sequence data.

ADD REPLY • link 10.9 years ago by kautilya ▴ 430

Ram · Answer 1 · 2015-04-20

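If you plan to align with BWA and then call variants with GATK, I would use your reads as is. The Ns most likely have a base quality of 0 (or close to it), and GATK ignores bases below its minimum base-quality threshold when genotyping. BWA treats N bases in reads as mismatches (it is ambiguous bases in the reference that it replaces with random nucleotides), so three Ns in a 101 bp read cost a few mismatches, and spurious alignments caused by them will be rare.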
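Whether the Ns really carry a base quality near 0 can be checked directly from the FASTQ file. Below is a minimal sketch, assuming 4-line FASTQ records and Phred+33 quality encoding (both are assumptions about your data). The function name `n_quality_by_position` and the demo reads are made up for illustration.

```python
from collections import Counter, defaultdict


def n_quality_by_position(fastq_lines, phred_offset=33):
    """Tally the Phred qualities of N calls at each 1-based read position.

    Assumes strict 4-line FASTQ records and the given Phred offset.
    Returns {position: {phred_quality: count}}.
    """
    tallies = defaultdict(Counter)
    lines = iter(fastq_lines)
    for header in lines:
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = next(lines).rstrip("\n")
        next(lines)  # '+' separator line
        qual = next(lines).rstrip("\n")
        for pos, (base, q) in enumerate(zip(seq, qual), start=1):
            if base in "Nn":
                tallies[pos][ord(q) - phred_offset] += 1
    return {pos: dict(counts) for pos, counts in sorted(tallies.items())}


# Two made-up reads: '!' encodes Q0 and '#' encodes Q2 under Phred+33.
demo = [
    "@read1\n", "ACGTNNNACGT\n", "+\n", "IIII!!!IIII\n",
    "@read2\n", "ACGTNACGTAC\n", "+\n", "IIII#IIIIII\n",
]
print(n_quality_by_position(demo))
# {5: {0: 1, 2: 1}, 6: {0: 1}, 7: {0: 1}}
```

On real data you would pass an open file handle instead of the demo list, for example one returned by `gzip.open(path, "rt")` for a gzipped FASTQ. If the tallies at positions 13 to 15 are all at or near 0, the Ns fall below any sensible base-quality cutoff.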

The cause? If this is Illumina data, the reagents sometimes fail to reach the flow cell for a cycle or two because of pump problems or air bubbles, which leaves those cycles without a usable base call.