Question

Help Me Understand Why So Many Reads Do Not Align!

3

Entering edit mode

13.6 years ago

Pfs ▴ 30

I have data from 3 Illumina lanes (same tissue, different conditions). The number of reads obtained is comparable. Yet, when I align the reads to the reference genome, one of the lane has 1/3 of the alignments.

How can I explain this? Why so many reads don't make it to the tophat output even though they are all good quality reads (Q>30)?

Originally I have 115,022,942 reads x 2 (paired)

This is the flagstat output:

35,947,783 mapped (100%)
35,947,783 paired in sequencing (100%)
28,432,680 properly paired (79.09%)
28,791,584 with itself and mate mapped
7,156,199 singletons (19.91%)

tophat alignment read • 9.1k views

ADD COMMENT • link updated 9.1 years ago by Biostar 20 • written 13.6 years ago by Pfs ▴ 30

1

Entering edit mode

it might help if you add the flagstat output for one of the lanes that mapped better. And you could run FastQC on your fastq input files to see if there's something wrong with the 3rd lane.

ADD REPLY • link 13.6 years ago by brentp 24k

1

Entering edit mode

You might want to clarify the question a bit, my first reaction was that of course each of the three lanes would have 1/3 of the alignments :-)

ADD REPLY • link 13.6 years ago by Ketil 4.1k

0

Entering edit mode

@Ketil: I meant one lane has 33% of the its original reads aligned. The other two have close to 100% of their reads aligned.

ADD REPLY • link 13.6 years ago by Pfs ▴ 580

0

Entering edit mode

We also have this problem. Only half of the reads can map to the genome. But I have no idea what's going wrong. I will follow this discussion.

ADD REPLY • link 13.6 years ago by Dejian ★ 1.3k

score 8 · Answer 1 · 2011-04-06

8

Entering edit mode

13.6 years ago

Gareth Palidwor ★ 1.6k

You may have contamination: take a bunch of the good reads that don't align and blast them against NCBI nr to see what you find.

You should always run general diagnostic analysis like FastQC on your data as well as it will give you a sense of what is anomalous about runs you are having difficulty with.

ADD COMMENT • link 13.6 years ago by Gareth Palidwor ★ 1.6k

score 2 · Answer 2 · 2011-04-07

2

Entering edit mode

13.6 years ago

Ketil 4.1k

We have recently used Fasteris to do our Illumina sequencing, and they provide a nice report with each run, which includes the number of recovered controls. This number seems to correlate really nicely with the number of reads that align, and is usually a few percent off the total.

Since Illumina is fairly robust technology at this point, I'm with gawp here - you probably have some kind of contamination in your sample. Since you seem to be doing RNAseq, you should be able to find some matches against UniProt or NR.

That said, some other things to check:

Illumina runs vary quite a bit in the number of reads produced, you are talking percentages here, and not just raw numbers? Of course, a 5GB Illumina run will give you half the alignments of a 10GB run...
Read duplication still seems to be an issue. If your library prep ended up with too little DNA (or RNA here, I guess?), chances are many reads are duplicated. Make sure that these aren't filtered silently out at some stage, or that you are in fact counting aligned reads, and not, say, expressed genes.

ADD COMMENT • link 13.6 years ago by Ketil 4.1k

0

Entering edit mode

@Ketil: maybe I am misunderstanding what you mean here, but my concern is not on the number of reads produced by the Illumina Sequencer. My concern is that out of 100 million reads in one particular lane, only 30 million align to the reference genome.

ADD REPLY • link 13.6 years ago by Pfs ▴ 580

0

Entering edit mode

Nevermind. I think I get now what you meant.

ADD REPLY • link 13.6 years ago by Pfs ▴ 580

0

Entering edit mode

Right, it just wasn't entirely clear. As I said, it's probably contamination.

ADD REPLY • link 13.6 years ago by Ketil 4.1k

score 1 · Answer 3 · 2011-04-06

1

Entering edit mode

13.6 years ago

Chris Evelo 10k

The first thing I can think of is that that 3rd sample actually did not contain what you thought it contained. Could be highly polluted with other DNA, but with these percentage actually swapping a sample might be more likely.

ADD COMMENT • link 13.6 years ago by Chris Evelo 10k

score 1 · Answer 4 · 2011-04-07

1

Entering edit mode

13.6 years ago

Pfs ▴ 580

Thank you all for the suggestions.

I run FASTQC and among other things, I see a lot of duplicates (>80% of the reads are duplicates).

Does it make any sense remove the duplicates and work with what is left (~20 million reads) or the entire run is compromised?

Also, other lanes seem to have similar levels of duplications, yet they align well with the reference genome. Are these two things uncorrelated?

Thanks!

ADD COMMENT • link 13.6 years ago by Pfs ▴ 580

1

Entering edit mode

If you are doing RNAseq without normalization, it's perhaps not unreasonable to have a lot of ribosomal RNA. You should look at what genes the reads are from - after all difference in expression is half the point :-) But yes, they should all align to the genome.

ADD REPLY • link 13.6 years ago by Ketil 4.1k

0

Entering edit mode

Beware FASTQC dulicate estimates, they can be very inaccurate. They're based on a sample of the reads (from docs: "only sequences which occur in the first 200,000 sequences in each file are analysed"). This gives particularly bad estimates if the files are non-randomly distributed (like being sorted). I've found it better to determine duplicate levels by other means than FastQC.

ADD REPLY • link 13.6 years ago by Gareth Palidwor ★ 1.6k