Question

TopHat and HTSeq-count: How Can I Improve the Mapping Percentage for Single-End Reads?

0

11.4 years ago

k.nirmalraman ★ 1.1k

Dear All,

This is probably a novice question.

I have 100 bp single-end reads from Illumina TruSeq sequencing. I am using the mouse genome build mm9 from the TopHat index and annotation downloads.

The library contains 16 million reads.

My tophat command is as follows:

tophat -p 4 -N 3 --read-gap-length 3 --read-edit-dist 3 --output-dir <path> <genome_path> path/to/input.fasta

HTSeq Count

python -m HTSeq.scripts.count -m intersection-nonempty -s no -i gene_id -t exon accepted_hits.sam /mm9/genes.gtf > counts.txt

After HTSeq Count

 no_feature              4973501
 ambiguous               125622
 too_low_aQual           0
 not_aligned             0
 alignment_not_unique    5620063

The HTSeq output has the above statistics. Is it normal to see such high numbers for no_feature and alignment_not_unique with single-end sequencing? Is there anything that can be done to improve these statistics?
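To put these counters in perspective, here is a minimal sketch that expresses each one as a percentage of the library. It assumes the stated library size of roughly 16 million reads as the denominator; htseq-count tallies alignment records rather than input reads, so the percentages are only approximate. The names `special_counts`, `LIBRARY_SIZE`, and `as_percentages` are made up for illustration.

```python
# Counters copied from the htseq-count summary in this post.
special_counts = {
    "no_feature": 4973501,
    "ambiguous": 125622,
    "too_low_aQual": 0,
    "not_aligned": 0,
    "alignment_not_unique": 5620063,
}

# Assumption: about 16 million input reads, as stated in the post.
# The true denominator (number of alignment records) may differ.
LIBRARY_SIZE = 16_000_000


def as_percentages(counts, total):
    """Return each counter as a percentage of `total`, rounded to one decimal."""
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


percentages = as_percentages(special_counts, LIBRARY_SIZE)
for name, pct in percentages.items():
    print(f"{name:<22}{pct:5.1f}%")
```

Under that assumption, no_feature is about 31% and alignment_not_unique about 35% of the library.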

As an extension to the above question: what happens if reads that do not align uniquely are counted for both genes in a differential expression analysis? Does this introduce any bias?

Thanks in advance!

htseq tophat rnaseq • 6.8k views

1

That's a pretty high level of no_feature for the mouse genome. Did you purify for anything unusual at some point? Having around 10% no_feature isn't unheard of, but over 25% is kind of over the top. You might want to look where some of those no_feature reads are aligning. Perhaps you have a bunch of DNA contamination or just really high amounts of pre-mRNAs?
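One way to see where the no_feature reads align is to tally them per chromosome. The sketch below assumes htseq-count was rerun with its `-o` (SAM output) option, which annotates each alignment with an `XF:Z:` tag naming the assigned feature or a special category such as `no_feature` (prefixed with `__` in newer versions). The function name and the example records are hypothetical.

```python
from collections import Counter


def no_feature_by_chromosome(sam_lines):
    """Count alignments tagged no_feature per reference name (SAM column 3)."""
    tally = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        # Optional tags start at column 12; look for the XF tag.
        xf_tags = [f for f in fields[11:] if f.startswith("XF:Z:")]
        # endswith() matches both "no_feature" and "__no_feature".
        if xf_tags and xf_tags[0].endswith("no_feature"):
            tally[fields[2]] += 1
    return tally


# Made-up records for illustration only.
example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\tXF:Z:__no_feature",
    "r2\t0\tchr1\t200\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\tXF:Z:Actb",
    "r3\t16\tchr1\t300\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\tXF:Z:__no_feature",
    "r4\t0\tchrM\t50\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\tXF:Z:no_feature",
]
result = no_feature_by_chromosome(example)
print(result.most_common())
```

On a real file, an open file handle can be passed in place of the example list. A strong skew toward one chromosome, or an even spread over introns and intergenic regions, would point to different explanations and is worth following up in a genome browser.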

11.4 years ago by Devon Ryan 105k

0

Just to agree with @dpryan, take a look at your data in a browser. You'll likely learn a lot, particularly if you are relatively new to these data.

11.4 years ago by Sean Davis 27k

0

I would definitely like to look into these reads in a genome browser. Thanks for that direction! :)

11.4 years ago by k.nirmalraman ★ 1.1k

0

There was no unusual purification method used.

11.4 years ago by k.nirmalraman ★ 1.1k

1

Maybe your question should be "Can I improve ..." instead of "How can I improve ...". Maybe single-end sequencing doesn't fit your study or sample very well.

11.4 years ago by Irsan ★ 7.8k

0

What's the experiment? Is it RNA-seq? What RNA sizes did you select for?

11.4 years ago by Jelena Aleksic ▴ 920

0

It is RNA-seq, and the selected size is 300 bp.

11.4 years ago by k.nirmalraman ★ 1.1k