Dear All,
Probably this is a novice question.
I have single-end 100 bp reads from Illumina TruSeq sequencing. I am using the mouse genome build mm9 from the TopHat index and annotation downloads.
The library contains 16 million reads.
My tophat command is as follows:
tophat -p 4 -N 3 --read-gap-length 3 --read-edit-dist 3 --output-dir <path> <genome_path> path/to/input.fasta
HTSeq Count
python -m HTSeq.scripts.count -m intersection-nonempty -s no -i gene_id -t exon accepted_hits.sam /mm9/genes.gtf > counts.txt
After HTSeq Count
no_feature 4973501
ambiguous 125622
too_low_aQual 0
not_aligned 0
alignment_not_unique 5620063
The HTSeq output has the above statistics. Is it normal to have numbers this high for no_feature and alignment_not_unique with single-end sequencing? Is there anything that can be done to improve these statistics?
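To put the counters above in proportion, here is a minimal sketch that expresses each one as a percentage of the library. It assumes the library is exactly 16,000,000 reads (the post only says "16 million"). Depending on the HTSeq version, alignment_not_unique may be counted per alignment record rather than per read, so treat that percentage as a rough guide only.

```python
# Special counters copied from the htseq-count output above.
special_counts = {
    "no_feature": 4973501,
    "ambiguous": 125622,
    "too_low_aQual": 0,
    "not_aligned": 0,
    "alignment_not_unique": 5620063,
}

# Assumption: the post says "16 million reads"; the exact total is not given.
TOTAL_READS = 16_000_000


def as_percentages(counts, total):
    """Return each counter as a percentage of `total`, rounded to one decimal."""
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


percentages = as_percentages(special_counts, TOTAL_READS)
for name, pct in percentages.items():
    print(f"{name}\t{pct}%")
```

Under that assumption, no_feature comes out at roughly 31% of the library.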
As an extension to the above question: what happens if reads that do not align uniquely are counted for both genes in a differential expression analysis? Does this cause any bias?
Thanks in advance!
That's a pretty high level of no_feature for the mouse genome. Did you purify for anything unusual at some point? Having around 10% no_feature isn't unheard of, but over 25% is kind of over the top. You might want to look where some of those no_feature reads are aligning. Perhaps you have a bunch of DNA contamination or just really high amounts of pre-mRNAs?
Just to agree with @dpryan, take a look at your data in a browser. You'll likely learn a lot, particularly if you are relatively new to these data.
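Besides a browser, a quick per-chromosome tally can hint at where the no_feature reads are landing. The sketch below assumes htseq-count was rerun with its -o/--samout option, which writes each alignment back out with an XF tag naming its assignment. The function name is made up for illustration, and the tag value is matched with or without leading underscores, since the exact spelling may differ between HTSeq versions.

```python
from collections import Counter


def no_feature_by_chrom(sam_lines):
    """Count alignments tagged no_feature per reference name (SAM column 3).

    `sam_lines` is any iterable of text lines, e.g. an open file handle on
    the SAM file written by htseq-count's -o/--samout option.
    """
    tally = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        for tag in fields[11:]:  # optional fields start at column 12
            if tag.startswith("XF:Z:"):
                # Accept both "__no_feature" and "no_feature".
                if tag[5:].lstrip("_") == "no_feature":
                    tally[fields[2]] += 1
                break
    return tally
```

Calling it on an open file handle and printing `tally.most_common()` lists the chromosomes with the most no_feature alignments first. A spread across all chromosomes would fit the DNA contamination or pre-mRNA ideas, while a pile-up on one or two references would point to something more specific.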
I would definitely like to look into these reads in a genome browser. Thanks for that direction! :)
There was no unusual purification method used.
Maybe your question should be "Can I improve ..." instead of "How can I improve". Maybe single-end sequencing doesn't fit your study or sample very well.
What's the experiment? Is it RNA-seq? What RNA sizes did you select for?
It is RNA-Seq, and the size selected is 300 bp!