Hello!
I am learning how to do differential expression. I was given transcriptome data and I am fortunate that the sample already has an established annotated genome. So far, I aligned the transcripts to the reference genome via STAR and I've had good mapping stats (>95%). Now, I'm quantifying the counts with HTSeq-Count. I expected it to be a relatively smooth process, but the results have been bugging me.
Here is the line I used:
htseq-count -f bam -r pos -t exon $RNA/AN0_1.Aligned.sortedByCoord.out.bam $GEN > AN0_1.no.exon.counts
The results:
__no_feature 14805713
__ambiguous 4766268
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 113292
I feel like that's a high value for no_features right?
I've tried playing around with the strandedness (-s), but all options give similar results. I also went back and made sure the GTF file has the gene_id attribute. I also tried using a GFF file instead.
Might it also have to do with this warning sign from the err file?
Warning: Mate pairing was ambiguous for 23164 records; mate key for first such record: ('A00564:87:HJKFCDSXX:2:2205:13838:9330', 'first', 'CP059858.1', 91661, 'CP059858.1', 91831, 317).
Any help will be much appreciated!
It makes me think you have paired-end reads that have been trimmed inappropriately without using paired-end mode.
Also, can you try running it again with
-t gene
and let me know what results you get?This is run with -t gene
I'm not super certain about the trimming because when I got the files, I think they've already been pre-processed a bit. I checked them out using fastqc and they seem fine.
https://imgur.com/n6Zyg1q https://imgur.com/h6JcYzP
Do you have identical number of reads in R1/R2 files?
Can you run the following program from BBMap suite on each file and show the output
readlength.sh in=AN0_1_1.fq.gz in2=AN0_1_2.fq.gz
Can you run the command independently on the two files?
readlength.sh in=AN0_1_1.fq.gz
readlength.sh in=AN0_1_2.fq.gz
These files appear to be non-trimmed and intact.
Did you make sure that these actually have reads from R1 and R2 by inspecting the headers? Hopefully you don't have the same data in the two files.
Also inspect the aligned BAM in IGV to see where the reads are aligning (i.e. are they piling up on exons etc). Do you have rRNA gene models in your GTF? Perhaps you may have a lot of rRNA contamination.
Oh. I did not trim my reads. Do you recommend I try again but with trimmed reads instead? Also, may I ask how you are able to tell that the files "appear to be non-trimmed and intact?"
I did check the R1 and R2 and they are not the same data.
I also went ahead and checked the bam file in IGV. Unfortunately, I do no yet know how to fully interpret what I was seeing.
No need. STAR should be able to handle any adapter sequences present by soft-clipping them.
You can see that the length of reads is identical (150 bp ) in both files and number of bases is also identical. If reads are trimmed to remove adapters there will be a range of sizes present and the number of bases will likely be different in two files.
Find a smallish gene in IGV (which shows exon structure) and post a screen shot of what the alignments look like.
Is this right? https://imgur.com/hXGSf4b
That looks right. Reads piling up under exons and a good number too.
If you are open to trying another counting program out can you try
featureCounts
? It is part of a package calledsubread
(LINK). A tutorial is available: https://subread.sourceforge.net/featureCounts.htmlYeah I think this is a next move for me.
Finally got to try featureCounts. It seems to be better, correct?
https://imgur.com/fxxd4L2
40M reads assigned is great.
Thank you both for all the help! Now I just gotta figure why that was the case in the first place.
From where did you download the .gtf file?
sea.joson: And to add another point to this question, do the chromosome names in your reference and GTF file match perfectly.
Yep. I checked and made sure that the chromosome names used are the same across the board.
https://www.ncbi.nlm.nih.gov/genome/360?genome_assembly_id=968151
Fortunately, the genome has already been worked on. The samples I'm working on are transcriptomes of the same strain grown in different conditions.
Would you give a try with another .gtf for the same organism from another source? Maybe Ensembl?