HTSeq-Count: no_feature too high?
0
0
Entering edit mode
2.1 years ago
sea.joson ▴ 10

Hello!

I am learning how to do differential expression. I was given transcriptome data and I am fortunate that the sample already has an established annotated genome. So far, I aligned the transcripts to the reference genome via STAR and I've had good mapping stats (>95%). Now, I'm quantifying the counts with HTSeq-Count. I expected it to be a relatively smooth process, but the results have been bugging me.

Here is the line I used:

htseq-count -f bam -r pos -t exon $RNA/AN0_1.Aligned.sortedByCoord.out.bam $GEN > AN0_1.no.exon.counts

The results:

__no_feature    14805713
__ambiguous     4766268
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  113292

I feel like that's a high value for no_features right?

I've tried playing around with the strandedness (-s), but all options give similar results. I also went back and made sure the GTF file has the gene_id attribute. I also tried using a GFF file instead.

Might it also have to do with this warning sign from the err file?

Warning: Mate pairing was ambiguous for 23164 records; mate key for first such record: ('A00564:87:HJKFCDSXX:2:2205:13838:9330', 'first', 'CP059858.1', 91661, 'CP059858.1', 91831, 317).

Any help will be much appreciated!

htseq-count • 3.3k views
ADD COMMENT
0
Entering edit mode

Warning: Mate pairing was ambiguous

It makes me think you have paired-end reads that have been trimmed inappropriately without using paired-end mode.

Also, can you try running it again with -t gene and let me know what results you get?

ADD REPLY
0
Entering edit mode

This is run with -t gene

__no_feature    14786206
__ambiguous     4691
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  113292

*Warning: Mate pairing was ambiguous for 23164 records; mate key for first such record: ('A00564:87:HJKFCDSXX:2:2205:13838:9330', 'first', 'CP059858.1', 91661, 'CP059858.1', 91831, 317).*

I'm not super certain about the trimming because when I got the files, I think they've already been pre-processed a bit. I checked them out using fastqc and they seem fine.

https://imgur.com/n6Zyg1q https://imgur.com/h6JcYzP

ADD REPLY
0
Entering edit mode

Do you have identical number of reads in R1/R2 files?

Can you run the following program from BBMap suite on each file and show the output

$ readlength.sh -Xmx2g in=your_file.fastq.gz
Executing jgi.MakeLengthHistogram [-Xmx2g, in=your_file.fastq.gz]


Processed 36129 reads.
#Reads: 36129
#Bases: 10838700
#Max:   300
#Min:   300
#Avg:   300.0
#Median:        300
#Mode:  300
#Std_Dev:       0.0
ADD REPLY
0
Entering edit mode

readlength.sh in=AN0_1_1.fq.gz in2=AN0_1_2.fq.gz

Processed 56800750 reads.
#Reads: 56800750
#Bases: 8520112500
#Max:   150
#Min:   150
#Avg:   150.0
#Median:        150
#Mode:  150
#Std_Dev:       0.0
#Read Length Histogram:
#Length reads   pct_reads       cum_reads       cum_pct_reads   bases   pct_bases       cum_bases       cum_pct_bases
150     56800750        100.000%        56800750        100.000%        8520112500      100.000%        8520112500      100.000%
ADD REPLY
0
Entering edit mode

Can you run the command independently on the two files?

ADD REPLY
0
Entering edit mode

readlength.sh in=AN0_1_1.fq.gz

Processed 28400375 reads.
#Reads: 28400375
#Bases: 4260056250
#Max:   150
#Min:   150
#Avg:   150.0
#Median:        150
#Mode:  150
#Std_Dev:       0.0
#Read Length Histogram:
#Length reads   pct_reads       cum_reads       cum_pct_reads   bases   pct_bases       cum_bases       cum_pct_bases
150     28400375        100.000%        28400375        100.000%        4260056250      100.000%        4260056250      100.000%
Time:   67.509 seconds.

readlength.sh in=AN0_1_2.fq.gz

Processed 28400375 reads.
#Reads: 28400375
#Bases: 4260056250
#Max:   150
#Min:   150
#Avg:   150.0
#Median:        150
#Mode:  150
#Std_Dev:       0.0
#Read Length Histogram:
#Length reads   pct_reads       cum_reads       cum_pct_reads   bases   pct_bases       cum_bases       cum_pct_bases
150     28400375        100.000%        28400375        100.000%        4260056250      100.000%        4260056250      100.000%
Time:   69.977 seconds.
ADD REPLY
0
Entering edit mode

These files appear to be non-trimmed and intact.

Did you make sure that these actually have reads from R1 and R2 by inspecting the headers? Hopefully you don't have the same data in the two files.

Also inspect the aligned BAM in IGV to see where the reads are aligning (i.e. are they piling up on exons etc). Do you have rRNA gene models in your GTF? Perhaps you may have a lot of rRNA contamination.

ADD REPLY
0
Entering edit mode

Oh. I did not trim my reads. Do you recommend I try again but with trimmed reads instead? Also, may I ask how you are able to tell that the files "appear to be non-trimmed and intact?"

I did check the R1 and R2 and they are not the same data.

I also went ahead and checked the bam file in IGV. Unfortunately, I do no yet know how to fully interpret what I was seeing.

ADD REPLY
0
Entering edit mode

Do you recommend I try again but with trimmed reads instead?

No need. STAR should be able to handle any adapter sequences present by soft-clipping them.

Also, may I ask how you are able to tell that the files "appear to be non-trimmed and intact?"

You can see that the length of reads is identical (150 bp ) in both files and number of bases is also identical. If reads are trimmed to remove adapters there will be a range of sizes present and the number of bases will likely be different in two files.

I do no yet know how to fully interpret what I was seeing.

Find a smallish gene in IGV (which shows exon structure) and post a screen shot of what the alignments look like.

ADD REPLY
0
Entering edit mode

Is this right? https://imgur.com/hXGSf4b

ADD REPLY
0
Entering edit mode

That looks right. Reads piling up under exons and a good number too.

If you are open to trying another counting program out can you try featureCounts? It is part of a package called subread (LINK). A tutorial is available: https://subread.sourceforge.net/featureCounts.html

ADD REPLY
0
Entering edit mode

Yeah I think this is a next move for me.

ADD REPLY
0
Entering edit mode

Finally got to try featureCounts. It seems to be better, correct?

Status  AN0_1.Aligned.sortedByCoord.out.bam
Assigned        45141493
Unassigned_Unmapped     0
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality       0
Unassigned_Chimera      0
Unassigned_FragmentLength       0
Unassigned_Duplicate    0
Unassigned_MultiMapping 226584
Unassigned_Secondary    0
Unassigned_NonSplit     0
Unassigned_NoFeatures   8469135
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    24850
ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

40M reads assigned is great.

ADD REPLY
0
Entering edit mode

Thank you both for all the help! Now I just gotta figure why that was the case in the first place.

ADD REPLY
0
Entering edit mode

From where did you download the .gtf file?

ADD REPLY
0
Entering edit mode

sea.joson: And to add another point to this question, do the chromosome names in your reference and GTF file match perfectly.

ADD REPLY
0
Entering edit mode

Yep. I checked and made sure that the chromosome names used are the same across the board.

ADD REPLY
0
Entering edit mode

https://www.ncbi.nlm.nih.gov/genome/360?genome_assembly_id=968151

Fortunately, the genome has already been worked on. The samples I'm working on are transcriptomes of the same strain grown in different conditions.

ADD REPLY
0
Entering edit mode

Would you give a try with another .gtf for the same organism from another source? Maybe Ensembl?

ADD REPLY

Login before adding your answer.

Traffic: 2441 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6