I am attempting to do an RNA-seq analysis with 3 SRA files, using HISAT2 to align and then featureCount to quantify transcripts. Organism is Arabidopsis thaliana. I'm making this post because I'm having a lot of trouble trying to quantify transcripts with featureCount, where all of the reads are unassigned and the program output contains 0 assigned reads. Here's what I've done so far:
First I downloaded the tair10 reference genome which I built into an index with:
hisat2-build -p 16 genome.fa genome
I entered my paired-end reads in with the command:
hisat2 -x <path_to_index> -1 trimmed_paired_SRR_1.fastq -2 trimmed_paired_SRR_2.fastq -U trimmed_unpaired_SRR_1.fastq,trimmed_unpaired_SRR_2.fastq -S hit_SRR
This resulted in SAM output that looked like this (feels important to include this though I'm not sure whether there's a better way to format it for comprehensibility):
more hit_SRR.sam
@HD VN:1.0 SO:unsorted
@SQ SN:1 LN:30427671
@SQ SN:2 LN:19698289
@SQ SN:3 LN:23459830
@SQ SN:4 LN:18585056
@SQ SN:5 LN:26975502
@SQ SN:mitochondria LN:366924
@SQ SN:chloroplast LN:154478
@PG ID:hisat2 PN:hisat2 VN:2.1.0 CL:"hisat2 -x <path_to_index> -1 trimmed_paired_SRR_1.fastq -2 trimmed_paired_SRR_2.fastq -U trimmed_unpaired_SRR_1.fastq,trimmed_unpaired_SRR_2.fastq -S hit_SRR"
SRR.1 83 1 28116855 60 17M163N133M = 28116833 -291 TTCAGTCCATGGAACTCCTCGTTTCCTCTCGATTTTGCCGTTAGAAGAAGACAT
GGGAAGAGCTTCATCCGCCGAGGCGTAACCGGCCGGGGTTTTGTTCATATCTTGTTCATCTCTATCTTCGCCGTCGATCTTTGGAGTCTCCGCCGT FFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF,FFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:1
50 YS:i:0 YT:Z:CP XS:A:- NH:i:1
SRR.1 163 1 28116833 60 39M163N111M = 28116855 291 GCAAGAACAGCTTGTGTTCTTCTTCAGTCCATGGAACTCCTCGTTTCCTCTCGA
TTTTGCCGTTAGAAGAAGACATGGGAAGAGCTTCATCCGCCGAGGCGTAACCGGCCGGGGTTTTGTTCATATCTTGTTCATCTCTATCTTCGCCGT FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF,FFFFFFFFFFF:FFFFFF
FFFFFFFFFFFFFFFFFFFF:FFFFF,FFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:1
50 YS:i:0 YT:Z:CP XS:A:- NH:i:1
SRR.2 99 5 20218748 60 1S135M76N14M = 20218806 167 TTATCTTAAAGCGAAATAGTATAGTGGAAAGCTTAGACTTTGGTACAGACGTTA
GCAGCCAGTTGGAGTTTGTGGATTTACAATACAATGAAATAACTGATTATAAACCATCAGCTAATAAAGTCCTCCAAGTAATATTGGCAAATAATC FFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:-1 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:1
49 YS:i:0 YT:Z:CP XS:A:+ NH:i:1
Then I converted the output SAM to sorted BAM with
samtools view -S -b hit_SRR.sam > hit_SRR.bam
samtools sort hit_SRR.bam -o hit_SRR.sorted.bam
Then I tried to quantify my transcripts with featureCounts, which is where I'm having a lot of trouble.
First, I understand featureCounts requires a full annotation file for the reference genes in gtf input and I downloaded the Araport11 annotations gtf that is based on the TAIR10 genome that I used for my index. This file looks like this:
head araport11_genes.gtf
Chr1 Araport11 five_prime_UTR 3631 3759 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1 Araport11 exon 3631 3913 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1 Araport11 transcript 3631 5899 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1 Araport11 CDS 3760 3913 . + 0 transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1 Araport11 exon 3996 4276 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1 Araport11 CDS 3996 4276 . + 2 transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1 Araport11 exon 4486 4605 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1 Araport11 CDS 4486 4605 . + 0 transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1 Araport11 exon 4706 5095 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1 Araport11 CDS 4706 5095 . + 0 transcript_id "AT1G01010.1"; gene_id "AT1G01010";
So, my usage of featureCounts is:
featureCounts -T 4 -p -a araport11_genes.gtf -o filename.featureCounts.txt hit_SRR1.sorted.bam hit_SRR2.sorted.bam hit_SRR3.sorted.bam
This yields the output:
more filename.featureCounts.txt.summary
Status hit_SRR1.sorted.bam hit_SRR2sorted.bam hit_SRR3.sorted.bam
Assigned 0 0 0
Unassigned_Ambiguity 0 0 0
Unassigned_MultiMapping 3696122 1999717 1862896
Unassigned_NoFeatures 26964381 30128594 20699201
Unassigned_Unmapped 151924 155922 111851
Unassigned_MappingQuality 0 0 0
Unassigned_FragmentLength 0 0 0
Unassigned_Chimera 0 0 0
Unassigned_Secondary 0 0 0
Unassigned_Nonjunction 0 0 0
Unassigned_Duplicate 0 0 0
And the more extended output:
more <outputfile>
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
v1.5.0-p2
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 3 BAM files ||
|| P hit_SRR1.sorted.bam ||
|| P hit_SRR2.sorted.bam ||
|| P hit_SRR3.sorted.bam ||
|| ||
|| Output file : filename.featureCounts-rfmt_options.txt ||
|| Summary : filename.featureCounts-rfmt_options.txt.s ... ||
|| Annotation : araport11_genes.gtf (GTF) ||
|| ||
|| Threads : 4 ||
|| Level : meta-feature level ||
|| Paired-end : yes ||
|| Strand specific : no ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : not counted ||
|| ||
|| Chimeric reads : counted ||
|| Both ends mapped : not required ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
//================================= Running ==================================\\
|| ||
|| Load annotation file araport11_genes.gtf ... ||
|| Features : 323164 ||
|| Meta-features : 38113 ||
|| Chromosomes/contigs : 7 ||
|| ||
|| Process BAM file hit_SRR1.sorted.bam... ||
|| Paired-end reads are included. ||
|| Assign fragments (read pairs) to features... ||
|| ||
|| WARNING: reads from the same pair were found not adjacent to each ||
|| other in the input (due to read sorting by location or ||
|| reporting of multi-mapping read pairs). ||
|| ||
|| Read re-ordering is performed. ||
|| ||
|| Both single-end and paired-end reads were found. ||
|| Total fragments : 30812427 ||
|| Successfully assigned fragments : 0 (0.0%) ||
|| Running time : 0.64 minutes ||
|| ||
|| Process BAM file hit_SRR2.sorted.bam... ||
|| Paired-end reads are included. ||
|| Assign fragments (read pairs) to features... ||
|| ||
|| WARNING: reads from the same pair were found not adjacent to each ||
|| other in the input (due to read sorting by location or ||
|| reporting of multi-mapping read pairs). ||
|| ||
|| Read re-ordering is performed. ||
|| ||
|| Total fragments : 32284233 ||
|| Successfully assigned fragments : 0 (0.0%) ||
|| Running time : 0.65 minutes ||
|| ||
|| Process BAM file hit_SRR3.sorted.bam... ||
|| Paired-end reads are included. ||
|| Assign fragments (read pairs) to features... ||
|| ||
|| WARNING: reads from the same pair were found not adjacent to each ||
|| other in the input (due to read sorting by location or ||
|| reporting of multi-mapping read pairs). ||
|| ||
|| Read re-ordering is performed. ||
|| ||
|| Total fragments : 22673948 ||
|| Successfully assigned fragments : 0 (0.0%) ||
|| Running time : 0.45 minutes ||
|| ||
|| Read assignment finished. ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
So, can anyone help me understand why I'm getting exactly 0 assigned reads?
I think it's a somewhat simple fix that I haven't realized yet, as it's not a problem of only some reads being assigned- the problem is that zero are being assigned. Yet some of them are mapping to multiple places and most are mapping to no features- how could that be happening? I don't see how that's possible given my SAM that is specifying fragment aligning locations and my reference annotation GTF that specifies the annotation location information?
It seems likely to me that I have:
- Failed to provide any readers with sufficient info about some facet of this that I don't realize is important- please let me know what details might help anyone that can give me advice.
- Overlooked some simple solution that's causing all of my problems (optimistic).
Finally, I have tried googling this problem. Relevant threads that didn't help me solve my problem include:
- Question: featureCounts: 0% successfully assigned fragments on multiple PE .BAM files
- Question: Extremely low assigned reads in featureCounts
- Question: Rsubread featureCounts returns no matches
- Question: Featurecounts for paired end RNA-seq reads not generating counts
Thanks for any help or insights!
not sure this is the cause of the issue(s), but in the
samtools
sort step, make sure you order them on name rather than position.Hi Lieven.Sterck, I hadn't heard about this before. I googled the question and found this biostars thread, but I don't think I understand the downstream significance of sorting by chromosomal position vs read name. It seems like sometimes it's better one way and other times it's better the other. Am I right in supposing that for differential expression analysis, it's better to sort by read name because it retains read pairing, which is important for only counting a fragment once rather than twice if the reads are separated? Thanks for your comment
I think Brian's answer is pretty clear though.
in a nutshell: if the goal is to display reads on a chromosome fragment (eg in a genome browser or such) you want position sorting as it is important (or more efficient) that the browser can quickly access all reads for a certain position. If the goal is to check for correct pairing (eg. DEG analysis, counting the numbers of correct mapped reads, ... ) then you are better of with name sorting as this makes the whole process more efficient (same read names will be 'next to each other' in the sorted file).
You do not have the risk of counting them twice (at least not for most modern tools) but as said it's just more efficient, moreover some tools require it because they don't spend the time looking for the 'mate' further down the file