Rsubread FeatureCounts return 0.0% assigned
1
0
Entering edit mode
3.1 years ago
Bonita ▴ 30

Using featureCounts in the Rsubread package I am getting 0 annotations.

I started from raw sequencing data and the Refseq genome and Refseq Genomic GTF files downloaded from here: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001635.27/ through the download assembly button on the side. I had the top option to RefSeq for both downloads and chose the Genomic Fasta (fna) for the genome file and Genomic GTF (.gtf) for the annotation file.

I then build the index using:

buildindex(basename="GRCm39", reference="GCF_000001635.27_GRCm39_genomic.fna")

This generated the following files.

GRCm39.00.b.array
GRCm39.00.b.tab
GRCm39.files
GRCm39.log
GRCm39.reads

I then aligned my sequencing to the index using:

align(index="GRCm39", readfile1=allRead1files, readfile2 = allRead2files)

This ran fine and generated 3 files per paired end sequencing eg. for the paired end files named 19143_S24_R1_153.fastq.gz and 19143_S24_R2_153.fastq.gz it output

19143_S24_R1_153.fastq.gz.subread.BAM
19143_S24_R1_153.fastq.gz.subread.BAM.indel.vcf
19143_S24_R1_153.fastq.gz.subread.BAM.summary

 Total fragments : 6,925,412                                    ||
||                      Mapped : 6,901,334 (99.7%)                            ||
||             Uniquely mapped : 6,277,058                                    ||
||               Multi-mapping : 624,276                                      ||
||                                                                            ||
||                    Unmapped : 24,078                                       ||
||                                                                            ||
||             Properly paired : 5,346,711                                    ||
||         Not properly paired : 1,554,623                                    ||
||                   Singleton : 87,673                                       ||
||                    Chimeric : 71,303                                       ||
||       Unexpected strandness : 9,749                                        ||
||  Unexpected fragment length : 1,370,658                                    ||
||       Unexpected read order : 15,240                                       ||
||                                                                            ||
||                      Indels : 27,554                                       ||
||                                                                            ||
||                Running time : 12.3 minutes        

with ~76% of read being properly paired. This was consistent for all all samples.

But when I run:

fcs<-featureCounts(files=bam.files, annot.ext="GCF_000001635.27_GRCm39_genomic.gtf", isGTFAnnotationFile = T, GTF.featureType="exon,gene", GTF.attrType="gene_id", isPairedEnd = T, requireBothEndsMapped=T, countChimericFragments=F)

I get:

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 2 BAM files                                      ||
||                                                                            ||
||                           19120_S1_R1_107.fastq.gz.subread.BAM             ||
||                           19120_S1_R1_108.fastq.gz.subread.BAM             ||
||                                                                            ||
||              Paired-end : yes                                              ||
||        Count read pairs : yes                                              ||
||              Annotation : GCF_000001635.27_GRCm39_genomic.gtf (GTF)        ||
||      Dir for temp files : .                                                ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||      Multimapping reads : counted                                          ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||




//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file GCF_000001635.27_GRCm39_genomic.gtf ...               ||
||    Features : 1571285                                                      ||
||    Meta-features : 50563                                                   ||
||    Chromosomes/contigs : 45                                                ||
||                                                                            ||
|| Process BAM file 19120_S1_R1_107.fastq.gz.subread.BAM...                   ||
||    Paired-end reads are included.                                          ||
||    Total alignments : 6925412                                              ||
||    Successfully assigned alignments : 0 (0.0%)                             ||
||    Running time : 0.25 minutes                                             ||
||                                                                            ||
|| Process BAM file 19120_S1_R1_108.fastq.gz.subread.BAM...                   ||
||    Paired-end reads are included.                                          ||
||    Total alignments : 6790284                                              ||
||    Successfully assigned alignments : 0 (0.0%)                             ||
||    Running time : 0.24 minutes                                             ||
||                                                                            ||
|| Write the final count table.                                               ||
|| Write the read assignment summary.                                         ||
||                                                                            ||

Sorry for the long post but I am totally lost.

Based on other posts it might be that the annotation and BAM files have different chromosome names? I don't know why that would happen when they are downloaded from the same source at the same time. How would I check this?

Thanks in advance.

rsubread featurecounts bam rna-seq r • 4.4k views
ADD COMMENT
0
Entering edit mode

Are you sure you got GTF format files? NCBI generally makes GFF format files for annotation.

ADD REPLY
0
Entering edit mode

Yes the file is .gtf. Both were available through the download link on the side but Rsubread seems set up for the GTF files, even though it says you can use either.

ADD REPLY
0
Entering edit mode

Can you post a link for the file? I only see GFF files on the genome page.

ADD REPLY
0
Entering edit mode

https://www.ncbi.nlm.nih.gov/assembly/GCF_000001635.27/

If you follow the above link there is a button on the right hand side that says Download Assembly

It creates a dropdown list the top is set to RefSeq and I selected GTF from the lower dropdown list.

Sorry I don't know how to link to that specifically.

Edit: I think the difference is I am on the assembly page.

ADD REPLY
0
Entering edit mode

I see it now. Thanks for pointing that out.

I suggest that you fire up a genome browser (IGV works well) and examine your alignment files. They may need to be sorted and indexed (if subread does not do that automatically). Check to see if your reads are aligning/piling up under exons.

ADD REPLY
0
Entering edit mode

OK I will have a look at that. Subread does not index the BAM files so I'll have to do that first.

ADD REPLY
1
Entering edit mode

Subread will sort and index the BAM files if you specify sortReadsByCoordinates = TRUE when running align().

ADD REPLY
0
Entering edit mode

Rather than just reassure people that you got matching files, can you just show a couple of lines from your .bam and your .gtf, to confirm that the chromosome names are the same?

ADD REPLY
0
Entering edit mode

I am not sure how to look into the BAM files. I can open the .gtf with notepad but that doesn't work with the BAM files.

ADD REPLY
0
Entering edit mode

Notepad? Are you running all this without unix at all? You simply aren't going to be able to troubleshoot without knowing some basic unix commands. For starters, google "unix head", "unix pipes" and "samtools"

ADD REPLY
0
Entering edit mode

Unfortunately I was given a PC that I do not have administration rights on. My access to unix is only through the computing cluster and virtual machines.

I'll have to wait for IT to install IGV if I want to use it. I will have to check if the cluster has samtools on it, but I am sure it probably does.

Is there anything in R that would allow me to check the BAM chromosome names? So far I haven't been able to find anything.

ADD REPLY
1
Entering edit mode
3.1 years ago
Gordon Smyth ★ 7.7k

If there was a mismatch in the chromosome names between the FASTA/BAM files on one hand and the GTF file the other, then Rsubread would have already told you about that as part of the featureCounts output. You should see lines like this:

|| Chromosomes/contigs in annotation but not in index :                       ||
||    NT_166281.1                                                             ||
||    NT_187056.1                                                             ||
||    NT_187054.1                                                             ||

and

|| Chromosomes/contigs in index but not in annotation :                       ||
||    chr4_JH584292_random                                                    ||
||    chr5_JH584297_random                                                    ||

I wonder whether there could have been lines like this that you have truncated from the output shown in your question?

BTW, there are plenty of tools available in Rsubread and R for Windows to read FASTQ, GTF and BAM files. The Rsubread and Rsamtools packages provide many R functions that deliver equivalent functionality to what you might find in samtools for Unix.

ADD COMMENT
0
Entering edit mode

Thank you for the advice about the R packages, I will download them and play around with that.

There were no lines like that in the output. It finished fine, I just didn't copy the last line of ============================ from the file.

I think I will give up on trying to use gtf file and just use the built in annotations.

Thanks for all the help though.

ADD REPLY

Login before adding your answer.

Traffic: 1612 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6