Question

Rsubread FeatureCounts return 0.0% assigned

0

Entering edit mode

3.1 years ago

Bonita ▴ 30

Using featureCounts in the Rsubread package I am getting 0 annotations.

I started from raw sequencing data and the Refseq genome and Refseq Genomic GTF files downloaded from here: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001635.27/ through the download assembly button on the side. I had the top option to RefSeq for both downloads and chose the Genomic Fasta (fna) for the genome file and Genomic GTF (.gtf) for the annotation file.

I then build the index using:

buildindex(basename="GRCm39", reference="GCF_000001635.27_GRCm39_genomic.fna")

This generated the following files.

GRCm39.00.b.array
GRCm39.00.b.tab
GRCm39.files
GRCm39.log
GRCm39.reads

I then aligned my sequencing to the index using:

align(index="GRCm39", readfile1=allRead1files, readfile2 = allRead2files)

This ran fine and generated 3 files per paired end sequencing eg. for the paired end files named 19143_S24_R1_153.fastq.gz and 19143_S24_R2_153.fastq.gz it output

19143_S24_R1_153.fastq.gz.subread.BAM
19143_S24_R1_153.fastq.gz.subread.BAM.indel.vcf
19143_S24_R1_153.fastq.gz.subread.BAM.summary

 Total fragments : 6,925,412                                    ||
||                      Mapped : 6,901,334 (99.7%)                            ||
||             Uniquely mapped : 6,277,058                                    ||
||               Multi-mapping : 624,276                                      ||
||                                                                            ||
||                    Unmapped : 24,078                                       ||
||                                                                            ||
||             Properly paired : 5,346,711                                    ||
||         Not properly paired : 1,554,623                                    ||
||                   Singleton : 87,673                                       ||
||                    Chimeric : 71,303                                       ||
||       Unexpected strandness : 9,749                                        ||
||  Unexpected fragment length : 1,370,658                                    ||
||       Unexpected read order : 15,240                                       ||
||                                                                            ||
||                      Indels : 27,554                                       ||
||                                                                            ||
||                Running time : 12.3 minutes

with ~76% of read being properly paired. This was consistent for all all samples.

But when I run:

fcs<-featureCounts(files=bam.files, annot.ext="GCF_000001635.27_GRCm39_genomic.gtf", isGTFAnnotationFile = T, GTF.featureType="exon,gene", GTF.attrType="gene_id", isPairedEnd = T, requireBothEndsMapped=T, countChimericFragments=F)

I get:

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 2 BAM files                                      ||
||                                                                            ||
||                           19120_S1_R1_107.fastq.gz.subread.BAM             ||
||                           19120_S1_R1_108.fastq.gz.subread.BAM             ||
||                                                                            ||
||              Paired-end : yes                                              ||
||        Count read pairs : yes                                              ||
||              Annotation : GCF_000001635.27_GRCm39_genomic.gtf (GTF)        ||
||      Dir for temp files : .                                                ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||      Multimapping reads : counted                                          ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||




//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file GCF_000001635.27_GRCm39_genomic.gtf ...               ||
||    Features : 1571285                                                      ||
||    Meta-features : 50563                                                   ||
||    Chromosomes/contigs : 45                                                ||
||                                                                            ||
|| Process BAM file 19120_S1_R1_107.fastq.gz.subread.BAM...                   ||
||    Paired-end reads are included.                                          ||
||    Total alignments : 6925412                                              ||
||    Successfully assigned alignments : 0 (0.0%)                             ||
||    Running time : 0.25 minutes                                             ||
||                                                                            ||
|| Process BAM file 19120_S1_R1_108.fastq.gz.subread.BAM...                   ||
||    Paired-end reads are included.                                          ||
||    Total alignments : 6790284                                              ||
||    Successfully assigned alignments : 0 (0.0%)                             ||
||    Running time : 0.24 minutes                                             ||
||                                                                            ||
|| Write the final count table.                                               ||
|| Write the read assignment summary.                                         ||
||                                                                            ||

Sorry for the long post but I am totally lost.

Based on other posts it might be that the annotation and BAM files have different chromosome names? I don't know why that would happen when they are downloaded from the same source at the same time. How would I check this?

Thanks in advance.

rsubread featurecounts bam rna-seq r • 4.4k views

ADD COMMENT • link updated 3.1 years ago by GenoMax 147k • written 3.1 years ago by Bonita ▴ 30

0

Entering edit mode

Are you sure you got GTF format files? NCBI generally makes GFF format files for annotation.

ADD REPLY • link 3.1 years ago by GenoMax 147k

0

Entering edit mode

Yes the file is .gtf. Both were available through the download link on the side but Rsubread seems set up for the GTF files, even though it says you can use either.

ADD REPLY • link 3.1 years ago by Bonita ▴ 30

0

Entering edit mode

Can you post a link for the file? I only see GFF files on the genome page.

ADD REPLY • link 3.1 years ago by GenoMax 147k

0

Entering edit mode

https://www.ncbi.nlm.nih.gov/assembly/GCF_000001635.27/

If you follow the above link there is a button on the right hand side that says Download Assembly

It creates a dropdown list the top is set to RefSeq and I selected GTF from the lower dropdown list.

Sorry I don't know how to link to that specifically.

Edit: I think the difference is I am on the assembly page.

ADD REPLY • link 3.1 years ago by Bonita ▴ 30

0

Entering edit mode

I see it now. Thanks for pointing that out.

I suggest that you fire up a genome browser (IGV works well) and examine your alignment files. They may need to be sorted and indexed (if subread does not do that automatically). Check to see if your reads are aligning/piling up under exons.

ADD REPLY • link 3.1 years ago by GenoMax 147k

0

Entering edit mode

OK I will have a look at that. Subread does not index the BAM files so I'll have to do that first.

ADD REPLY • link 3.1 years ago by Bonita ▴ 30

1

Entering edit mode

Subread will sort and index the BAM files if you specify sortReadsByCoordinates = TRUE when running align().

ADD REPLY • link 3.1 years ago by Gordon Smyth ★ 7.7k

0

Entering edit mode

Rather than just reassure people that you got matching files, can you just show a couple of lines from your .bam and your .gtf, to confirm that the chromosome names are the same?

ADD REPLY • link 3.1 years ago by swbarnes2 14k

0

Entering edit mode

I am not sure how to look into the BAM files. I can open the .gtf with notepad but that doesn't work with the BAM files.

ADD REPLY • link 3.1 years ago by Bonita ▴ 30

0

Entering edit mode

Notepad? Are you running all this without unix at all? You simply aren't going to be able to troubleshoot without knowing some basic unix commands. For starters, google "unix head", "unix pipes" and "samtools"

ADD REPLY • link 3.1 years ago by swbarnes2 14k

0

Entering edit mode

Unfortunately I was given a PC that I do not have administration rights on. My access to unix is only through the computing cluster and virtual machines.

I'll have to wait for IT to install IGV if I want to use it. I will have to check if the cluster has samtools on it, but I am sure it probably does.

Is there anything in R that would allow me to check the BAM chromosome names? So far I haven't been able to find anything.

ADD REPLY • link 3.1 years ago by Bonita ▴ 30

0

Entering edit mode

Bonita: https://software.broadinstitute.org/software/igv/UserGuide

ADD REPLY • link 3.1 years ago by GenoMax 147k

score 1 · Answer 1 · 2021-10-17

If there was a mismatch in the chromosome names between the FASTA/BAM files on one hand and the GTF file the other, then Rsubread would have already told you about that as part of the featureCounts output. You should see lines like this:

|| Chromosomes/contigs in annotation but not in index :                       ||
||    NT_166281.1                                                             ||
||    NT_187056.1                                                             ||
||    NT_187054.1                                                             ||

and

|| Chromosomes/contigs in index but not in annotation :                       ||
||    chr4_JH584292_random                                                    ||
||    chr5_JH584297_random                                                    ||

I wonder whether there could have been lines like this that you have truncated from the output shown in your question?

BTW, there are plenty of tools available in Rsubread and R for Windows to read FASTQ, GTF and BAM files. The Rsubread and Rsamtools packages provide many R functions that deliver equivalent functionality to what you might find in samtools for Unix.