Using featureCounts in the Rsubread package I am getting 0 annotations.
I started from raw sequencing data and the Refseq genome and Refseq Genomic GTF files downloaded from here: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001635.27/ through the download assembly button on the side. I had the top option to RefSeq for both downloads and chose the Genomic Fasta (fna) for the genome file and Genomic GTF (.gtf) for the annotation file.
I then build the index using:
buildindex(basename="GRCm39", reference="GCF_000001635.27_GRCm39_genomic.fna")
This generated the following files.
GRCm39.00.b.array
GRCm39.00.b.tab
GRCm39.files
GRCm39.log
GRCm39.reads
I then aligned my sequencing to the index using:
align(index="GRCm39", readfile1=allRead1files, readfile2 = allRead2files)
This ran fine and generated 3 files per paired end sequencing eg. for the paired end files named 19143_S24_R1_153.fastq.gz and 19143_S24_R2_153.fastq.gz it output
19143_S24_R1_153.fastq.gz.subread.BAM
19143_S24_R1_153.fastq.gz.subread.BAM.indel.vcf
19143_S24_R1_153.fastq.gz.subread.BAM.summary
Total fragments : 6,925,412 ||
|| Mapped : 6,901,334 (99.7%) ||
|| Uniquely mapped : 6,277,058 ||
|| Multi-mapping : 624,276 ||
|| ||
|| Unmapped : 24,078 ||
|| ||
|| Properly paired : 5,346,711 ||
|| Not properly paired : 1,554,623 ||
|| Singleton : 87,673 ||
|| Chimeric : 71,303 ||
|| Unexpected strandness : 9,749 ||
|| Unexpected fragment length : 1,370,658 ||
|| Unexpected read order : 15,240 ||
|| ||
|| Indels : 27,554 ||
|| ||
|| Running time : 12.3 minutes
with ~76% of read being properly paired. This was consistent for all all samples.
But when I run:
fcs<-featureCounts(files=bam.files, annot.ext="GCF_000001635.27_GRCm39_genomic.gtf", isGTFAnnotationFile = T, GTF.featureType="exon,gene", GTF.attrType="gene_id", isPairedEnd = T, requireBothEndsMapped=T, countChimericFragments=F)
I get:
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 2 BAM files ||
|| ||
|| 19120_S1_R1_107.fastq.gz.subread.BAM ||
|| 19120_S1_R1_108.fastq.gz.subread.BAM ||
|| ||
|| Paired-end : yes ||
|| Count read pairs : yes ||
|| Annotation : GCF_000001635.27_GRCm39_genomic.gtf (GTF) ||
|| Dir for temp files : . ||
|| Threads : 1 ||
|| Level : meta-feature level ||
|| Multimapping reads : counted ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
//================================= Running ==================================\\
|| ||
|| Load annotation file GCF_000001635.27_GRCm39_genomic.gtf ... ||
|| Features : 1571285 ||
|| Meta-features : 50563 ||
|| Chromosomes/contigs : 45 ||
|| ||
|| Process BAM file 19120_S1_R1_107.fastq.gz.subread.BAM... ||
|| Paired-end reads are included. ||
|| Total alignments : 6925412 ||
|| Successfully assigned alignments : 0 (0.0%) ||
|| Running time : 0.25 minutes ||
|| ||
|| Process BAM file 19120_S1_R1_108.fastq.gz.subread.BAM... ||
|| Paired-end reads are included. ||
|| Total alignments : 6790284 ||
|| Successfully assigned alignments : 0 (0.0%) ||
|| Running time : 0.24 minutes ||
|| ||
|| Write the final count table. ||
|| Write the read assignment summary. ||
|| ||
Sorry for the long post but I am totally lost.
Based on other posts it might be that the annotation and BAM files have different chromosome names? I don't know why that would happen when they are downloaded from the same source at the same time. How would I check this?
Thanks in advance.
Are you sure you got GTF format files? NCBI generally makes GFF format files for annotation.
Yes the file is .gtf. Both were available through the download link on the side but Rsubread seems set up for the GTF files, even though it says you can use either.
Can you post a link for the file? I only see GFF files on the genome page.
https://www.ncbi.nlm.nih.gov/assembly/GCF_000001635.27/
If you follow the above link there is a button on the right hand side that says Download Assembly
It creates a dropdown list the top is set to RefSeq and I selected GTF from the lower dropdown list.
Sorry I don't know how to link to that specifically.
Edit: I think the difference is I am on the assembly page.
I see it now. Thanks for pointing that out.
I suggest that you fire up a genome browser (IGV works well) and examine your alignment files. They may need to be sorted and indexed (if subread does not do that automatically). Check to see if your reads are aligning/piling up under exons.
OK I will have a look at that. Subread does not index the BAM files so I'll have to do that first.
Subread will sort and index the BAM files if you specify
sortReadsByCoordinates = TRUE
when runningalign()
.Rather than just reassure people that you got matching files, can you just show a couple of lines from your .bam and your .gtf, to confirm that the chromosome names are the same?
I am not sure how to look into the BAM files. I can open the .gtf with notepad but that doesn't work with the BAM files.
Notepad? Are you running all this without unix at all? You simply aren't going to be able to troubleshoot without knowing some basic unix commands. For starters, google "unix head", "unix pipes" and "samtools"
Unfortunately I was given a PC that I do not have administration rights on. My access to unix is only through the computing cluster and virtual machines.
I'll have to wait for IT to install IGV if I want to use it. I will have to check if the cluster has samtools on it, but I am sure it probably does.
Is there anything in R that would allow me to check the BAM chromosome names? So far I haven't been able to find anything.
Bonita: https://software.broadinstitute.org/software/igv/UserGuide