Generating count table
1
1
Entering edit mode
5.6 years ago
gokberk ▴ 90

Hi all,

I've been trying to calculate TPM values for a count table, thus I've been using biomaRt package to obtain gene lengths. However, I'm using an NCBI annotation file and my count table has gene_name's as row names (I'm using gene_name instead of gene_id, because I receive an error otherwise). Thus, when I try to merge gene length table and my count table, I get an empty data frame (I assume that the gene names in my annotation file and biomaRt provides are not matching or maybe I'm doing wrong something else). Here is how my code looks:

rsubread.counts<-featureCounts(files=E08_bam,
                               annot.ext="/media/gokberk/DATA/Thesis/Data/Annotations/Macaca_fascicularis_5.0_genomic.gtf",
                               isGTFAnnotationFile = TRUE,
                               GTF.featureType="exon",
                               GTF.attrType="gene_name",
                               useMetaFeatures = TRUE,
                               allowMultiOverlap = FALSE)

cnt<-rsubread.counts$counts

ensembl<-useMart("ensembl",dataset="mfascicularis_gene_ensembl",host="www.ensembl.org")

geneLength<-getBM(attributes=c("ensembl_gene_id","ensembl_transcript_id","external_gene_name","transcript_length"),
                  filter = "external_gene_name",
                  values = rownames(cnt),
                  mart = ensembl)

merged<-geneLength %>%
        group_by(ensembl_gene_id,external_gene_name) %>%
        summarise(Length= round(mean(transcript_length))) %>%
        rename(GeneID=external_gene_name)

countdata<-data.frame(GeneID=rownames(cnt),cnt)

countdata<- merged %>%
            dplyr::select(GeneID, Length) %>%
            inner_join(countdata)

After running these, head(countdata) and dim(countdata) commands give following outputs:

# A tibble: 0 x 206
# Groups:   GeneID [0]
# … with 206 variables: GeneID <chr>, Length <dbl>, X.media.gokberk.DATA.Thesis.Data.BAM_files.E08.SRR2930102.bam <int>,
#   X.media.gokberk.DATA.Thesis.Data.BAM_files.E08.SRR2930103.bam <int>, X.media.gokberk.DATA.Thesis.Data.BAM_files.E08.SRR2930104.bam <int>,
#   X.media.gokberk.DATA.Thesis.Data.BAM_files.E08.SRR2930105.bam <int>, X.media.gokberk.DATA.Thesis.Data.BAM_files.E08.SRR2930106.bam <int>,
#   X.media.gokberk.DATA.Thesis.Data.BAM_files.E08.SRR2930107.bam <int>, X.media.gokberk.DATA.Thesis.Data.BAM_files.E08.SRR2930108.bam <int>,
#   X.media.gokberk.DATA.Thesis.Data.BAM_files.E08.SRR2930109.bam <int>, X.media.gokberk.DATA.Thesis.Data.BAM_files.E08.SRR2930110.bam <int>,
#   X.media.gokberk.DATA.Thesis.Data.BAM_files.E08.SRR2930111.bam <int>, X.media.gokberk.DATA.Thesis.Data.BAM_files.E08.SRR2930112.bam <int>,...

[1]   0 206

I was wondering if the issue here is actually the discrepancy between my annotation file and gene names provided by getBM and if so how could I fix it. Thanks a lot!

Cheers, Gökberk

RNA-Seq • 1.4k views
ADD COMMENT
4
Entering edit mode
5.6 years ago
Benn 8.3k

The featureCounts function should give you the correct length of the transcript/gene. There should be a column named length.

ADD COMMENT
0
Entering edit mode

Yes, I found it and solved my problem, thanks a lot!

ADD REPLY

Login before adding your answer.

Traffic: 1927 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6