Before getting to your question, it's generally a bad idea to count by gene symbol (or even UCSC ID). They're not actually unique, so you end up counting multiple things (the Ensembl annotations make much more sense!).
This is more of a work around than a direct answer, since I've never looked to see where the ucscTableQuery() function actually sources its data. Having said that, once it gets the data from UCSC, the getlength() function in goseq just adds up the exons. Assuming you have a GTF and fasta file for mm10 sitting around somewhere (quite likely if you're doing RNAseq), then something like the following will work. This also outputs %GC content, in case you ever need to use the cqn package. The original version of this used gene_id to calculate everything over, so I've quickly changed the instances of that to gene_name. I've not debugged the version pasted below, but it should more or less work.
library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)
GTFfile = "something.gtf"
FASTAfile = "something.fa"
#Load the annotation and reduce it
GTF <- import.gff(GTFfile, format="gtf", genome="mm10", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_name))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_name <- rep(names(grl), elementLengths(grl))
#Open the fasta file
FASTA <- FaFile(FASTAfile)
open(FASTA)
#Add the GC numbers
elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
elementMetadata(reducedGTF)$widths <- width(reducedGTF)
#Create a list of the ensembl_id/GC/length
calc_GC_length <- function(x) {
nGCs = sum(elementMetadata(x)$nGCs)
width = sum(elementMetadata(x)$widths)
c(width, nGCs/width)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_name), calc_GC_length))
colnames(output) <- c("Length", "GC")
write.table(output, file="GC_lengths.tsv", sep="\t")
How would you load the
GC_lengths.tsv
file intogoseq
?read.delim()
, as you would any other tab-separated file.Would you be able to share more of your code how you used
goseq
because here it seem they did it differently but their code did not work for me?They just did some extra work because they don't know how to use R, it's the same otherwise. For goseq, just follow its manual.
Unfortunately, in the manual I found only the explanation for : * 5.1 Length data format (your code is working for it) * 5.2 Category mapping format but not how to actually use it for an non model species.
You put the lengths in
nullp()
as thebias.data=
parameter.Should I merge
sp0000001-mRNA-x:exon:xxx
together in order to getsp0000001
which would represent the gene length?My gff file looks like this:
My code was made for GTF files, you'll need to change it to match yours.
Would you be able to share few lines of your GTF file and few lines of the output file created by your script?
Download any mammalian GTF file from Ensembl for a good example.
Apparently,
nullp
expects a vector for bias.data, not adata.frame
. What is the best way to convert it to a vector after I read it withread.delim()
?Give it the appropriate vector from the dataframe (e.g.,
d$length
). I should note that you might have to either (A) ensure the order of the dataframe/length-vector matches the rest of the data you input and/or (B) ensure you vector has labels that match the rest of the data (in casenullp
handles this). I think (A) suffices, but it's been a while so if you get an error message then you probably need (B) too.