Hi,
I would like to analyse the content of a GTF file. I am quite able with R and dplyr, so I would like to transform my GTF file into a data frame to facilitate my analysis. Does anybody know of any tool to do this?
Thanks. Best, C.
Try gtf_df <- as.data.frame(gtf) after importing the file with the import() function from rtracklayer.
The code would be:
gtf <- rtracklayer::import('celegans.gtf')
gtf_df=as.data.frame(gtf)
Hi,
This worked:
gtf2 <- read.table('celegans.gtf', header = FALSE, sep = '\t')
gff <- read.delim('wormbase.gff3', header = FALSE, sep = '\t', skip = 8)
I forgot to specify the tab delimiter in the read.table() function. I thought that it was the default but it isn't.
The answer from cpad0112 is much better, though: his approach puts each attribute from the 9th column into its own column, whereas mine leaves all of the meta information in a single column.
Best, C.
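For reference, here is a minimal sketch of the read.table() route with the nine standard GTF columns named up front. The column names, the in-memory example records and the IDs in them are made up for illustration. Setting quote = "" is an assumption that the attribute column should be kept verbatim, including its double quotes and any apostrophes.

```r
# Invented two-record GTF with one '#' header line, held in memory
# so the sketch runs without a file on disk.
gtf_text <- paste(
  "#!hypothetical-header example",
  'chrI\ttest\tgene\t100\t500\t.\t+\t.\tgene_id "gene1"; gene_name "abc";',
  'chrI\ttest\texon\t100\t200\t.\t+\t.\tgene_id "gene1"; transcript_id "tx1";',
  sep = "\n"
)

# Assumed names for the nine tab-separated GTF fields
gtf_cols <- c("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attribute")

gtf_df <- read.table(
  text = gtf_text,            # replace with the path to a real GTF file
  sep = "\t",                 # the default sep = "" splits on any whitespace
  header = FALSE,
  col.names = gtf_cols,
  colClasses = c("character", "character", "character", "integer", "integer",
                 "character", "character", "character", "character"),
  comment.char = "#",         # skip '#' header lines (the read.table default)
  quote = ""                  # do not treat quotes in the attributes as delimiters
)

str(gtf_df)
```

Note that read.table() uses comment.char = "#" by default, whereas read.delim() sets comment.char = "", which is probably why the GFF3 call above needed skip.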
No idea about the performance of rtracklayer::import; most likely it's pretty optimized, but just in case you want to forgo that package, here's how I did it for educational purposes (getting to know the GTF format):
library(data.table)
genes <- fread("gencode.basic.gtf")
setnames(genes, names(genes), c("chr","source","type","start","end","score","strand","phase","attributes") )
# [optional] focus, for example, only on entries of type "gene",
# which will drastically reduce the file size
genes <- genes[type == "gene"]
# the problem is the attributes column that tends to be a collection
# of the bits of information you're actually interested in
# in order to pull out just the information I want based on the
# tag name, e.g. "gene_id", I have the following function:
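extract_attributes <- function(gtf_attributes, att_of_interest){
  # split the attribute string into its individual "tag value" pairs
  att <- strsplit(gtf_attributes, "; ")
  # strip the double quotes around the values
  att <- gsub("\"", "", unlist(att))
  # split the matching pair on the space and return the value,
  # or NA if the tag is not present
  if(!is.null(unlist(strsplit(att[grep(att_of_interest, att)], " ")))){
    return(unlist(strsplit(att[grep(att_of_interest, att)], " "))[2])
  }else{
    return(NA)
  }
}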
# this is how to, for example, extract the values for the attributes of interest (here: "gene_id")
genes$gene_id <- unlist(lapply(genes$attributes, extract_attributes, "gene_id"))
Thanks Friederike, great answer, that was very useful. When I tried to extract the last field of the attributes it still had ; appended. To solve that and speed up the function a bit, I adjusted it as follows:
extract_attributes <- function(gtf_attributes, att_of_interest){
  att <- unlist(strsplit(gtf_attributes, " "))
  if(att_of_interest %in% att){
    return(gsub("\"|;", "", att[which(att %in% att_of_interest) + 1]))
  } else {
    return(NA)
  }
}
This can be used exactly as above, with unlist() and lapply().
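As a possible alternative to calling the function once per row, a regular expression can pull one attribute out of the whole column in a single vectorised pass. This is only a sketch: extract_attribute_vec is a hypothetical name, the example strings are invented, and it assumes that the tag name contains no regex metacharacters and that the values contain no semicolons or embedded quotes.

```r
# Hypothetical helper: extract one attribute from a whole vector of
# GTF attribute strings at once.
extract_attribute_vec <- function(gtf_attributes, att_of_interest) {
  # the tag must sit at the start of the string or right after a ';',
  # so "gene_id" cannot match inside a longer tag name
  pattern <- paste0("(^|;[[:space:]]*)", att_of_interest,
                    "[[:space:]]+\"?([^\";]*)\"?")
  hits <- regmatches(gtf_attributes, regexec(pattern, gtf_attributes))
  # element 3 of each hit is the second capture group, i.e. the bare value;
  # strings without the tag yield character(0) and become NA
  vapply(hits,
         function(x) if (length(x) >= 3) x[3] else NA_character_,
         character(1))
}

# Invented attribute strings for illustration
attrs <- c('gene_id "g1"; gene_name "abc";',
           'gene_id "g2"; exon_number 3;')

extract_attribute_vec(attrs, "gene_id")
extract_attribute_vec(attrs, "gene_name")
```

With the data.table from the earlier answer, this could be applied as genes[, gene_id := extract_attribute_vec(attributes, "gene_id")].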
Hi,
If you want to read the GTF with R you first need to transform it into a table in which the attribute names will be used as header. You can use gtftk tabulate:
gtftk get_example | gtftk tabulate --key '*' --accept-undef -o example.tsv
Then you can simply load this TSV file into R using read.table():
d <- read.table("example.tsv", header = TRUE, sep = "\t")
View(d)
Best
disclaimer: I'm the developer of pygtftk
I don't like how rtracklayer::import seems to be finicky about GTF format, so here's a solution that uses base R v.4.0.0 and tidyverse v.1.3.1 to read in a GTF file as a tibble. It's a little slow, but it properly parses the attribute column in cases where the number and types of attributes are inconsistent between features. It also handles the annoying cases when the attribute field separator (;) is found in quoted attribute strings (as in this gtf file for Aeropyrum pernix).
Big thanks to akrun on Stack Overflow for figuring out the regex.
Since OP asked for a data frame output ("I would like to transform my GTF file into a data frame to facilitate my analysis"), note that the tibble result can be readily transformed to a data frame with base::as.data.frame().
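The code for the answer above is not reproduced in this thread, so the following is not the poster's tidyverse solution. It is only a base R sketch of the same idea, under stated assumptions: each attribute is a tag followed by either a double-quoted value (which may contain ;) or an unquoted token, and when a tag is repeated within one feature only its first value is kept. parse_attributes is a hypothetical name and the example strings are invented.

```r
# Hypothetical helper: turn a vector of GTF attribute strings into a
# data frame with one column per tag, filling missing tags with NA.
parse_attributes <- function(attr_strings) {
  # tag, whitespace, then a quoted value (may contain ';') or a bare token
  pattern <- '([[:alnum:]_.]+)[[:space:]]+("[^"]*"|[^;"[:space:]]+)'
  parsed <- lapply(attr_strings, function(s) {
    pairs <- regmatches(s, gregexpr(pattern, s))[[1]]
    keys <- sub(pattern, "\\1", pairs)
    vals <- gsub('^"|"$', "", sub(pattern, "\\2", pairs))
    keep <- !duplicated(keys)          # first value wins for repeated tags
    setNames(vals[keep], keys[keep])
  })
  # union of all tags seen, in order of first appearance
  all_keys <- unique(unlist(lapply(parsed, names)))
  cols <- lapply(all_keys, function(k) {
    vapply(parsed,
           function(p) if (k %in% names(p)) p[[k]] else NA_character_,
           character(1))
  })
  as.data.frame(setNames(cols, all_keys), stringsAsFactors = FALSE)
}

# Invented examples: a ';' inside a quoted value, and differing tag sets
attrs <- c('gene_id "g1"; product "a; b";',
           'gene_id "g2"; exon_number 1;')
attr_df <- parse_attributes(attrs)
attr_df
```

The result could be bound to the first eight GTF columns with cbind(), since it has one row per input string.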
Have you tried anything? If so, show it. Or, find a tutorial, try some code and come back with any errors and edit the OP with code and said errors.
I have got this working:
Bash:
R:
and it returns a well-formatted GRanges object.
However, in R, I cannot import it with read.table():
Isn't GTF just a tsv file?
Hi,
Yes, with some header lines at the top starting with '#', that should be ignored by default by read.table(). See my comment above to look at the problem I am having importing the GTF as a data frame into R.