Transform a GTF file into a data frame in R
7.2 years ago
biomagician ▴ 410

Hi,

I would like to analyse the content of a GTF file. I am comfortable with R and dplyr, so I would like to convert my GTF file into a data frame to make the analysis easier. Does anybody know of a tool that does this?

Thanks. Best, C.

R RNA-Seq GTF • 42k views

I am quite able with R

Have you tried anything? If so, show it. Or, find a tutorial, try some code and come back with any errors and edit the OP with code and said errors.


I have got this working:

Bash:

head celegans.gtf
#!genome-build WBcel235
#!genome-version WBcel235
#!genome-date 2012-12
#!genome-build-accession NCBI:GCA_000002985.3
#!genebuild-last-updated 2014-10
V   WormBase    gene    180 329 .   +   .   gene_id "WBGene00197333"; gene_name "cTel3X.2"; gene_source "WormBase"; gene_biotype "ncRNA";
V   WormBase    transcript  180 329 .   +   .   gene_id "WBGene00197333"; transcript_id "cTel3X.2"; gene_name "cTel3X.2"; gene_source "WormBase"; gene_biotype "ncRNA"; transcript_name "cTel3X.2"; transcript_source "WormBase"; transcript_biotype "ncRNA";

R:

gtf <- rtracklayer::import('celegans.gtf')

and it returns a well-formatted GRanges object.

However, in R, I cannot import it with read.table():

gtf2 <- read.table('celegans.gtf', header = FALSE)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 38 elements

Isn't GTF just a tsv file?


Hi,

Yes, with some header lines at the top that start with '#', which read.table() should ignore by default. See my comment above for the problem I am having when importing the GTF as a data frame into R.

7.2 years ago

Try gtf_df <- as.data.frame(gtf) after importing the file with the import() function from rtracklayer.

The code would be:

gtf <- rtracklayer::import('celegans.gtf')
gtf_df <- as.data.frame(gtf)

Can you turn your comment into a formal answer, by clicking on "moderate" and selecting the appropriate option, so that cristian can "accept" the answer? This will make the information clearer for future readers!


This is the best answer, thanks!


This worked for me thanks! Really helpful :)


Please note that a GTF file is hierarchical: the 3rd column (the type column) tells you whether a row describes a gene, a transcript, an exon and so on, so you might need to subset/filter on this column to extract what you need.
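
To make the filtering concrete, here is a minimal base R sketch. The toy data frame is only a stand-in for the result of as.data.frame() on the imported GTF: its values come from the two example lines earlier in the thread, and the column names (seqnames, type, gene_id, ...) are an assumption about what rtracklayer produces, so check names(gtf_df) on your own data.

```r
# Toy stand-in for as.data.frame(rtracklayer::import("celegans.gtf")).
# Column names are assumed; values are taken from the example lines in this thread.
gtf_df <- data.frame(
  seqnames      = c("V", "V"),
  start         = c(180L, 180L),
  end           = c(329L, 329L),
  strand        = c("+", "+"),
  type          = c("gene", "transcript"),
  gene_id       = c("WBGene00197333", "WBGene00197333"),
  transcript_id = c(NA, "cTel3X.2"),
  stringsAsFactors = FALSE
)

# See which feature types are present before filtering
type_counts <- table(gtf_df$type)

# Keep one level of the hierarchy at a time
genes_only       <- gtf_df[gtf_df$type == "gene", ]
transcripts_only <- subset(gtf_df, type == "transcript")
```

Gene-level rows have no transcript_id, which is why that column is NA after filtering on type == "gene".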


Still works almost five years later...

7.2 years ago
biomagician ▴ 410

Hi,

This worked:

gtf2 <- read.table('celegans.gtf', header = FALSE, sep = '\t')

gff <- read.delim('wormbase.gff3', header = FALSE, sep = '\t', skip = 8)

I forgot to specify the tab delimiter in the read.table() function. I thought that it was the default but it isn't.
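
For anyone hitting the same error: read.table() splits on any whitespace by default, so the spaces inside the 9th column give each line a different number of fields. Below is a small self-contained sketch of the tab-delimited import. The temporary file and the column names are my own choices for illustration (they are not part of the original files), and quote = "" is set so that the double quotes in the attribute column are kept as plain text.

```r
# Write a tiny GTF (one header line plus two records from the example above)
# to a temporary file so the sketch runs on its own.
gtf_lines <- c(
  "#!genome-build WBcel235",
  paste("V", "WormBase", "gene", "180", "329", ".", "+", ".",
        'gene_id "WBGene00197333"; gene_name "cTel3X.2"; gene_source "WormBase"; gene_biotype "ncRNA";',
        sep = "\t"),
  paste("V", "WormBase", "transcript", "180", "329", ".", "+", ".",
        'gene_id "WBGene00197333"; transcript_id "cTel3X.2"; gene_name "cTel3X.2";',
        sep = "\t")
)
tmp <- tempfile(fileext = ".gtf")
writeLines(gtf_lines, tmp)

# Illustrative names for the nine GTF columns (assumed, not read from the file)
gtf_cols <- c("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attribute")

# read.delim() does not skip '#' lines unless comment.char is set,
# which is an alternative to counting header lines for skip =
gtf_tab <- read.delim(tmp, header = FALSE, sep = "\t",
                      comment.char = "#", quote = "",
                      col.names = gtf_cols, stringsAsFactors = FALSE)
```

One caveat: with comment.char = "#", a '#' inside an attribute value would truncate that line, so skip = may be the safer choice for files where that can happen.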

cpad0112's answer is much better, though: with that approach, each piece of meta information in the 9th column gets its own column, whereas with mine it all stays in a single column.

Best, C.


I have no idea about the performance of rtracklayer::import; most likely it's pretty well optimized. But in case you want to forgo that package, here's how I did it for educational purposes (getting to know the GTF format):

library(data.table)
genes <- fread("gencode.basic.gtf")
setnames(genes, names(genes), c("chr","source","type","start","end","score","strand","phase","attributes") )

# [optional] focus, for example, only on entries of type "gene",
# which will drastically reduce the size of the table
genes <- genes[type == "gene"]

# the problem is the attributes column, which packs together the
# bits of information you're actually interested in.
# To pull out just the value I want based on its tag name,
# e.g. "gene_id", I have the following function:
extract_attributes <- function(gtf_attributes, att_of_interest){
  att <- strsplit(gtf_attributes, "; ")
  att <- gsub("\"","",unlist(att))
  if(!is.null(unlist(strsplit(att[grep(att_of_interest, att)], " ")))){
    return( unlist(strsplit(att[grep(att_of_interest, att)], " "))[2])
  }else{
    return(NA)}
}

# this is how to, for example, extract the values for the attributes of interest (here: "gene_id")
genes$gene_id <- unlist(lapply(genes$attributes, extract_attributes, "gene_id"))

Thanks Friederike, great answer, that was very useful. When I tried to extract the last field of the attributes it still had ; appended. To solve that and speed up the function a bit, I adjusted it as follows:

extract_attributes <- function(gtf_attributes, att_of_interest){
  att <- unlist(strsplit(gtf_attributes, " "))
  if(att_of_interest %in% att){
    return(gsub("\"|;","", att[which(att %in% att_of_interest)+1]))
  } else {
    return(NA)}
}

This can be used exactly as above with unlist and lapply.
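
As a possible alternative to calling the function once per row, here is a vectorised base R sketch that works on the whole attributes column in one go. The function name extract_attribute and the regular expression are mine, not from the answers above, and it assumes the usual key "value"; layout with double-quoted values.

```r
# Hypothetical helper: pull the value of one key out of a vector of
# GTF attribute strings. Returns NA where the key is absent.
extract_attribute <- function(attributes, key) {
  # The key must start the string or follow a ';', so that e.g. "gene_id"
  # does not match inside a longer key that merely ends in "gene_id"
  pattern <- sprintf('(^|;)[[:space:]]*%s "([^"]*)"', key)
  hits <- regmatches(attributes, regexec(pattern, attributes))
  # Each hit is c(full match, group 1, group 2); group 2 is the value
  vapply(hits,
         function(h) if (length(h) == 3) h[3] else NA_character_,
         character(1))
}

# Attribute strings taken from the example GTF lines in this thread
attrs <- c(
  'gene_id "WBGene00197333"; gene_name "cTel3X.2"; gene_source "WormBase"; gene_biotype "ncRNA";',
  'gene_id "WBGene00197333"; transcript_id "cTel3X.2"; gene_name "cTel3X.2"; gene_source "WormBase"; gene_biotype "ncRNA";'
)

gene_ids       <- extract_attribute(attrs, "gene_id")
transcript_ids <- extract_attribute(attrs, "transcript_id")
biotypes       <- extract_attribute(attrs, "gene_biotype")
```

Because the value is captured between the quotes, the trailing ; never ends up in the result, and values containing spaces are kept whole. Unquoted values would need a different pattern.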

4.1 years ago
D. Puthier ▴ 350

Hi,

If you want to read the GTF into R with read.table() and get one column per attribute, you can first convert it into a table in which the attribute names are used as the header. You can use gtftk tabulate:

gtftk get_example | gtftk tabulate --key '*' --accept-undef  -o example.tsv

Then you can simply load this tsv file into R using read.table

d <- read.table("example.tsv", header = TRUE, sep = "\t")
View(d)

Best

disclaimer: I'm the developer of pygtftk

3.0 years ago
acvill ▴ 350

I don't like how rtracklayer::import seems to be finicky about gtf format, so here's a solution that uses base R v.4.0.0 and tidyverse v.1.3.1 to read in a gtf file as a tibble. It's a little slow, but it properly parses the attribute column in cases where the number and types of attributes are inconsistent between features. It also handles the annoying cases when the attribute field separator (;) is found in quoted attribute strings (as in this gtf file for Aeropyrum pernix).

Big thanks to akrun on Stack Overflow for figuring out the regex.

Since OP asked for a data frame output...

I would like to transform my GTF file into a data frame to facilitate my analysis

note that the tibble result can be readily transformed to a data frame with base::as.data.frame().
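
The linked code is not reproduced here, so the following is only a rough base R sketch of the general idea (it is not the gist from this answer and does not use the tidyverse): split each attribute string into key "value" pairs with a regular expression that treats everything between double quotes as the value, then line the pairs up into one column per key. The function name parse_attributes is hypothetical. Limitations I am aware of: unquoted values are ignored, and if a key is repeated within one row only its first value is kept.

```r
# Hypothetical helper: spread GTF attribute strings into one column per key.
parse_attributes <- function(attributes) {
  # key, one space, then a double-quoted value; a ';' inside the quotes
  # is part of the value and does not end the field
  pattern <- '([[:alnum:]_]+) "([^"]*)"'
  pairs <- regmatches(attributes, gregexpr(pattern, attributes))
  # Union of all keys seen in any row, in order of first appearance
  keys <- unique(unlist(lapply(pairs, function(p) sub(pattern, "\\1", p))))
  rows <- lapply(pairs, function(p) {
    vals <- setNames(sub(pattern, "\\2", p), sub(pattern, "\\1", p))
    unname(vals[keys])  # keys missing from this row become NA
  })
  mat <- do.call(rbind, rows)
  colnames(mat) <- keys
  as.data.frame(mat, stringsAsFactors = FALSE)
}

# Made-up attribute strings: the rows have different keys, and one
# quoted value contains a ';'
attrs <- c(
  'gene_id "g1"; gene_name "a;b";',
  'gene_id "g2"; transcript_id "t2";'
)
attr_df <- parse_attributes(attrs)
```

The result can be joined to the first eight columns with cbind(), since it has one row per input line in the same order.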


acvill : You can simply include a direct link to the gist. Biostar code understands these and will happily parse the link so the code shows in line.


Thanks for the tip! I've edited my answer
