extracting gencode 3utr and 5utr as part of R pipeline
0
0
Entering edit mode
18 months ago
Jalil Sharif ▴ 80

Hello,

I reviewed this link 3utr/5utr extraction from Gencode, I want to change and adapt the code as part of an R pipeline, where the genes$type field is replaced. I tried to adapt the code, so it encompassed all UTRs not just protein coding.

working_directory_path <- getwd()

library(GenomicRanges) # From Bioconductor.
library(rtracklayer)

Reference_genome_gtf_path <- file.path(working_directory_path, "gencode.v43.annotation.gtf")

# Load the GTF file containing genomic annotations
GTF <- import(Reference_genome_gtf_path)

# Convert the GTF object to a data frame
genes <- as.data.frame((GTF))


whichCodingTranscripts <- genes$type == "transcript"
allFeaturesTranscripts <- genes$transcript_id

strands <- genes$strand
proteinTranscriptsNames <- allFeaturesTranscripts[whichCodingTranscripts]
whichCDS <- genes$type == "CDS" 
transcriptsCDS <- genes[whichCDS, ]
transcriptsCDS <- split(GRanges(transcriptsCDS$seqnames, IRanges(transcriptsCDS$start, transcriptsCDS$end), transcriptsCDS$strand),
                        factor(allFeaturesTranscripts[whichCDS], levels = proteinTranscriptsNames))
firstCDS <- mapply(function(CDS, strand) {if(strand == '+') {CDS[1]} else {CDS[length(CDS)]}}, transcriptsCDS, strands)
lastCDS <- mapply(function(CDS, strand) { if (strand == '+') CDS[length(CDS)] else CDS[1] }, transcriptsCDS, strands)
whichUTR <- genes$type == "UTR" & allFeaturesTranscripts %in% proteinTranscriptsNames
transcriptsUTR <- genes[whichUTR, ]
transcriptsUTR <- split(GRanges(transcriptsUTR$seqnames, IRanges(transcriptsUTR$start, transcriptsUTR$end), transcriptsUTR$strand),
                        factor(allFeaturesTranscripts[whichUTR], levels = proteinTranscriptsNames))

transcriptsUTR5 <- mapply(function(UTR, CDS, strand) {        
  if (strand == '+') UTR[UTR < CDS[1]] else UTR[UTR > CDS[length(CDS)]]
}, transcriptsUTR, firstCDS, as.list(strands), SIMPLIFY = FALSE)

transcriptsUTR3 <- mapply(function(UTR, CDS, strand) {        
  if (strand == '+') UTR[UTR > CDS[length(CDS)]] else UTR[UTR < CDS[1]]
}, transcriptsUTR, lastCDS, as.list(strands), SIMPLIFY = FALSE)

In the end I want the results of transcriptsUTR5 and transcriptsUTR3 to replace the generic genes$type UTR value.

The code doesn't appear to work.

5utr 3utr • 389 views
ADD COMMENT

Login before adding your answer.

Traffic: 2748 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6