Entering edit mode
18 months ago
Jalil Sharif
▴
80
Hello,
I reviewed this link 3utr/5utr extraction from Gencode, I want to change and adapt the code as part of an R pipeline, where the genes$type field is replaced. I tried to adapt the code, so it encompassed all UTRs not just protein coding.
working_directory_path <- getwd()
library(GenomicRanges) # From Bioconductor.
library(rtracklayer)
Reference_genome_gtf_path <- file.path(working_directory_path, "gencode.v43.annotation.gtf")
# Load the GTF file containing genomic annotations
GTF <- import(Reference_genome_gtf_path)
# Convert the GTF object to a data frame
genes <- as.data.frame((GTF))
whichCodingTranscripts <- genes$type == "transcript"
allFeaturesTranscripts <- genes$transcript_id
strands <- genes$strand
proteinTranscriptsNames <- allFeaturesTranscripts[whichCodingTranscripts]
whichCDS <- genes$type == "CDS"
transcriptsCDS <- genes[whichCDS, ]
transcriptsCDS <- split(GRanges(transcriptsCDS$seqnames, IRanges(transcriptsCDS$start, transcriptsCDS$end), transcriptsCDS$strand),
factor(allFeaturesTranscripts[whichCDS], levels = proteinTranscriptsNames))
firstCDS <- mapply(function(CDS, strand) {if(strand == '+') {CDS[1]} else {CDS[length(CDS)]}}, transcriptsCDS, strands)
lastCDS <- mapply(function(CDS, strand) { if (strand == '+') CDS[length(CDS)] else CDS[1] }, transcriptsCDS, strands)
whichUTR <- genes$type == "UTR" & allFeaturesTranscripts %in% proteinTranscriptsNames
transcriptsUTR <- genes[whichUTR, ]
transcriptsUTR <- split(GRanges(transcriptsUTR$seqnames, IRanges(transcriptsUTR$start, transcriptsUTR$end), transcriptsUTR$strand),
factor(allFeaturesTranscripts[whichUTR], levels = proteinTranscriptsNames))
transcriptsUTR5 <- mapply(function(UTR, CDS, strand) {
if (strand == '+') UTR[UTR < CDS[1]] else UTR[UTR > CDS[length(CDS)]]
}, transcriptsUTR, firstCDS, as.list(strands), SIMPLIFY = FALSE)
transcriptsUTR3 <- mapply(function(UTR, CDS, strand) {
if (strand == '+') UTR[UTR > CDS[length(CDS)]] else UTR[UTR < CDS[1]]
}, transcriptsUTR, lastCDS, as.list(strands), SIMPLIFY = FALSE)
In the end I want the results of transcriptsUTR5 and transcriptsUTR3 to replace the generic genes$type UTR
value.
The code doesn't appear to work.