Additional information:
This is a R-function I found on biostars. However, the last two lines give me the output (see below). Everything works well, however, the last line gives this error:
Error in .pcompare_GenomicRanges(x, y) :
the 2 objects to compare have seqlevels in incompatible orders
This here is the open source code that seems to generate this error, unfortunately I don't understand why this error is produced.
However, I think downloading the file on the UCSC browser is much more convenient, if the output is the same.
# The gtf file was downloaded here: http://www.gencodegenes.org/releases/current.html, Release 25 (GRCh38.p7)
# Just transcripts
genes <- read.table("/path/to/file.gtf", sep = '\t', skip = 5, stringsAsFactors = FALSE)
whichCodingTranscripts <- genes[, 3] == "transcript" & grepl("transcript_type protein_coding", genes[, 9], fixed = TRUE)
proteinTranscripts <- genes[whichCodingTranscripts, ]
# + or -
strands <- proteinTranscripts[, 7]
# Split the entries in the ninth column after each ";"
allFeaturesTranscripts <- gsub("transcript_id ", '', sapply(strsplit(genes[, 9], "; "), '[', 2))
# Return a vector of all Transcript names
proteinTranscriptsNames <- allFeaturesTranscripts[whichCodingTranscripts]
# Both lines now will search for the protein coding part of all transcripts
whichCDS <- genes[, 3] == "CDS" & allFeaturesTranscripts %in% proteinTranscriptsNames
transcriptsCDS <- genes[whichCDS, ]
# contains information on all CDS per transcript
transcriptsCDS <- split(GRanges(transcriptsCDS[, 1], IRanges(transcriptsCDS[, 4], transcriptsCDS[, 5]), transcriptsCDS[, 7]),
factor(allFeaturesTranscripts[whichCDS], levels = proteinTranscriptsNames))
# Both functions search for the first and the last CDS on the transcript; pick either first or last CDS per transcript
##
firstCDS <- mapply(function(CDS, strand) {
if(strand == '+') {CDS[1]} else {CDS[length(CDS)]}}, transcriptsCDS, strands, SIMPLIFY = FALSE)
lastCDS <- mapply(function(CDS, strand) {
if(strand == '+') {CDS[length(CDS)]} else {CDS[1]}}, transcriptsCDS, strands, SIMPLIFY = FALSE)
firstCDS <- GRangesList(firstCDS)
lastCDS <- GRangesList(lastCDS)
# Now repeat the "workflow described above for UTRs
## The knowledge on the CTR is needed for
whichUTR <- genes[, 3] == "UTR" & allFeaturesTranscripts %in% proteinTranscriptsNames
transcriptsUTR <- genes[whichUTR, ]
transcriptsUTR <- split(GRanges(transcriptsUTR[, 1], IRanges(transcriptsUTR[, 4], transcriptsUTR[, 5]), transcriptsUTR[, 7]),
factor(allFeaturesTranscripts[whichUTR], levels = names(firstCDS)))
# Here are the problems
transcriptsUTR5 <- mapply(function(UTR, CDS, strand){
if(strand == '+') {UTR[UTR < CDS[1]]} else {UTR[UTR > CDS[length(CDS)]]}},
transcriptsUTR, firstCDS, as.list(strands), SIMPLIFY = FALSE)
transcriptsUTR3 <- mapply(function(UTR, CDS, strand){
if(strand == '+') UTR[UTR > CDS[length(CDS)]] else UTR[UTR < CDS[1]]},
transcriptsUTR, firstCDS, as.list(strands), SIMPLIFY = FALSE)
It's unclear what the desired output is of this analysis. Do you want to filter your bam file? Get the read IDs of those in the UTR? Count reads in UTRs?
I want to get the Read IDs of those in the UTR (two seperate files, one for 5' the other for 3'), but also count the reads. That map in each region.