I am trying to use derfinder but it requires I have my genome features in the TranscriptDB from the GenomicFeatures R package.
Normally one can use the inbuilt function to make a TranscriptDB from UNSC database however my organism, a plant, is not included in the UNSC database.
Consequently I am trying to use the makeTranscriptDbFromGFF()
function to make the TranscriptDB using a gtf file I created using tophat and cuffmerge on some RNA-Seq samples. However following the examples I can't get it to work.
Here is what I've got:
chrominfo <- data.frame(chrom = c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8"),
length= c(52991155, 45729672, 55515152, 56582383, 43630510, 35275713, 49172423,45569985),
is_circular= rep(FALSE, 8))
exons <- makeTranscriptDbFromGFF(file = "~/merged.gtf",
format = "gtf",
exonRankAttributeName="exon_number",
chrominfo=chrominfo)
Unfortunately this is the output:
extracting transcript information
Estimating transcript ranges.
Extracting gene IDs
Processing splicing information for gtf file.
Prepare the 'metadata' data frame ... metadata: OK
Error in .normargTranscripts(transcripts) :
values in 'transcripts$tx_strand' must be "+" or "-"
In addition: Warning message:
In if (is.na(chrominfo)) { :
the condition has length > 1 and only the first element will be used
What am I missing?
This is the top of my gtf file:
chr1 Cufflinks exon 6524 6620 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "Medtr1g004940"; oId "Medtr1g004940.1"; nearest_ref "Medtr1g004940.1"; class_code "="; tss_id "TSS1"; p_id "P1";
chr1 Cufflinks exon 7098 7366 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "Medtr1g004940"; oId "Medtr1g004940.1"; nearest_ref "Medtr1g004940.1"; class_code "="; tss_id "TSS1"; p_id "P1";
chr1 Cufflinks exon 14514 14556 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; gene_name "Medtr1g004950"; oId "Medtr1g004950.1"; nearest_ref "Medtr1g004950.1"; class_code "="; tss_id "TSS2"; p_id "P2";
chr1 Cufflinks exon 15503 15729 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; gene_name "Medtr1g004950"; oId "Medtr1g004950.1"; nearest_ref "Medtr1g004950.1"; class_code "="; tss_id "TSS2"; p_id "P2";
chr1 Cufflinks exon 16283 16326 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "1"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3";
chr1 Cufflinks exon 17061 17304 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "2"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3";
chr1 Cufflinks exon 18242 18382 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "3"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3";
Note that unlike Creat A Db By Maketranscriptdbfromgff Of Genomicfeatures In Bioconductor removing
exonRankAttributeName
does not help.I am pretty sure there are some lines with a dot i.e. '.' instead of a '+' or a '-'
Do something like
grep -P '\t\.\t\.\t' merged.gtf | less
and you will get those lines. If you want to remove them, then useand use
tmp.gtf
inmakeTranscriptDbFromGFF()
.