Question

Genomic Features - makeTranscriptDbFromGFF()

0

Entering edit mode

10.5 years ago

gtho123 ▴ 260

I am trying to use derfinder but it requires I have my genome features in the TranscriptDB from the GenomicFeatures R package.

Normally one can use the inbuilt function to make a TranscriptDB from UNSC database however my organism, a plant, is not included in the UNSC database.

Consequently I am trying to use the makeTranscriptDbFromGFF() function to make the TranscriptDB using a gtf file I created using tophat and cuffmerge on some RNA-Seq samples. However following the examples I can't get it to work.

Here is what I've got:

chrominfo <- data.frame(chrom = c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8"),
                        length= c(52991155, 45729672, 55515152, 56582383, 43630510, 35275713, 49172423,45569985),
                        is_circular= rep(FALSE, 8))

exons <- makeTranscriptDbFromGFF(file = "~/merged.gtf",
                             format = "gtf",
                             exonRankAttributeName="exon_number",
                             chrominfo=chrominfo)

Unfortunately this is the output:

extracting transcript information
Estimating transcript ranges.
Extracting gene IDs
Processing splicing information for gtf file.
Prepare the 'metadata' data frame ... metadata: OK
Error in .normargTranscripts(transcripts) :
  values in 'transcripts$tx_strand' must be "+" or "-"
In addition: Warning message:
In if (is.na(chrominfo)) { :
  the condition has length > 1 and only the first element will be used

What am I missing?

This is the top of my gtf file:

chr1    Cufflinks    exon    6524    6620    .    +    .    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "Medtr1g004940"; oId "Medtr1g004940.1"; nearest_ref "Medtr1g004940.1"; class_code "="; tss_id "TSS1"; p_id "P1";
chr1    Cufflinks    exon    7098    7366    .    +    .    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "Medtr1g004940"; oId "Medtr1g004940.1"; nearest_ref "Medtr1g004940.1"; class_code "="; tss_id "TSS1"; p_id "P1";
chr1    Cufflinks    exon    14514    14556    .    +    .    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; gene_name "Medtr1g004950"; oId "Medtr1g004950.1"; nearest_ref "Medtr1g004950.1"; class_code "="; tss_id "TSS2"; p_id "P2";
chr1    Cufflinks    exon    15503    15729    .    +    .    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; gene_name "Medtr1g004950"; oId "Medtr1g004950.1"; nearest_ref "Medtr1g004950.1"; class_code "="; tss_id "TSS2"; p_id "P2";
chr1    Cufflinks    exon    16283    16326    .    +    .    gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "1"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3";
chr1    Cufflinks    exon    17061    17304    .    +    .    gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "2"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3";
chr1    Cufflinks    exon    18242    18382    .    +    .    gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "3"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3";

genome software-error R RNA-Seq bioconductor • 7.8k views

ADD COMMENT • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by gtho123 ▴ 260

0

Entering edit mode

Note that unlike Creat A Db By Maketranscriptdbfromgff Of Genomicfeatures In Bioconductor removing exonRankAttributeName does not help.

ADD REPLY • link 10.5 years ago by gtho123 ▴ 260

0

Entering edit mode

I am pretty sure there are some lines with a dot i.e. '.' instead of a '+' or a '-'

Do something like grep -P '\t\.\t\.\t' merged.gtf | less and you will get those lines. If you want to remove them, then use

grep -v -P '\t\.\t\.\t' merged.gtf > tmp.gtf

and use tmp.gtf in makeTranscriptDbFromGFF().

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by komal.rathi ★ 4.1k

Ram · Answer 1 · 2014-06-16

1

Entering edit mode

10.5 years ago

Giovanni M Dall'Olio 28k

The error message is:

values in 'transcripts$tx_strand' must be "+" or "-"'.

So, there it must be a problem with how the strand is defined in your input file. Check if any line of your input file contains a character different from + and - in the 7th column.

ADD COMMENT • link 10.5 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Turns out there was one (weird that its only one) exon with a "." in the 7th column. I removed that row and re ran the code but it still did not work. This is what happened.

extracting transcript information
Estimating transcript ranges.
Extracting gene IDs
Processing splicing information for gtf file.
Prepare the 'metadata' data frame ... metadata: OK
Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom",  :
  all the values in 'transcripts$tx_chrom' must be present in 'chrominfo$chrom'
In addition: Warning messages:
1: In if is.na(chrominfo)) { :
  the condition has length > 1 and only the first element will be used
2: In .normargSplicings(splicings, transcripts_tx_id) :
  no CDS information for this TranscriptDb object

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by gtho123 ▴ 260

1

Entering edit mode

gtho123 Does 'chrominfo' contain all the chromosomes that are present in the GTF file? Can you find out how many unique chromosomes are there in your GTF file using:

cut -f1 merged.gtf | uniq

There are 8 chromosomes (chr1...chr8) in 'chrominfo'. This error means that there are more chromosomes in your GTF than there are in 'chrominfo'. Maybe the ones like chr*random... Remove those and then rerun your function.

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

Ahh, that was indeed the case my gtf file also has a lot of short scaffolds that are not so interesting at the moment. I removed them from the gtf file and got this:

Error in .parse_attrCol(attrCol, file, colnames) :
  Some attributes do not conform to 'tag value' format

I think this means that somehow my gtf file has got a formatting issue somewhere. I removed the scaffolds in R and then wrote the new gtf file using:

write.table(merged_condensed, file = "merged_condensed.gtf", sep = "\t", row.names = FALSE, col.names = FALSE)

How can I check my gtf file is formatted correctly?

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by gtho123 ▴ 260

0

Entering edit mode

Hmm, can you paste the first few lines of your GTF file? And can you tell me which version of rtracklayer are you using?

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

Here are the first few:

"chr1"    "Cufflinks"    "exon"    6524    6620    "."    "+"    "."    "gene_id XLOC_000001; transcript_id TCONS_00000001; exon_number 1; gene_name Medtr1g004940; oId Medtr1g004940.1; nearest_ref Medtr1g004940.1; class_code =; tss_id TSS1; p_id P1;"
"chr1"    "Cufflinks"    "exon"    7098    7366    "."    "+"    "."    "gene_id XLOC_000001; transcript_id TCONS_00000001; exon_number 2; gene_name Medtr1g004940; oId Medtr1g004940.1; nearest_ref Medtr1g004940.1; class_code =; tss_id TSS1; p_id P1;"
"chr1"    "Cufflinks"    "exon"    14514    14556    "."    "+"    "."    "gene_id XLOC_000002; transcript_id TCONS_00000002; exon_number 1; gene_name Medtr1g004950; oId Medtr1g004950.1; nearest_ref Medtr1g004950.1; class_code =; tss_id TSS2; p_id P2;"
"chr1"    "Cufflinks"    "exon"    15503    15729    "."    "+"    "."    "gene_id XLOC_000002; transcript_id TCONS_00000002; exon_number 2; gene_name Medtr1g004950; oId Medtr1g004950.1; nearest_ref Medtr1g004950.1; class_code =; tss_id TSS2; p_id P2;"
"chr1"    "Cufflinks"    "exon"    16283    16326    "."    "+"    "."    "gene_id XLOC_000003; transcript_id TCONS_00000003; exon_number 1; gene_name Medtr1g004960; oId Medtr1g004960.1; nearest_ref Medtr1g004960.1; class_code =; tss_id TSS3; p_id P3;"
"chr1"    "Cufflinks"    "exon"    17061    17304    "."    "+"    "."    "gene_id XLOC_000003; transcript_id TCONS_00000003; exon_number 2; gene_name Medtr1g004960; oId Medtr1g004960.1; nearest_ref Medtr1g004960.1; class_code =; tss_id TSS3; p_id P3;"
"chr1"    "Cufflinks"    "exon"    18242    18382    "."    "+"    "."    "gene_id XLOC_000003; transcript_id TCONS_00000003; exon_number 3; gene_name Medtr1g004960; oId Medtr1g004960.1; nearest_ref Medtr1g004960.1; class_code =; tss_id TSS3; p_id P3;"
"chr1"    "Cufflinks"    "exon"    31973    32344    "."    "+"    "."    "gene_id XLOC_000004; transcript_id TCONS_00000004; exon_number 1; gene_name Medtr1g004980; oId Medtr1g004980.1; nearest_ref Medtr1g004980.1; class_code =; tss_id TSS4; p_id P4;"
"chr1"    "Cufflinks"    "exon"    35910    36084    "."    "+"    "."    "gene_id XLOC_000005; transcript_id TCONS_00000006; exon_number 1; gene_name Medtr1g004990; oId Medtr1g004990.2; nearest_ref Medtr1g004990.2; class_code =; tss_id TSS5; p_id P5;"
"chr1"    "Cufflinks"    "exon"    36177    36217    "."    "+"    "."    "gene_id XLOC_000005; transcript_id TCONS_00000006; exon_number 2; gene_name Medtr1g004990; oId Medtr1g004990.2; nearest_ref Medtr1g004990.2; class_code =; tss_id TSS5; p_id P5;"
"chr1"    "Cufflinks"    "exon"    36788    36857    "."    "+"    "."    "gene_id XLOC_000005; transcript_id TCONS_00000006; exon_number 3; gene_name Medtr1g004990; oId Medtr1g004990.2; nearest_ref Medtr1g004990.2; class_code =; tss_id TSS5; p_id P5;"
"chr1"    "Cufflinks"    "exon"    36968    37116    "."    "+"    "."    "gene_id XLOC_000005; transcript_id TCONS_00000006; exon_number 4; gene_name Medtr1g004990; oId Medtr1g004990.2; nearest_ref Medtr1g004990.2; class_code =; tss_id TSS5; p_id P5;"
"chr1"    "Cufflinks"    "exon"    37379    37474    "."    "+"    "."    "gene_id XLOC_000005; transcript_id TCONS_00000006; exon_number 5; gene_name Medtr1g004990; oId Medtr1g004990.2; nearest_ref Medtr1g004990.2; class_code =; tss_id TSS5; p_id P5;"
"chr1"    "Cufflinks"    "exon"    37602    37672    "."    "+"    "."    "gene_id XLOC_000005; transcript_id TCONS_00000006; exon_number 6; gene_name Medtr1g004990; oId Medtr1g004990.2; nearest_ref Medtr1g004990.2; class_code =; tss_id TSS5; p_id P5;"

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by gtho123 ▴ 260

0

Entering edit mode

First of all, why are you using R to remove the unwanted chromosomes? It is messing up with your file. You see the ideal file looks like this:

chr1    Cufflinks    exon    29554    30039    .    +    .    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "MIR1302-11"; oId "ENST00000473358.1"; nearest_ref "ENST00000473358.1"; class_code "="; tss_id "TSS1";
chr1    Cufflinks    exon    30564    30667    .    +    .    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "MIR1302-11"; oId "ENST00000473358.1"; nearest_ref "ENST00000473358.1"; class_code "="; tss_id "TSS1";
chr1    Cufflinks    exon    30976    31097    .    +    .    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; gene_name "MIR1302-11"; oId "ENST00000473358.1"; nearest_ref "ENST00000473358.1"; class_code "="; tss_id "TSS1";

You have unwanted quotes in some columns, and in some columns the quotes are not present where they are required. I don't know how you removed your unwanted chromosomes, but I would suggest, do this:

grep -v 'chr.*random' merged.gtf > tmp.gtf

Use grep and not R for removing your unwanted chromosomes.

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by komal.rathi ★ 4.1k

Ram · Answer 2 · 2014-06-16

1

Entering edit mode

10.5 years ago

komal.rathi ★ 4.1k

I am pretty sure there are some lines with a dot i.e. '.' instead of a '+' or a '-'

Do something like grep -P '\t\.\t\.\t' merged.gtf | less and you will get those lines. If you want to remove them, then use

grep -v -P '\t\.\t\.\t' merged.gtf > tmp.gtf

and use tmp.gtf in makeTranscriptDbFromGFF().

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

an alternative way to check which lines do not contain + or -: awk '$1 !~ /[+-]/' file.gff.

ADD REPLY • link 10.5 years ago by Giovanni M Dall'Olio 28k

1

Entering edit mode

Thanks for the input. And I think it should be awk '$7 !~ /[+-]/' file.gff ($7 instead of $1) because that's the column for strand :)

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

yes, you are right :-)

ADD REPLY • link 10.5 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Thanks, I gave this a go, but see my comment to Giovanni.

ADD REPLY • link 10.5 years ago by gtho123 ▴ 260