Hi,
I have a question about the plastid Python package. In the tutorial, one of the data-preparation steps is to generate a "windows" file so that a metagene analysis can be run on ribosome profiling data. The "metagene generate" command asks for an annotation file in GTF2 format. As explained earlier in the tutorial, this file can be generated from within plastid with the following command:
reformat_transcripts --annotation_files yeast.gff --annotation_format GFF3 --sorted --output_format GTF2 yeast.gtf2
Then we use this plastid-generated GTF2 file as the annotation file for the metagene window-file generation command:
metagene generate --annotation_files yeast.gtf2 --sorted --mask_annotation_files yeast.bb --mask_annotation_format BigBed --downstream yeast_windows
The file is created, but I get the following warning:
DataWarning
All maximal spanning windows lack flanks upstream of reference landmark. This
occurs e.g. for start codons when annotation files don't contain UTR data.
Please check your annotation file.
in /path/to/pyscript/metagene.py, line 709:
707 if (df["alignment_offset"] == flank_upstream).all():
708 warnings.warn("All maximal spanning windows lack flanks upstream of reference landmark. This occurs e.g. for start codons when annotation files don't contain UTR data. Please check your annotation file.",
709 DataWarning)
710
711 # N.b. This warning will only be invoked for zero-length landmarks
It says that my annotation files "don't contain UTR data".
The .gff file comes from the .zip file of yeastgenome's latest genome release (31-Jan-2015 15:11). The generated .gtf2 file contains the following feature types (inspected with R):
> handleGTF <- import("saccharomyces_cerevisiae_R64-2-1_20150113.gtf2","gtf")
> levels(handleGTF$type)
[1] "exon" "CDS" "start_codon" "stop_codon"
But when I inspect the levels of the original .gff file, I get the following:
> handleGFF <- import("saccharomyces_cerevisiae_R64-2-1_20150113.gff","gff")
> levels(handleGFF$type)
[1] "chromosome" "telomere"
[3] "X_element" "X_element_combinatorial_repeat"
[5] "telomeric_repeat" "gene"
[7] "CDS" "mRNA"
[9] "ARS" "long_terminal_repeat"
[11] "region" "ARS_consensus_sequence"
[13] "intron" "ncRNA_gene"
[15] "noncoding_exon" "tRNA_gene"
[17] "snoRNA_gene" "centromere"
[19] "centromere_DNA_Element_I" "centromere_DNA_Element_II"
[21] "centromere_DNA_Element_III" "LTR_retrotransposon"
[23] "transposable_element_gene" "pseudogene"
[25] "Y_prime_element" "plus_1_translational_frameshift"
[27] "five_prime_UTR_intron" "telomerase_RNA_gene"
[29] "matrix_attachment_site" "snRNA_gene"
[31] "silent_mating_type_cassette_array" "W_region"
[33] "X_region" "Y_region"
[35] "Z1_region" "Z2_region"
[37] "mating_type_region" "intein_encoding_region"
[39] "blocked_reading_frame" "rRNA_gene"
[41] "external_transcribed_spacer_region" "internal_transcribed_spacer_region"
[43] "non_transcribed_region" "origin_of_replication"
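The same feature-type check can also be done without R. Below is a minimal sketch in Python, assuming a standard tab-separated, nine-column GFF3/GTF2 file. The function names `count_feature_types` and `has_utr_features`, the set of UTR feature names, and the sample records are all illustrative assumptions, not part of plastid.

```python
import io
from collections import Counter

# Assumption: feature-type names commonly used for UTRs in GFF3/GTF files.
UTR_TYPES = {"five_prime_UTR", "three_prime_UTR", "5UTR", "3UTR", "UTR"}


def count_feature_types(lines):
    """Count the values found in column 3 (feature type) of GFF3/GTF2 lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # GFF3 files may embed sequence data after this directive
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and other directives
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a well-formed feature line
        counts[fields[2]] += 1
    return counts


def has_utr_features(counts):
    """Return True if any feature type is an explicit UTR feature."""
    return any(ftype in UTR_TYPES for ftype in counts)


# Illustrative records only; real files are much larger.
SAMPLE = (
    "##gff-version 3\n"
    "chrI\tSGD\tgene\t335\t649\t.\t+\t.\tID=YAL069W\n"
    "chrI\tSGD\tCDS\t335\t649\t.\t+\t0\tParent=YAL069W_mRNA\n"
    "chrI\tSGD\tmRNA\t335\t649\t.\t+\t.\tID=YAL069W_mRNA\n"
    "##FASTA\n"
    ">chrI\n"
    "ACGT\n"
)

counts = count_feature_types(io.StringIO(SAMPLE))
print(sorted(counts))
print("explicit UTR features present:", has_utr_features(counts))
```

To run it on a real file, pass an open file handle instead of the sample, e.g. `count_feature_types(open("saccharomyces_cerevisiae_R64-2-1_20150113.gff"))`. Note that "five_prime_UTR_intron" in the list above is deliberately not counted as a UTR feature here, since it describes an intron rather than the UTR itself.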
Why can't plastid see the UTR regions? Is my original GFF lacking this information, or do I have to add the UTR regions to the GTF2 file myself?
If anyone has experience with the plastid package, I would be glad to receive any helpful information or suggestions.
Thank you
Answering my own post: this is a known issue. See this GitHub issue.