How to convert ncbi gff file to ensembl gff format
0
0
Entering edit mode
2.2 years ago
yoser4 ▴ 10

Hello all,

I found 14 snp sites that I'm interested in that fall in the CDS region (I have a vcf file containing snp site information), and I'm trying to use bcftools csq to analyze the effect of gene mutations on protein levels.

The code is as follows:

bcftools csq -f csq.fa -g csq.gff3 csq.vcf > csq.out

But here I have a problem. The reference gene (fa) and annotation file (gtf) I used are from ncbi, so I downloaded the gff file from ncbi, and bcftools csq reads the gff3 file of ensemble type by default, and an error is generated here:

Parsing GCF_016772045.1_ARS-UI_Ramb_v2.0_genomic.gff ...
ignored: chr1   RefSeq  region  1       278617202       .       +       .       ID=NC_056054.1:1..278617202;Dbxref=taxon:9940;Name=1;breed=Rambouillet;chromosome=1;collection-date=2015-03-01;dev-stage=adult;gbkey=Src;genome=chromosome;mol_type=genomic DNA;sex=female;strain=OAR_USU_Benz2616;tissue-type=Blood%2C lung%2C liver
ignored: chr1   Gnomon  pseudogene      42249   46639   .       -       .       ID=gene-LOC114110831;Dbxref=GeneID:114110831;Name=LOC114110831;gbkey=Gene;gene=LOC114110831;gene_biotype=pseudogene;pseudo=true
[csq.c:716 gff_id_parse] Could not parse the line, "Parent=transcript:" not present: chr1       Gnomon  exon    42249   43660   .       -       .       ID=id-LOC114110831;Parent=gene-LOC114110831;Dbxref=GeneID:114110831;gbkey=exon;gene=LOC114110831;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 16%25 coverage of the annotated genomic feature by RNAseq alignments;pseudo=true

I guess it is because the gff of ncbi and the gff of ensenbl are in different formats. Here I would like to ask:

  1. How should I modify my gff file from ncbi to make it into ensemble format
  2. Or, does anyone know how ensemble generates the gff3 file, I can make it myself
  3. Or, does anyone know if there are other methods to analyze the effect of gene mutation on protein level
bcftools gff • 3.4k views
ADD COMMENT
0
Entering edit mode

Could you display the line containing ID=id-LOC114110831;Parent=gene-LOC114110831 and the surrounding lines?

ADD REPLY
0
Entering edit mode

Are you talking about the content of the file? it looks like this

ADD REPLY
0
Entering edit mode

enter image description here

ADD REPLY
0
Entering edit mode
NC_056054.1     RefSeq  region  1       278617202       .       +       .       ID=NC_056054.1:1..278617202;Dbxref=taxon:9940;Name=1;breed=Rambouillet;chromosome=1;collection-date=2015-03-01;dev-stage=adult;gbkey=Src;genome=chromosome;mol_type=genomic DNA;sex=female;strain=OAR_USU_Benz2616;tissue-type=Blood%2C lung%2C liver
NC_056054.1     Gnomon  pseudogene      42249   46639   .       -       .       ID=gene-LOC114110831;Dbxref=GeneID:114110831;Name=LOC114110831;gbkey=Gene;gene=LOC114110831;gene_biotype=pseudogene;pseudo=true
NC_056054.1     Gnomon  exon    42249   43660   .       -       .       ID=id-LOC114110831;Parent=gene-LOC114110831;Dbxref=GeneID:114110831;gbkey=exon;gene=LOC114110831;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 16%25 coverage of the annotated genomic feature by RNAseq alignments;pseudo=true
NC_056054.1     Gnomon  exon    43959   44085   .       -       .       ID=id-LOC114110831-2;Parent=gene-LOC114110831;Dbxref=GeneID:114110831;gbkey=exon;gene=LOC114110831;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 16%25 coverage of the annotated genomic feature by RNAseq alignments;pseudo=true
NC_056054.1     Gnomon  exon    46503   46639   .       -       .       ID=id-LOC114110831-3;Parent=gene-LOC114110831;Dbxref=GeneID:114110831;gbkey=exon;gene=LOC114110831;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 16%25 coverage of the annotated genomic feature by RNAseq alignments;pseudo=true
NC_056054.1     Gnomon  gene    46755   48356   .       -       .       ID=gene-LOC114112203;Dbxref=GeneID:114112203;Name=LOC114112203;gbkey=Gene;gene=LOC114112203;gene_biotype=protein_coding
ADD REPLY
0
Entering edit mode

From the manual:



    GFF3 annotation file (required), such as ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens. An example of a minimal working GFF file:

    # The program looks for "CDS", "exon", "three_prime_UTR" and "five_prime_UTR" lines,
    # looks up their parent transcript (determined from the "Parent=transcript:" attribute),
    # the gene (determined from the transcript's "Parent=gene:" attribute), and the biotype
    # (the most interesting is "protein_coding").
    #
    # Attributes required for
    #   gene lines:
    #   - ID=gene:<gene_id>
    #   - biotype=<biotype>
    #   - Name=<gene_name>      [optional]
    #
    #   transcript lines:
    #   - ID=transcript:<transcript_id>
    #   - Parent=gene:<gene_id>
    #   - biotype=<biotype>
    #
    #   other lines (CDS, exon, five_prime_UTR, three_prime_UTR):
    #   - Parent=transcript:<transcript_id>
    #
    # Supported biotypes:
    #   - see the function gff_parse_biotype() in bcftools/csq.c

    1   ignored_field  gene            21  2148  .   -   .   ID=gene:GeneId;biotype=protein_coding;Name=GeneName
    1   ignored_field  transcript      21  2148  .   -   .   ID=transcript:TranscriptId;Parent=gene:GeneId;biotype=protein_coding
    1   ignored_field  three_prime_UTR 21  2054  .   -   .   Parent=transcript:TranscriptId
    1   ignored_field  exon            21  2148  .   -   .   Parent=transcript:TranscriptId
    1   ignored_field  CDS             21  2148  .   -   1   Parent=transcript:TranscriptId
    1   ignored_field  five_prime_UTR  210 2148  .   -   .   Parent=transcript:TranscriptId

So yes the easiest would be to shift to use the GFF from ensembl => http://www.ensembl.org/info/data/ftp/index.html/

ADD REPLY
0
Entering edit mode

Thank you, Juke. But the problem I encountered is that ensembl does not currently update the reference genome and index of sheep ramb2.0. Currently I can only download GFF files to v2.0 from ncbi. Because my project requires me to use version 2.0. This leaves me with no choice.

ADD REPLY
0
Entering edit mode

I think my only solution at the moment is to convert the format. I'm new to bioinformatics and I don't know what I should do now

ADD REPLY
0
Entering edit mode

You might find some help using AGAT. e.g. you could use agat_sp_separate_by_record_type.pl and then analyse only protein coding genes.

ADD REPLY
0
Entering edit mode

sorry for the late reply. I tried the method you mentioned. But there are a series of errors. I think it's still because the GFF file format of the ramb v2.0 version I'm using is so special that AGAT doesn't recognize it.

ADD REPLY
0
Entering edit mode

    • End parsing - *
  • done in 320 seconds *


    • Start checks - *

---------------------------- Check1: feature types ----------------------------- ----------------------------------- ontology ----------------------------------- INFO - Feature types not expected by the GFF3 specification:

  • lnc_rna
  • sequence_feature The feature type (3rd column in GFF3) is constrained to be either a term from th e Sequence Ontology or an SO accession number. The latter alternative is disting uished using the syntax SO:000000. In either case, it must be sequence_feature ( SO:0000110) or an is_a child of it.To follow rigorously the gff3 format, please visit this website: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md ------------------------------------- agat ------------------------------------- WARNING - Feature types not expected by AGAT:
  • origin_of_replication The feature of these types (3rd column in GFF3) are skipped by the parser! To take them into account you must update the feature json files. To access the json files run:
                      agat_convert_sp_gxf2gxf.pl --expose
    
    In which file to add my feature?
  • Feature level1 (e.g. gene, match, region): My feature has no parent => features_level1.json
  • Feature level2 (e.g. mrna, match_part, trna): My feature has one parent and children => features_level2.json.
  • Feature level3 (e.g. exon, intron, cds): My feature has one parent (the parent has also a parent) and no children => features_level3.json.
  • Feature level3 discontinuous (e.g. cds, utr): A single feature that exists over multiple genomic locations => features_spread.json. ------------------------------ done in 0 seconds -------------------------------

------------------------------ Check2: duplicates ------------------------------ None found ------------------------------ done in 0 seconds -------------------------------

-------------------------- Check3: sequential bucket --------------------------- None found ------------------------------ done in 0 seconds -------------------------------

--------------------------- Check4: l2 linked to l3 ---------------------------- 3627 cases fixed where L3 features have parent feature(s) missing ------------------------------ done in 1 seconds -------------------------------

--------------------------- Check5: l1 linked to l2 ---------------------------- 119 cases fixed where L2 features have parent features missing ------------------------------ done in 0 seconds -------------------------------

--------------------------- Check6: remove orphan l1 --------------------------- We remove only those not supposed to be orphan 5914 cases removed where L1 features do not have children (while they are suposed to have children). ------------------------------ done in 1 seconds -------------------------------

------------------------- Check7: all level3 locations ------------------------- ------------------------------ done in 23 seconds ------------------------------

------------------------------ Check8: check cds ------------------------------- No problem found ------------------------------ done in 0 seconds -------------------------------

----------------------------- Check9: check exons ------------------------------ 13 exons created that were missing No exons locations modified No supernumerary exons removed No level2 locations modified ------------------------------ done in 36 seconds ------------------------------

----------------------------- Check10: check utrs ------------------------------ 166991 UTRs created that were missing No UTRs locations modified No supernumerary UTRs removed ------------------------------ done in 30 seconds ------------------------------

------------------------ Check11: all level2 locations ------------------------- No problem found ------------------------------ done in 22 seconds ------------------------------

------------------------ Check12: all level1 locations ------------------------- We fixed 49 wrong level1 location cases ------------------------------ done in 1 seconds -------------------------------

---------------------- Check13: remove identical isoforms ---------------------- None found ------------------------------ done in 85 seconds ------------------------------


    • End checks - *
  • done in 199 seconds *

=> OmniscientI total time: 519 seconds GFF3 file parsed Job done in 609 seconds

ADD REPLY
0
Entering edit mode

There are two main types of errors (similar to them appear so many times that I can't list them all): WARNING level1: This feature level1 is not a duplicate but has an ID already used (original id: 5772b24a-4ae1-4a4e-9b96-a403ef75d22a). AGAT does not deal with that currently. @ the feature is: NC_056054.1 RefSeq cDNA_match 147143 148998 . - . Gap "M675 I1 M152 D1 M48 I2 M694 D1 M285" ; ID "nbis-cdna_match-1" ; Target "XM_042239135.1 1456 3312 +" ; for_remapping 2 ; gap_count 4 ; num_ident 3309 ; num_mismatch 0 ; pct_coverage "99.9094" ; pct_coverage_hiqual "99.9094" ; pct_identity_gap "99.8491" ; pct_identity_ungap 100 ; rank 1 Indeed we changed the ID for this feature l1 to be uniq but we do not change the parent attribute for the subfeatures because we do not know to which L1 they are really linked to. As it is now we will probably end up with chimeric records.

ADD REPLY
0
Entering edit mode

3 warning messages: WARNING level1: This feature level1 is not a duplicate but has an ID already used (original id: f4c81044-5a57-42df-ae1f-5deecdf5a598). AGAT does not deal with that currently.

ADD REPLY
0
Entering edit mode

Sounds fine as long as it touch only features you do not care e.g cDNA_match.

ADD REPLY
0
Entering edit mode

c_gene_segment.gff
lnc_rna.gff
mrna.gff
region.gff
rrna.gff
snorna.gff
transcript.gff
v_gene_segment.gff guide_rna.gff
mirna.gff
primary_transcript.gff rna.gff
sequence_feature.gff
snrna.gff
trna.gff

Thanks. Currently I get the following output file. But I didn't find an explanation on what these output files mean on the AGAT website. So that I don't know what some output files represent, what output files can I use? Where can I see an explanation about these output files? I think I'm more interested in CDS

ADD REPLY
0
Entering edit mode

Those are the different type of features described in your original file. Only mrna.gff and maybe transcript.gff should contain CDS features.

ADD REPLY
0
Entering edit mode

thank you , Juke

ADD REPLY

Login before adding your answer.

Traffic: 1582 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6