GFF file format conversion
2.3 years ago
yoser4 ▴ 10

Hello everyone. I now have GFF files downloaded from NCBI in the following format:

NC_056054.1     RefSeq  region  1       278617202       .       +       .       ID=NC_056054.1:1..278617202;Dbxref=taxon:9940;Name=1;breed=Rambouillet;chromos
NC_056054.1     Gnomon  pseudogene      42249   46639   .       -       .       ID=gene-LOC114110831;Dbxref=GeneID:114110831;Name=LOC114110831;gbkey=Gene;gene
NC_056054.1     Gnomon  exon    42249   43660   .       -       .       ID=id-LOC114110831;Parent=gene-LOC114110831;Dbxref=GeneID:114110831;gbkey=exon;gene=LO
NC_056054.1     Gnomon  exon    43959   44085   .       -       .       ID=id-LOC114110831-2;Parent=gene-LOC114110831;Dbxref=GeneID:114110831;gbkey=exon;gene=
NC_056054.1     Gnomon  exon    46503   46639   .       -       .       ID=id-LOC114110831-3;Parent=gene-LOC114110831;Dbxref=GeneID:114110831;gbkey=exon;gene=
NC_056054.1     Gnomon  gene    46755   48356   .       -       .       ID=gene-LOC114112203;Dbxref=GeneID:114112203;Name=LOC114112203;gbkey=Gene;gene=LOC1141
NC_056054.1     Gnomon  mRNA    46755   48356   .       -       .       ID=rna-XM_027963747.2;Parent=gene-LOC114112203;Dbxref=GeneID:114112203,Genbank:XM_0279
NC_056054.1     Gnomon  exon    46755   48356   .       -       .       ID=exon-XM_027963747.2-1;Parent=rna-XM_027963747.2;Dbxref=GeneID:114112203,Genbank:XM_
NC_056054.1     Gnomon  CDS     46755   48356   .       -       0       ID=cds-XP_027819548.2;Parent=rna-XM_027963747.2;Dbxref=GeneID:114112203,Genbank:XP_027

I want to convert this file to Ensembl format. Here is another version of the GFF, downloaded from Ensembl, in the following format:

1       ensembl gene    87434   89380   .       +       .       ID=gene:ENSOARG00020000042;Name=FAM240C;biotype=protein_coding;description=family with sequenc
1       ensembl mRNA    87434   89380   .       +       .       ID=transcript:ENSOART00020000042;Parent=gene:ENSOARG00020000042;Name=FAM240C-201;biotype=prote
1       ensembl exon    87434   87579   .       +       .       Parent=transcript:ENSOART00020000042;Name=ENSOARE00020000042;constitutive=1;ensembl_end_phase=
1       ensembl CDS     87434   87579   .       +       0       ID=CDS:ENSOARP00020000015;Parent=transcript:ENSOART00020000042;protein_id=ENSOARP00020000015
1       ensembl exon    89251   89305   .       +       .       Parent=transcript:ENSOART00020000042;Name=ENSOARE00020000043;constitutive=1;ensembl_end_phase=
1       ensembl CDS     89251   89305   .       +       1       ID=CDS:ENSOARP00020000015;Parent=transcript:ENSOART00020000042;protein_id=ENSOARP00020000015
1       ensembl exon    89307   89326   .       +       .       Parent=transcript:ENSOART00020000042;Name=ENSOARE00020000044;constitutive=1;ensembl_end_phase=
1       ensembl CDS     89307   89326   .       +       0       ID=CDS:ENSOARP00020000015;Parent=transcript:ENSOART00020000042;protein_id=ENSOARP00020000015
1       ensembl exon    89329   89380   .       +       .       Parent=transcript:ENSOART00020000042;Name=ENSOARE00020000045;constitutive=1;ensembl_end_phase=
1       ensembl CDS     89329   89380   .       +       1       ID=CDS:ENSOARP00020000015;Parent=transcript:ENSOART00020000042;protein_id=ENSOARP00020000015

The Parent= attributes in particular are so different that my other analysis software cannot recognize them. Does anyone know of software or code that can do this conversion? I would greatly appreciate it.


They are not conceptually different. Both are GFF3 files, and in both a feature of type gene is the parent of an mRNA feature, which in turn is the parent of exon and CDS features. Before we invest a lot of time, we should look into the specifics of the format understood by the analysis software and the exact error message. Which software are you using? It might just be a small hiccup caused by a single attribute or character (e.g. : vs. -) in the identifiers.
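To see exactly where the identifiers differ, one option is to tabulate the ID and Parent prefixes per feature type in each file. The following is a minimal sketch, not an established tool: the function names are made up for illustration, and the prefix is assumed to be everything up to the first ":" or "-" in the identifier.

```python
import re
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column (column 9) into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def id_prefix(identifier):
    """Return the identifier up to and including its first ':' or '-' (assumed prefix)."""
    match = re.match(r"[^:\-]*[:\-]", identifier)
    return match.group(0) if match else ""


def summarize_prefixes(lines):
    """Count (feature type, attribute name, prefix) combinations for ID and Parent."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF3 feature line
        attrs = parse_attributes(cols[8])
        for key in ("ID", "Parent"):
            if key in attrs:
                counts[(cols[2], key, id_prefix(attrs[key]))] += 1
    return counts
```

Running `summarize_prefixes(open(path))` on the NCBI file and on the Ensembl file and comparing the two tables should show the differing prefixes (for example `rna-` versus `transcript:`) at a glance.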


Hello Michael, I am using bcftools csq. The command is as follows:

bcftools csq -f csq.fa -g csq.gff3 csq.vcf > csq.out

The error I'm getting is:

Could not parse the line, "Parent=transcript:" not present: chr1       Gnomon  exon    42249   43660   .       -       .       ID=id-LOC114110831;Parent=gene-LOC114110831;Dbxref=GeneID:114110831;gbkey=exon;gene=LOC114110831;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 16%25 coverage of the annotated genomic feature by RNAseq alignments;pseudo=true
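To find out how many records would trigger this message before running the tool, the file can be scanned for exon and CDS lines that lack the string `Parent=transcript:`. This is only a sketch based on the wording of the error above, not on the bcftools source; the function name and the choice of feature types are assumptions.

```python
def lines_missing_transcript_parent(lines, feature_types=("exon", "CDS")):
    """Yield (line_number, line) for features whose attributes lack 'Parent=transcript:'."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue  # skip directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] not in feature_types:
            continue  # only check the assumed feature types
        if "Parent=transcript:" not in cols[8]:
            yield number, line.rstrip("\n")
```

For example, `sum(1 for _ in lines_missing_transcript_parent(open("csq.gff3")))` counts the affected lines in the file used in the command above.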
2.3 years ago
Michael 55k

Interesting. It seems to interpret the Parent IDs: it expects Parent=transcript:, sees an NCBI-style prefix such as Parent=rna- instead, and bails out. Could you check the documentation for other implicit assumptions? In my opinion, software should not parse or interpret identifiers. Still, this should be easy to fix with a simple sed command such as sed 's/rna-/transcript:/' applied to the whole file, although other errors may come up after that.
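A slightly more careful variant of the sed idea rewrites the prefixes only inside the ID and Parent attributes of column 9, so that other text containing `rna-` or `gene-` is left alone. This is a sketch under assumptions: the mapping `gene-` to `gene:` and `rna-` to `transcript:` is inferred from the two example files in the question, and the names `PREFIX_MAP`, `convert_value` and `convert_line` are made up for illustration. The tool may expect further Ensembl-style attributes that this does not add.

```python
# Assumed mapping from NCBI-style prefixes to the Ensembl-style prefixes shown in the question.
PREFIX_MAP = {"gene-": "gene:", "rna-": "transcript:"}


def convert_value(value):
    """Rewrite the prefix of each comma-separated identifier in an attribute value."""
    parts = []
    for item in value.split(","):
        for old, new in PREFIX_MAP.items():
            if item.startswith(old):
                item = new + item[len(old):]
                break
        parts.append(item)
    return ",".join(parts)


def convert_line(line):
    """Return the GFF3 line with ID/Parent prefixes rewritten; other lines pass through."""
    if line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        return line
    fields = []
    for field in cols[8].split(";"):
        key, sep, value = field.partition("=")
        if sep and key in ("ID", "Parent"):
            field = key + "=" + convert_value(value)
        fields.append(field)
    cols[8] = ";".join(fields)
    return "\t".join(cols) + newline
```

Note that the exon in the error message has `Parent=gene-LOC114110831`, i.e. it hangs directly off a pseudogene. After this rewrite it would read `Parent=gene:LOC114110831`, which still does not contain `Parent=transcript:`, so such records may need to be filtered out or handled separately.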


Thank you for your guidance, Michael.

I tried the method you suggested, and the problem with reading the data was indeed caused by those characters. Replacing them resolved the error. However, the results do not seem to be what I want: I am more interested in calculating amino acid coding changes and predicting the coding results. I may have misunderstood the tool and chosen the wrong software.


So, in principle, this solved the problem of feeding a GenBank GFF file to bcftools? I will then leave this as an answer for reference. You may need to post another question about the task you wish to perform, and be more specific about it. Perhaps you want to use SnpEff? But then you still need to call variants.

2.3 years ago
Juke34 8.9k

You can try agat_sp_manage_IDs.pl from AGAT to manage the IDs on the mRNA subfeatures, which you can first extract with agat_sp_separate_by_record_type.pl.
