How to combine 2 incomplete GTF
2
7
Entering edit mode
3.6 years ago
Rafael Soler ★ 1.3k

Hello,

I currently have two GTFs which are incomplete. What I mean by this?. For example, in one of them I have 5000 genes and in another I have 7000, however, there are 3000 genes that coincide (they are the same). Therefore, I would like to create a GTF with the 5000 + 7000 genes, but without duplicating the 3000 identical genes, so I would have in the end a GTF with 9000 genes (not 12000).

What could I do? I think both Cuffmerge and Cuffcompare are useless in these cases.

All the best

Cufflinks Cuffcompare GTF Cuffmerge Merge • 6.3k views
ADD COMMENT
1
Entering edit mode

Compare annotation tool agat_sp_compare_annotations.pl from AGAT toolkit. Install AGAT using conda.

ADD REPLY
0
Entering edit mode

Thank you!! I will try it :)

ADD REPLY
4
Entering edit mode
3.6 years ago
Juke34 8.9k

To give more precision to the GenoMax comment:

agat_sp_compare_annotations.pl will tell you the differences.

Then you can choose to complement the annotation using agat_sp_complement_annotations.pl, that means you use one annotation as reference and you will pick from the second annotation only non-overlaping predictions and add them to the reference annotation (i.e. for overlapping loci only the gene prediction from the reference is kept)

Or to merge annotation using agat_sp_merge_annotations.pl, that mean you take non overlaping locus from both annotations and you merge the annotation when they are overlapping (overlapping genes will be merged as one locus and different mRNA will be seen as isoforms, duplicates mRNA will be removed to keep one)

ADD COMMENT
0
Entering edit mode

Thank yo so much Juke!

ADD REPLY
0
Entering edit mode

Hi Juke,

I am trying to install it and I can't.

enter image description here

I have Perl 5.30 and R 4.0.2, but conda doesn't work with

conda install -c bioconda agat

ADD REPLY
1
Entering edit mode

Please create a new environment when installing packages. That avoids conflicts.

conda create -c bioconda -n agat agat 
conda activate agat
ADD REPLY
0
Entering edit mode

Thank you! I am going to try it :)

ADD REPLY
0
Entering edit mode

Hi Juke!

I'm trying to put the two GTFs together, but I've noticed that GTF 1 only has exons, so it doesn't count as having any genes, CDs... etc because it was created by a RNAseq of the hole genome.

enter image description here

What can I do to annotate it correctly? Because like this with functions agat_sp_complement_annotations.pl and agat_sp_merge_annotations.pl can not count genes.

Thanks!

ADD REPLY
0
Entering edit mode

Could you show few lines of the original file? You will probably need to create proper gene feature first using an appropriate attribute (using agat_convert_sp_gxf2gxf.pl with --locus_tag).

ADD REPLY
0
Entering edit mode

This is a RNA-seq of the animal, it is data from a paper, I only know that they created this from a RNAseq.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6095461/

When you talk about "Original file", are you talking about the Ensembl GTF reference genome?

enter image description here

How does --locus_tag work?

Thanks

ADD REPLY
0
Entering edit mode

I have done it with gffread -E --keep-genes

And it work it!

Does you recommend me to tranform all my files to Perl format with agat_convert_sp_gxf2gxf.pl?, or can I merge them without transforming the data?

Can I merge the Refseq GFF with the Genbank GFF? Or should I transform them into the same format?

Thanks for your help! This tool is amazing :)

ADD REPLY
1
Entering edit mode

I think gffread does not deal with all feature types, e.g. 3utr, 5utr, start_codon, stop_codon, (ncRNA?) etc. will be removed. If you just need gene,transcipt,exon,cds it should be fine.

You can merge any GTF/GFF annotation together with AGAT. The first step of most of the AGAT tools is to fix the file to standardize it. (e.g. a file containing only exon will be modified to create mRNA and gene features). If you need to use the --locus_tag option you will have to process the file agat_convert_sp_gxf2gxf.pl first before running any other tool.

The only problem that might occur is when you have a file containing only level3 feature (level1=gene, level2=mRNA,level3=CDS,exon,UTR,etc ) and no relationship described. That might be the case you described previously. Lets look into details:

  • Case1 only feature level3 but ID/Parent(GFF) or gene_id/transcript_id(GTF) relationship:

Input:

chr12   HAVANA  exon    100 500 .   +   .   ID=exon1;Parent=transcript1  
chr12   HAVANA  CDS 100 500 .   +   0   ID=cds-1;Parent=transcript1
chr12   HAVANA  exon    700 900 .   +   .   ID=exonb;Parent=transcriptb  
chr12   HAVANA  CDS 700 900 .   +   0   ID=cds-b;Parent=transcriptb

Output:

chr12   HAVANA  gene    100 500 .   +   .   ID=gene1 
chr12   HAVANA  mRNA    100 500 .   +   .   ID=transcript1;Parent=gene1
chr12   HAVANA  exon    100 500 .   +   .   ID=exon1;Parent=transcript1
chr12   HAVANA  CDS 100 500 .   +   0   ID=cds-1;Parent=transcript1
chr12   HAVANA  gene    700 900 .   +   .   ID=geneb 
chr12   HAVANA  mRNA    700 900 .   +   .   ID=transcriptb;Parent=geneb 
chr12   HAVANA  exon    700 900 .   +   .   ID=exon1;Parent=transcriptb
chr12   HAVANA  CDS 700 900 .   +   0   ID=cds-1;Parent=transcriptb

=> Everything is fine

  • Case2 only feature level3 but relationship made with a comon tag (e.g. locus_tag or gene_id by defaut):

Input:

chr12   HAVANA  exon    100 500 .   +   .   ID=exon1;gene_id=gene1 
chr12   HAVANA  CDS 100 500 .   +   0   ID=cds-1;gene_id=gene1
chr12   HAVANA  exon    700 900 .   +   .   ID=exonb;gene_id=geneb 
chr12   HAVANA  CDS 700 900 .   +   0   ID=cds-b;gene_id=geneb

Output:

chr12   HAVANA  gene    100 500 .   +   .   ID=gene1;gene_id=gene1 
chr12   HAVANA  mRNA    100 500 .   +   0   ID=transcript1;Parent=geneb;gene_id=gene1
chr12   HAVANA  exon    100 500 .   +   .   ID=exon1;Parent=transcript1;gene_id=gene1 
chr12   HAVANA  CDS 100 500 .   +   0   ID=cds-1;Parent=transcript1;gene_id=gene1
chr12   HAVANA  gene    700 900 .   +   .   ID=geneb;gene_id=geneb 
chr12   HAVANA  mRNA    700 900 .   +   0   ID=transcriptb;Parent=geneb;gene_id=geneb
chr12   HAVANA  exon    700 900 .   +   .   ID=exonb;Parent=transcriptb;gene_id=geneb 
chr12   HAVANA  CDS 700 900 .   +   0   ID=cds-b;Parent=transcriptb;gene_id=geneb

=> Everything is fine. If the comon tag is not gene_id or locus_tag, you have to inform AGAT which attribute to use to group features together

  • Case3 only feature level3 but no ID/Parent relationship and no comon locus tag:

Input:

chr12   HAVANA  exon    100 500 .   +   .   ID=exon1
chr12   HAVANA  CDS 100 500 .   +   0   ID=cds-1
chr12   HAVANA  exon    700 900 .   +   .   ID=exonb
chr12   HAVANA  CDS 700 900 .   +   0   ID=cds-b

Output:

chr12   HAVANA  gene    100 900 .   +   .   ID=metaGENE
chr12   HAVANA  mRNA    100 900 .   +   .   ID=metamRNA;Parent=metaGENE
chr12   HAVANA  exon    100 500 .   +   .   ID=exon1;Parent=metamRNA
chr12   HAVANA  CDS 100 500 .   +   0   ID=cds-1;Parent=metamRNA
chr12   HAVANA  exon    700 900 .   +   .   ID=exonb;Parent=metamRNA
chr12   HAVANA  CDS 700 900 .   +   0   ID=cds-b;Parent=metamRNA

=> This is WRONG. All features level 3 will be attached to a uniq gene.

So, I hope you didn't end up in this case3 which is very rare (It should not exists because this type of input is actually not a GFF or GTF but I have already seen it).

ADD REPLY
0
Entering edit mode

Actually is the last one, I changed the file with agat_convert_sp_gxf2gxf.pl --g MergedWalsh.gtf -c locus_tag -o MergedWalsh_perl.gtf

But:

enter image description here

With gffread only the genes/transcripts were created and that's enough.

Now I am trying to merge them with agat_sp_merge_annotations.pl but this is happening:

enter image description here

The gene is the same (with same coord), but it is duplicated. ¿Why is happening this?

ADD REPLY
1
Entering edit mode

locus_tag and gene_id are already used by default by AGAT. No need to do -c locus_tag, use '-c' only if you want to use another attribute as comon tag.

The output you show from agat_sp_merge_annotations.pl is unexpected. Could you send me over the two files you used as input? Or only the part containing those records?

ADD REPLY
0
Entering edit mode

Yes! I have just send it to you :)

ADD REPLY
1
Entering edit mode

Ok I see the problem. Merge locus works if CDS of a gene overlaps the CDS of another gene, or if an exon of a gene (that do not hane any CDS) overlaps the exon of another gene (that do not have CDS neither). But In your case you have one gene with CDS and the other gene wihtout CDS but only exon.

I will have to update that, it should be merged but only if the level2 is the same for both loci (here transcript). I will open an issue in the repo.

ADD REPLY
1
Entering edit mode

I saw another thing that explain why AGATdidn't succeed to deal with your cufflinks file properly. It detects two possible version version of the GFF3 and GFF2, and decide to use GFF3 parser while it should use GFF2. It can be fixed using --gvi 2

ADD REPLY
0
Entering edit mode

Okay thanks! We will keep in contact by the repo/email!

ADD REPLY
1
Entering edit mode

It is fixed with other improvments in version 0.6.2 and it is available in conda

ADD REPLY
0
Entering edit mode

Hi Juke,

I have tried it, and now it only counts as gene one of the two if it is duplicated in coordinates (now it counts the real genes that we have), however, it keeps leaving it as transcripts the 2, and my goal is that if these two transcripts are duplicated and they agree on the coordinates, that in the final file there were not 2 transcribed with the same coordinates. Any suggestion?

enter image description here

Thank you so much for everything!

ADD REPLY
1
Entering edit mode

I had a try like that:
agat_sp_merge_annotations.pl -f MergedWalsh.gtf -f Mustela_putorius_furo.MusPutFur1.0.100_with_607genes.gtf -o resut_merged.gtf

and I have this result:

GL897338.1  ensembl gene    69095   70039   .   -   .   ID=ENSMPUG00000019557;gene_biotype=protein_coding;gene_id=ENSMPUG00000019557;gene_source=ensembl;gene_version=1
GL897338.1  ensembl transcript  69095   70039   .   -   .   ID=ENSMPUT00000019709;Parent=ENSMPUG00000019557;gene_biotype=protein_coding;gene_id=ENSMPUG00000019557;gene_source=ensembl;gene_version=1;transcript_biotype=protein_coding;transcript_id=ENSMPUT00000019709;transcript_source=ensembl;transcript_version=1
GL897338.1  ensembl exon    69095   70039   .   -   .   ID=ENSMPUE00000195156;Parent=ENSMPUT00000019709;exon_id=ENSMPUE00000195156;exon_number=1;exon_version=1;gene_biotype=protein_coding;gene_id=ENSMPUG00000019557;gene_source=ensembl;gene_version=1;transcript_biotype=protein_coding;transcript_id=ENSMPUT00000019709;transcript_source=ensembl;transcript_version=1
GL897338.1  ensembl CDS 69095   70039   .   -   0   ID=cds-188326;Parent=ENSMPUT00000019709;exon_number=1;gene_biotype=protein_coding;gene_id=ENSMPUG00000019557;gene_source=ensembl;gene_version=1;protein_id=ENSMPUP00000019429;protein_version=1;transcript_biotype=protein_coding;transcript_id=ENSMPUT00000019709;transcript_source=ensembl;transcript_version=1
GL897338.1  Cufflinks   gene    69095   70039   .   -   .   ID=XLOC_038707;class_code=%3D;exon_number=1;gene_id=XLOC_038707;gene_name=ENSMPUG00000019557;nearest_ref=ENSMPUT00000019709;oId=ENSMPUT00000019709;p_id=P19523;transcript_id=TCONS_00071361;tss_id=TSS52175
GL897338.1  Cufflinks   RNA 69095   70039   .   -   .   ID=TCONS_00071361;Parent=XLOC_038707;class_code=%3D;exon_number=1;gene_id=XLOC_038707;gene_name=ENSMPUG00000019557;nearest_ref=ENSMPUT00000019709;oId=ENSMPUT00000019709;p_id=P19523;transcript_id=TCONS_00071361;tss_id=TSS52175
GL897338.1  Cufflinks   exon    69095   70039   .   -   .   ID=exon-585765;Parent=TCONS_00071361;class_code=%3D;exon_number=1;gene_id=XLOC_038707;gene_name=ENSMPUG00000019557;nearest_ref=ENSMPUT00000019709;oId=ENSMPUT00000019709;p_id=P19523;transcript_id=TCONS_00071361;tss_id=TSS52175

The thing here is that RNA and transcript is seen as two different level2 features, so it is not merged together. RNA is created by AGAT. I should probably modify it to call it transcript (but we would have mRNA we would have same problem that avoid feature to be merged, they must be called the same way.).

So you should first parse the cufflinks output MergedWalsh.gtf with agat_convert_sp_gxf2gxf.pl:

agat_convert_sp_gxf2gxf.pl --gff MergedWalsh.gtf -o MergedWalsh_clean.gff

then replace RNA from the third column of the output by transcript (awk '{if($3=="RNA"){$3="transcript"} print $0}' MergedWalsh_clean.gff > MergedWalsh_clean_transcript.gff) and rerun the merge script using MergedWalsh_output_transcript.gff instead of MergedWalsh.gtf.

ADD REPLY
0
Entering edit mode

First I have done gffread -E --keep-genes MergedWalsh.gtf -o- > MergedWalsh_annotated.gff3 to convert MergedWalsh.gtf like this:

AEYP01108159.1  Cufflinks   exon    14353   14567   .   -   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "MAGED2"; oId "CUFF.7.4"; nearest_ref "ENSMPUT00000008184"; class_code "j"; tss_id "TSS1";
AEYP01108159.1  Cufflinks   exon    14757   15175   .   -   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "MAGED2"; oId "CUFF.7.4"; nearest_ref "ENSMPUT00000008184"; class_code "j"; tss_id "TSS1";
AEYP01108159.1  Cufflinks   exon    15562   15676   .   -   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; gene_name "MAGED2"; oId "CUFF.7.4"; nearest_ref "ENSMPUT00000008184"; class_code "j"; tss_id "TSS1";
AEYP01108159.1  Cufflinks   exon    16759   17191   .   -   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "4"; gene_name "MAGED2"; oId "CUFF.7.4"; nearest_ref "ENSMPUT00000008184"; class_code "j"; tss_id "TSS1";
AEYP01108159.1  Cufflinks   exon    17285   17364   .   -   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "5"; gene_name "MAGED2"; oId "CUFF.7.4"; nearest_ref "ENSMPUT00000008184"; class_code "j"; tss_id "TSS1";
AEYP01108159.1  Cufflinks   exon    18063   18157   .   -   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "6"; gene_name "MAGED2"; oId "CUFF.7.4"; nearest_ref "ENSMPUT00000008184"; class_code "j"; tss_id "TSS1";
AEYP01108159.1  Cufflinks   exon    18716   18795   .   -   .   gene_id 

To this:

AEYP01108159.1  Cufflinks   gene    14353   22773   .   -   .   ID=XLOC_000001;Name=MAGED2  
AEYP01108159.1  Cufflinks   transcript  14353   22773   .   -   .   ID=TCONS_00000005;Parent=XLOC_000001    
AEYP01108159.1  Cufflinks   exon    14353   14567   .   -   .   Parent=TCONS_00000005    
AEYP01108159.1  Cufflinks   exon    14757   15175   .   -   .   Parent=TCONS_00000005    
AEYP01108159.1  Cufflinks   exon    15562   15676   .   -   .   Parent=TCONS_00000005    
AEYP01108159.1  Cufflinks   exon    16759   16821   .   -   .   Parent=TCONS_00000005

And after that I did agat_convert_sp_gxf2gxf.pl --g MergedWalsh_annotated.gff3 -o MergedWalsh_perl.gff

Giving me this result:

GL898388.1  Cufflinks   gene    3767    4268    .   .   .   ID=XLOC_039708
GL898388.1  Cufflinks   transcript  3767    4268    .   .   .   ID=TCONS_00072714;Parent=XLOC_039708
GL898388.1  Cufflinks   exon    3767    4268    .   .   .   ID=exon-590982;Parent=TCONS_00072714
AEYP01111710.1  Cufflinks   gene    21  3672    .   -   .   ID=XLOC_000075;Name=ENSMPUG00000009623
AEYP01111710.1  Cufflinks   transcript  21  3672    .   -   .   ID=TCONS_00000109;Parent=XLOC_000075
AEYP01111710.1  Cufflinks   exon    21  93  .   -   .   ID=exon-408;Parent=TCONS_00000109
AEYP01111710.1  Cufflinks   exon    575 685 .   -   .   ID=exon-409;Parent=TCONS_00000109
AEYP01111710.1  Cufflinks   exon    3597    3672    .   -   .   ID=exon-410;Parent=TCONS_00000109

And after transforming the GTF reference using agat_convert_sp_gxf2gxf.pl --g Mustela_putorius_furo.MusPutFur1.0.100_with_607genes.gtf -o genome_with_607genes_perl.gff

I merged both using agat_sp_merge_annotations.pl --gff MergedWalsh_perl.gff --gff genome_with_607genes_perl.gff --out Walsh_with_607_genes_final2, giving me the result that I mentioned in the answer before.

Look at the final result, it does not have 'RNA' in level2 features. Now the "gene" level2 feature is not duplicated, but the transcript it is still the same.

GL897338.1  ensembl gene    69095   70039   .   -   .   ID=ENSMPUG00000019557;gene_biotype=protein_coding;gene_id=ENSMPUG00000019557;gene_source=ensembl;gene_version=1
GL897338.1  ensembl transcript  69095   70039   .   -   .   ID=ENSMPUT00000019709;Parent=ENSMPUG00000019557;gene_biotype=protein_coding;gene_id=ENSMPUG00000019557;gene_source=ensembl;gene_version=1;transcript_biotype=protein_coding;transcript_id=ENSMPUT00000019709;transcript_source=ensembl;transcript_version=1
GL897338.1  ensembl exon    69095   70039   .   -   .   ID=ENSMPUE00000195156;Parent=ENSMPUT00000019709;exon_id=ENSMPUE00000195156;exon_number=1;exon_version=1;gene_biotype=protein_coding;gene_id=ENSMPUG00000019557;gene_source=ensembl;gene_version=1;transcript_biotype=protein_coding;transcript_id=ENSMPUT00000019709;transcript_source=ensembl;transcript_version=1
GL897338.1  ensembl CDS 69095   70039   .   -   0   ID=cds-188326;Parent=ENSMPUT00000019709;exon_number=1;gene_biotype=protein_coding;gene_id=ENSMPUG00000019557;gene_source=ensembl;gene_version=1;protein_id=ENSMPUP00000019429;protein_version=1;transcript_biotype=protein_coding;transcript_id=ENSMPUT00000019709;transcript_source=ensembl;transcript_version=1
GL897338.1  Cufflinks   transcript  69095   70039   .   -   .   ID=TCONS_00071361;Parent=ENSMPUG00000019557
GL897338.1  Cufflinks   exon    69095   70039   .   -   .   ID=exon-585762;Parent=TCONS_00071361
GL897338.1  Cufflinks   gene    365401  365869  .   .   .   ID=XLOC_038708
GL897338.1  Cufflinks   transcript  365401  365869  .   .   .   ID=TCONS_00071362;Parent=XLOC_038708
GL897338.1  Cufflinks   exon    365401  365869  .   .   .   ID=exon-585763;Parent=TCONS_00071362

I have send it to you all of these files if you want to check something :)

Thank youuke

ADD REPLY
1
Entering edit mode

Ok i see this is the expected behavior:

The transcript

GL897338.1  ensembl transcript  69095   70039   .   -   .   ID=ENSMPUT00000019709;Parent=ENSMPUG00000019557;gene_biotype=protein_coding;gene_id=ENSMPUG00000019557;gene_source=ensembl;gene_version=1;transcript_biotype=protein_coding;transcript_id=ENSMPUT00000019709;transcript_source=ensembl;transcript_version=1
GL897338.1  ensembl exon    69095   70039   .   -   .   ID=ENSMPUE00000195156;Parent=ENSMPUT00000019709;exon_id=ENSMPUE00000195156;exon_number=1;exon_version=1;gene_biotype=protein_coding;gene_id=ENSMPUG00000019557;gene_source=ensembl;gene_version=1;transcript_biotype=protein_coding;transcript_id=ENSMPUT00000019709;transcript_source=ensembl;transcript_version=1
GL897338.1  ensembl CDS 69095   70039   .   -   0   ID=cds-188326;Parent=ENSMPUT00000019709;exon_number=1;gene_biotype=protein_coding;gene_id=ENSMPUG00000019557;gene_source=ensembl;gene_version=1;protein_id=ENSMPUP00000019429;protein_version=1;transcript_biotype=protein_coding;transcript_id=ENSMPUT00000019709;transcript_source=ensembl;transcript_version=1

is different of:

GL897338.1  Cufflinks   transcript  69095   70039   .   -   .   ID=TCONS_00071361;Parent=ENSMPUG00000019557
GL897338.1  Cufflinks   exon    69095   70039   .   -   .   ID=exon-585762;Parent=TCONS_00071361

Because the first one has a CDS.

I might add extra rules to detect identical isoforms. I will think at it.

ADD REPLY
0
Entering edit mode

It would be very useful (not only to detect identical isoforms but also to delete them), because in this case probably the second one also had a CDS, but as far as we started from a GTF with all exons MergedWalsh.gtf:

AEYP01108159.1  Cufflinks   exon    14353   14567   .   -   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "MAGED2"; oId "CUFF.7.4"; nearest_ref "ENSMPUT00000008184"; class_code "j"; tss_id "TSS1";
AEYP01108159.1  Cufflinks   exon    14757   15175   .   -   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "MAGED2"; oId "CUFF.7.4"; nearest_ref "ENSMPUT00000008184"; class_code "j"; tss_id "TSS1";
AEYP01108159.1  Cufflinks   exon    15562   15676   .   -   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; gene_name "MAGED2"; oId "CUFF.7.4"; nearest_ref "ENSMPUT00000008184"; class_code "j"; tss_id "TSS1";
AEYP01108159.1  Cufflinks   exon    16759   17191   .   -   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "4"; gene_name "MAGED2"; oId "CUFF.7.4"; nearest_ref "ENSMPUT00000008184"; class_code "j"; tss_id "TSS1";
AEYP01108159.1  Cufflinks   exon    17285   17364   .   -   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "5"; gene_name "MAGED2"; oId "CUFF.7.4"; nearest_ref "ENSMPUT00000008184"; class_code "j"; tss_id "TSS1";
AEYP01108159.1  Cufflinks   exon    18063   18157   .   -   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "6"; gene_name "MAGED2"; oId "CUFF.7.4"; nearest_ref "ENSMPUT00000008184"; class_code "j"; tss_id "TSS1";

And gffread only annotated the gene and transcript, we do not have the CDS, but probably it is the same transcript because the length of the transcript and the exon is the same.

For me it is okay because I wanted to increase the coverage of a RNA-seq, and have at the end more genes (the isoforms do not matter at all) but I wanted to know how was the best way to create the GFF.

Thank you and we will keep in contact if you add this extra tool to detect identical isoforms :)

Thank you and best regards,

Rafa

ADD REPLY
0
Entering edit mode

Hi Juke!

I am having problems using data from NCBI. I have these genes:

#gtf-version 2.2
#!genome-build MusPutFur1.0
#!genome-build-accession NCBI_Assembly:GCF_000215625.1
#!annotation-source NCBI Mustela putorius furo Annotation Release 101
NC_020638.1 RefSeq  gene    2743    3697    .   +   .   gene_id "ND1"; db_xref "GeneID:14841810"; gbkey "Gene"; gene "ND1"; gene_biotype "protein_coding"; 
NC_020638.1 RefSeq  CDS 2743    3694    .   +   0   gene_id "ND1"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "ND1"; note "TAA stop codon is completed by the addition of 3' A residues to the mRNA"; product "NADH dehydrogenase subunit 1"; protein_id "YP_007625250.1"; transl_except "(pos:3697..3697,aa:TERM)"; transl_table "2"; 
NC_020638.1 RefSeq  start_codon 2743    2745    .   +   0   gene_id "ND1"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "ND1"; note "TAA stop codon is completed by the addition of 3' A residues to the mRNA"; product "NADH dehydrogenase subunit 1"; protein_id "YP_007625250.1"; transl_except "(pos:3697..3697,aa:TERM)"; transl_table "2"; 
NC_020638.1 RefSeq  stop_codon  3695    3697    .   +   0   gene_id "ND1"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "ND1"; note "TAA stop codon is completed by the addition of 3' A residues to the mRNA"; product "NADH dehydrogenase subunit 1"; protein_id "YP_007625250.1"; transl_except "(pos:3697..3697,aa:TERM)"; transl_table "2"; 
NC_020638.1 RefSeq  gene    3909    4950    .   +   .   gene_id "ND2"; db_xref "GeneID:14841798"; gbkey "Gene"; gene "ND2"; gene_biotype "protein_coding"; 
NC_020638.1 RefSeq  CDS 3909    4947    .   +   0   gene_id "ND2"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "ND2"; note "TAA stop codon is completed by the addition of 3' A residues to the mRNA"; product "NADH dehydrogenase subunit 2"; protein_id "YP_007625251.1"; transl_except "(pos:4950..4950,aa:TERM)"; transl_table "2"; 
NC_020638.1 RefSeq  start_codon 3909    3911    .   +   0   gene_id "ND2"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "ND2"; note "TAA stop codon is completed by the addition of 3' A residues to the mRNA"; product "NADH dehydrogenase subunit 2"; protein_id "YP_007625251.1"; transl_except "(pos:4950..4950,aa:TERM)"; transl_table "2"; 
NC_020638.1 RefSeq  stop_codon  4948    4950    .   +   0   gene_id "ND2"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "ND2"; note "TAA stop codon is completed by the addition of 3' A residues to the mRNA"; product "NADH dehydrogenase subunit 2"; protein_id "YP_007625251.1"; transl_except "(pos:4950..4950,aa:TERM)"; transl_table "2"; 
NC_020638.1 RefSeq  gene    5343    6887    .   +   .   gene_id "COX1"; db_xref "GeneID:14841799"; gbkey "Gene"; gene "COX1"; gene_biotype "protein_coding"; 
NC_020638.1 RefSeq  CDS 5343    6884    .   +   0   gene_id "COX1"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "COX1"; product "cytochrome c oxidase subunit I"; protein_id "YP_007625252.1"; transl_table "2"; 
NC_020638.1 RefSeq  start_codon 5343    5345    .   +   0   gene_id "COX1"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "COX1"; product "cytochrome c oxidase subunit I"; protein_id "YP_007625252.1"; transl_table "2"; 
NC_020638.1 RefSeq  stop_codon  6885    6887    .   +   0   gene_id "COX1"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "COX1"; product "cytochrome c oxidase subunit I"; protein_id "YP_007625252.1"; transl_table "2"; 

I am trying to agat_convert_sp_gxf2gxf.pl --gff mitochondrial_genes.gtf -o genes_mitocondriales_perl.gff but i happens the same, look:

##gff-version 3
#gtf-version 2.2
#!genome-build MusPutFur1.0
#!genome-build-accession NCBI_Assembly:GCF_000215625.1
#!annotation-source NCBI Mustela putorius furo Annotation Release 101
NC_020638.1 RefSeq  gene    2743    15309   .   +   .   ID=ND3;db_xref=GeneID:14841804;gbkey=Gene;gene=ND3;gene_biotype=protein_coding;gene_id=ND3
NC_020638.1 RefSeq  RNA 2743    15309   .   +   .   ID=unknown_transcript_1;Parent=ND3;db_xref=GeneID:14841804;gbkey=Gene;gene=ND3;gene_biotype=protein_coding;gene_id=ND3
NC_020638.1 RefSeq  exon    2743    3697    .   +   0   ID=exon-1;Parent=unknown_transcript_1;gbkey=exon;gene=ND1;gene_id=ND1;note=TAA stop codon is completed by the addition of 3' A residues to the mRNA;product=NADH dehydrogenase subunit 1;protein_id=YP_007625250.1;transcript_id=unknown_transcript_1;transl_except=(pos:3697..3697%2Caa:TERM);transl_table=2
NC_020638.1 RefSeq  exon    3909    4950    .   +   0   ID=exon-2;Parent=unknown_transcript_1;gbkey=exon;gene=ND2;gene_id=ND2;note=TAA stop codon is completed by the addition of 3' A residues to the mRNA;product=NADH dehydrogenase subunit 2;protein_id=YP_007625251.1;transcript_id=unknown_transcript_1;transl_except=(pos:4950..4950%2Caa:TERM);transl_table=2
NC_020638.1 RefSeq  exon    5343    6887    .   +   0   ID=exon-3;Parent=unknown_transcript_1;gbkey=exon;gene=COX1;gene_id=COX1;product=cytochrome c oxidase subunit I;protein_id=YP_007625252.1;transcript_id=unknown_transcript_1;transl_table=2
NC_020638.1 RefSeq  exon    7026    7709    .   +   0   ID=exon-4;Parent=unknown_transcript_1;gbkey=exon;gene=COX2;gene_id=COX2;product=cytochrome c oxidase subunit II;protein_id=YP_007625253.1;transcript_id=unknown_transcript_1;transl_table=2
NC_020638.1 RefSeq  exon    7781    8624    .   +   0   ID=exon-6;Parent=unknown_transcript_1;gbkey=exon;gene=ATP6;gene_id=ATP6;product=ATP synthase F0 subunit 6;protein_id=YP_007625255.1;transcript_id=unknown_transcript_1;transl_table=2
NC_020638.1 RefSeq  exon    8620    9405    .   +   0   ID=exon-7;Parent=unknown_transcript_1;gbkey=exon;gene=COX3;gene_id=COX3;note=TAA stop codon is completed by the addition of 3' A residues to the mRNA;product=cytochrome c oxidase subunit III;protein_id=YP_007625256.1;transcript_id=unknown_transcript_1;transl_except=(pos:9405..9405%2Caa:TERM);transl_table=2
NC_020638.1 RefSeq  exon    9475    9821    .   +   0   ID=exon-8;Parent=unknown_transcript_1;gbkey=exon;gene=ND3;gene_id=ND3;note=TAA stop codon is completed by the addition of 3' A residues to the mRNA;product=NADH dehydrogenase subunit 3;protein_id=YP_007625257.1;transcript_id=unknown_transcript_1;transl_except=(pos:9820..9821%2Caa:TERM);transl_table=2
....
NC_020638.1 RefSeq  stop_codon  10184   10186   .   +   0   ID=stop_codon-9;Parent=unknown_transcript_1;gbkey=exon;gene=ND4L;gene_id=ND4L;product=NADH dehydrogenase subunit 4L;protein_id=YP_007625258.1;transcript_id=unknown_transcript_1;transl_table=2
NC_020638.1 RefSeq  stop_codon  11555   11557   .   +   0   ID=stop_codon-10;Parent=unknown_transcript_1;gbkey=exon;gene=ND4;gene_id=ND4;note=TAA stop codon is completed by the addition of 3' A residues to the mRNA;product=NADH dehydrogenase subunit 4;protein_id=YP_007625259.1;transcript_id=unknown_transcript_1;transl_except=(pos:11557..11557%2Caa:TERM);transl_table=2
NC_020638.1 RefSeq  stop_codon  13563   13565   .   -   0   ID=stop_codon-12;Parent=unknown_transcript_1;gbkey=exon;gene=ND6;gene_id=ND6;product=NADH dehydrogenase subunit 6;protein_id=YP_007625261.1;transcript_id=unknown_transcript_1;transl_table=2
NC_020638.1 RefSeq  stop_codon  13577   13579   .   +   0   ID=stop_codon-11;Parent=unknown_transcript_1;gbkey=exon;gene=ND5;gene_id=ND5;product=NADH dehydrogenase subunit 5;protein_id=YP_007625260.1;transcript_id=unknown_transcript_1;transl_table=2
NC_020638.1 RefSeq  stop_codon  15307   15309   .   +   0   ID=stop_codon-13;Parent=unknown_transcript_1;gbkey=exon;gene=CYTB;gene_id=CYTB;product=cytochrome b;protein_id=YP_007625262.1;transcript_id=unknown_transcript_1;transl_table=2

I have only one gene. As well, using the tool for the entire NCBI reference GTF, counted less genes than the reference had:

awk '{print $3}' GCF_000215625.1_MusPutFur1.0_genomic.gtf | sort | uniq -c
      4 
 587412 CDS
 667374 exon
  27504 gene
      1 Mustela
  48365 start_codon
  48339 stop_codon

And with the reference of NCBI with agat_sp_statistics.pl --gff GCF_000215625.1_MusPutFur1.0_genomic.gtf -o NCBI_mitochondrial_perl_stat

--------------------------------------------------------------------------------

Compute mrna with isoforms if any

Number of genes                              19681
Number of mrnas                              48095
Number of mrnas with utr both sides          44009
Number of mrnas with at least one utr        46661
Number of cdss                               48095

What can be

ADD REPLY
0
Entering edit mode

Gene_id is not unique within the file. So you merge genes together. You must parse your file without looking at any relationship but grouping feature I the order they appear (sequentially). To do that you must remove the use of a common tag value. Add “--ct fake” to your command.

ADD REPLY
0
Entering edit mode

Hi Juke,

I have opened a new topic with a new question.

Merge 2 GTF using agat avoiding overlapping

ADD REPLY
1
Entering edit mode
3.6 years ago
Michael 55k

I made this perl script to compare two GFF files. It might be useful after some modifications (changing GFF version to 2?). Needs BioPerl installed.

Useage: gffdiff.pl file1.gff, gile2.gff

ADD COMMENT

Login before adding your answer.

Traffic: 2487 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6