Adding attributes to a GTF file using AGAT
16 months ago
1769mkc ★ 1.2k

I'm trying to use the AGAT script that adds new attributes from a TSV file to a GTF file.

My files look like this.

The input TSV, which is my reference file:

gene_id Entrez_ID
ENSCAFG00845006432 399518
ENSCAFG00845002136 399530
ENSCAFG00845029798 399544
ENSCAFG00845011460 399545
ENSCAFG00845001610 399653
ENSCAFG00845013158 403157
ENSCAFG00845014982 403168
ENSCAFG00845021967 403170
ENSCAFG00845019241 40340

Next is my GTF file:

#!genome-build ROS_Cfam_1.0
#!genome-version ROS_Cfam_1.0
#!genome-date 2020-09
#!genome-build-accession GCA_014441545.1
#!genebuild-last-updated 2020-10
X       ensembl gene    24550462        24552226        .       -       .       gene_id "ENSCAFG00845015183"; gene_version "1"; gene_source "ensembl"; gene_biotype "protein_coding";
X       ensembl transcript      24550462        24552226        .       -       .       gene_id "ENSCAFG00845015183"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
X       ensembl exon    24552206        24552226        .       -       .       gene_id "ENSCAFG00845015183"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSCAFE00845128634"; exon_version "1"; tag "Ensembl_canonical";
X       ensembl CDS     24552206        24552226        .       -       0       gene_id "ENSCAFG00845015183"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00845021332"; protein_version "1"; tag "Ensembl_canonical";
X       ensembl start_codon     24552224        24552226        .       -       0       gene_id "ENSCAFG00845015183"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";

This is my command and its arguments:

agat_sq_add_attributes_from_tsv.pl --gff Canis_lupus_familiaris.ROS_Cfam_1.0.108.gtf --tsv entrez_id_filtered.tsv -o test_v1.gtf

The head of the new GTF file:

 cat test_v1.gtf | head
##gff-version 3
X       ensembl gene    24550462        24552226        .       -       .
X       ensembl transcript      24550462        24552226        .       -       .
X       ensembl exon    24552206        24552226        .       -       .
X       ensembl CDS     24552206        24552226        .       -       0
X       ensembl start_codon     24552224        24552226        .       -       0
X       ensembl exon    24550462        24551997        .       -       .
X       ensembl CDS     24550462        24551997        .       -       0
X       ensembl gene    24606240        24606309        .       -       .
X       ensembl transcript      24606240        24606309        .       -       .

Tail

 cat test_v1.gtf | tail
JAAUVH010000016.1       ensembl exon    4087    4232    .       -       .
JAAUVH010000221.1       ensembl gene    649     789     .       +       .
JAAUVH010000221.1       ensembl transcript      649     789     .       +       .
JAAUVH010000221.1       ensembl exon    649     789     .       +       .
JAAUVH010000128.1       ensembl gene    2862    3007    .       +       .
JAAUVH010000128.1       ensembl transcript      2862    3007    .       +       .
JAAUVH010000128.1       ensembl exon    2862    3007    .       +       .
JAAUVH010000325.1       ensembl gene    7802    7946    .       +       .
JAAUVH010000325.1       ensembl transcript      7802    7946    .       +       .
JAAUVH010000325.1       ensembl exon    7802    7946    .       +       .

I'm not sure whether this is how the output should look, because I don't see the Entrez_ID tag in the new GTF file; the attribute column is empty on every line shown. Any suggestions would be really helpful.

16 months ago
Juke34 8.9k

Actually, there is a bug in this script: the autodetection of the input format is off. It uses GFF3 while your file is a GTF (GFF2.5), so it does not work as expected. I will push a fix.
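
To illustrate the distinction that format detection has to make, here is a minimal Python sketch that guesses whether the attribute column of a feature line is GTF-style (`key "value";`) or GFF3-style (`key=value`). It is only a heuristic for illustration, not AGAT's actual detection logic, and the function name `guess_attribute_style` is hypothetical.

```python
import re

# GTF attributes look like: gene_id "X"; gene_version "1";
GTF_PAIR = re.compile(r'^\s*\S+\s+"[^"]*"\s*;?\s*$')
# GFF3 attributes look like: ID=gene1;Name=foo
GFF3_PAIR = re.compile(r'^\s*[^=\s]+=.*$')


def guess_attribute_style(line):
    """Return 'gtf', 'gff3', or 'unknown' for one tab-separated feature line."""
    if line.startswith("#"):
        return "unknown"  # header or comment line, no attributes to inspect
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or not fields[8].strip():
        return "unknown"  # attribute column missing or empty
    # Only the first key/value pair is inspected (a simplifying assumption).
    first = fields[8].strip().split(";")[0]
    if GTF_PAIR.match(first):
        return "gtf"
    if GFF3_PAIR.match(first):
        return "gff3"
    return "unknown"
```

Run on the files in this thread, the original Ensembl lines would be classified as `gtf`, while the lines in test_v1.gtf have no ninth column and would come back as `unknown`.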

It is now fixed (master branch). The fix will be available through conda in the next release.

Thank you so very much. I have tried all sorts of combinations based on the example and still got no desired output, so I thought I was making a mistake.

16 months ago
Mark ★ 1.6k

I think your issue is this:

The first column does not become an attribute, indeed it must contain the feature ID that will be used to know to which feature we will add the attributes.

Your tsv file should be:

ID    Entrez_ID
399518     ENSCAFG00845006432 
399530     ENSCAFG00845002136

I assumed the second column is your unique identifier that matches the GFF file, and the Entrez_ID is the feature you want to add. Whichever way, the first column has to be ID.
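
One way to settle which TSV column actually holds the identifier that matches the annotation is to count matches per column. The following is a minimal Python sketch, assuming the identifiers live in the GTF `gene_id` attribute and that the TSV values contain no spaces. The function names are hypothetical and nothing here uses AGAT.

```python
import re


def gtf_attribute_values(gtf_lines, key):
    """Collect every value of `key` found in GTF attribute columns."""
    pattern = re.compile(r'(?:^|;)\s*' + re.escape(key) + r'\s+"([^"]*)"')
    values = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header lines such as #!genome-build
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # no attribute column on this line
        values.update(pattern.findall(fields[8]))
    return values


def tsv_column_overlap(tsv_lines, known_ids):
    """For each TSV column, count how many of its values occur in known_ids."""
    # Splitting on any whitespace is an assumption; values must not contain spaces.
    rows = [line.split() for line in tsv_lines if line.strip()]
    header, data = rows[0], rows[1:]
    return {
        name: sum(1 for row in data if i < len(row) and row[i] in known_ids)
        for i, name in enumerate(header)
    }
```

The column with the highest count is the one that matches the annotation, and so is the one that belongs first in the TSV.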

Thank you, let me run this and I will update you. It seems I got the data format totally wrong.

* input.tsv:

ID  annot_type1
gene1   annot_x
cds1    annot_y

* input.gff:

chr1    irgsp   gene    1000    2000    .   +   .   ID=gene1  
chr1    irgsp   CDS 2983    3268    .   +   .   ID=cds1  

Based on the example, if I understand correctly, the input TSV file contains a column with a header named ID, and the GFF file also contains that ID tag.

In my GTF file the tag is gene_id, which doesn't work. So do I need to change it to ID?
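
The thread does not settle which header name the AGAT script expects for a GTF input. As a tool-independent fallback, here is a minimal Python sketch that appends a new attribute to GTF lines keyed on `gene_id`. It assumes a two-column TSV with a header row (identifier first, value second) and values without spaces. The function names are hypothetical and this is not how AGAT does it.

```python
import re

# Matches the gene_id attribute in a GTF attribute column.
GENE_ID = re.compile(r'(?:^|;)\s*gene_id\s+"([^"]*)"')


def load_mapping(tsv_lines):
    """Read 'gene_id value' rows; return (attribute name, {gene_id: value})."""
    rows = [line.split() for line in tsv_lines if line.strip()]
    header, data = rows[0], rows[1:]
    # The second header cell becomes the name of the new attribute.
    return header[1], {row[0]: row[1] for row in data if len(row) >= 2}


def add_attribute(gtf_lines, attr_name, mapping):
    """Yield GTF lines, appending attr_name "value"; where gene_id is mapped."""
    for line in gtf_lines:
        stripped = line.rstrip("\n")
        fields = stripped.split("\t")
        if stripped.startswith("#") or len(fields) < 9:
            yield stripped  # headers and malformed lines pass through unchanged
            continue
        match = GENE_ID.search(fields[8])
        if match and match.group(1) in mapping:
            attrs = fields[8].rstrip()
            if not attrs.endswith(";"):
                attrs += ";"
            fields[8] = f'{attrs} {attr_name} "{mapping[match.group(1)]}";'
        yield "\t".join(fields)
```

Note that this adds the attribute to every feature carrying a mapped gene_id (gene, transcript, exon, and so on), not only to the gene line. To use it, read both files with `open(...)`, pass the line iterables to the two functions, and write the yielded lines to a new file.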
