Intersect gene IDs list and GFF3 to get the annotation.
3
0
Entering edit mode
8 months ago

Hi

I need to get the annotation for a list of IDs (Just one column). It looks like this:

GeneID
Tt.00g000720
Tt.00g000730
Tt.00g000780
Tt.00g000850
Tt.00g000950
Tt.00g001260
Tt.00g001550

I have a gff for those genes IDs like this:

contig_101  GenSAS_65ea36c7c3784-publish    gene    319 1590    .   +   .   ID=Tt.00g000010;Name=Tt.00g000010;original_ID=75803-Tt.00g000010;Alias=75803-Tt.00g000010;original_name=75803-Tt.00g000010;Notes=XP_028877667.1 [BLAST protein vs protein (blastp) 2.12.0],XP_009314224.1 [DIAMOND Functional 2.0.11],L-fucokinase (IPR012887) [InterProScan 5.53-87.0],PF07959.7 [Pfam 1.6]
contig_101  GenSAS_65ea36c7c3784-publish    mRNA    319 1590    .   +   .   ID=Tt.00g000010.m01;Name=Tt.00g000010.m01;Parent=Tt.00g000010;original_ID=75803-Tt.00g000010.m01;Alias=75803-Tt.00g000010.m01;original_name=75803-Tt.00g000010
contig_101  GenSAS_65ea36c7c3784-publish    exon    319 1590    .   +   .   ID=Tt.00g000010.m01.exon01;Name=Tt.00g000010.m01.exon01;Parent=Tt.00g000010.m01;original_ID=75803-Tt.00g000010.m01.exon1;Alias=75803-Tt.00g000010.m01.exon1
contig_101  GenSAS_65ea36c7c3784-publish    CDS 319 1590    .   +   0   ID=Tt.00g000010.m01.CDS01;Name=Tt.00g000010.m01.CDS01;Parent=Tt.00g000010.m01;original_ID=cds.75803-Tt.00g000010.m01;Alias=cds.75803-Tt.00g000010.m01
contig_101  GenSAS_65ea36c7c3784-publish    gene    1726    3468    .   +   .   ID=Tt.00g000020;Name=Tt.00g000020;original_ID=75803-Tt.00g000020;Alias=75803-Tt.00g000020;original_name=75803-Tt.00g000020;Notes=XP_028877667.1 [BLAST protein vs protein (blastp) 2.12.0],XP_028877667.1 [DIAMOND Functional 2.0.11],Galactokinase/homoserine kinase (IPR001174) [InterProScan 5.53-87.0],PF00288.21 [Pfam 1.6]
contig_101  GenSAS_65ea36c7c3784-publish    mRNA    1726    3468    .   +   .   ID=Tt.00g000020.m01;Name=Tt.00g000020.m01;Parent=Tt.00g000020;original_ID=75803-Tt.00g000020.m01;Alias=75803-Tt.00g000020.m01;original_name=75803-Tt.00g000020
contig_101  GenSAS_65ea36c7c3784-publish    exon    1726    3468    .   +   .   ID=Tt.00g000020.m01.exon01;Name=Tt.00g000020.m01.exon01;Parent=Tt.00g000020.m01;original_ID=75803-Tt.00g000020.m01.exon1;Alias=75803-Tt.00g000020.m01.exon1
contig_101  GenSAS_65ea36c7c3784-publish    CDS 1726    3468    .   +   0   ID=Tt.00g000020.m01.CDS01;Name=Tt.00g000020.m01.CDS01;Parent=Tt.00g000020.m01;original_ID=cds.75803-Tt.00g000020.m01;Alias=cds.75803-Tt.00g000020.m01
contig_101  GenSAS_65ea36c7c3784-publish    gene    4054    4560    .   +   .   ID=Tt.00g000030;Name=Tt.00g000030;original_ID=75803-Tt.00g000030;Alias=75803-Tt.00g000030;original_name=75803-Tt.00g000030;Notes=XP_803186.1 [BLAST protein vs protein (blastp) 2.12.0],XP_803186.1 [DIAMOND Functional 2.0.11]
contig_101  GenSAS_65ea36c7c3784-publish    mRNA    4054    4560    .   +   .   ID=Tt.00g000030.m01;Name=Tt.00g000030.m01;Parent=Tt.00g000030;original_ID=75803-Tt.00g000030.m01;Alias=75803-Tt.00g000030.m01;original_name=75803-Tt.00g000030
contig_101  GenSAS_65ea36c7c3784-publish    exon    4054    4560    .   +   .   ID=Tt.00g000030.m01.exon01;Name=Tt.00g000030.m01.exon01;Parent=Tt.00g000030.m01;original_ID=75803-Tt.00g000030.m01.exon1;Alias=75803-Tt.00g000030.m01.exon1
contig_101  GenSAS_65ea36c7c3784-publish    CDS 4054    4560    .   +   0   ID=Tt.00g000030.m01.CDS01;Name=Tt.00g000030.m01.CDS01;Parent=Tt.00g000030.m01;original_ID=cds.75803-Tt.00g000030.m01;Alias=cds.75803-Tt.00g000030.m01
contig_101  GenSAS_65ea36c7c3784-publish    gene    5050    6858    .   +   .   ID=Tt.00g000040;Name=Tt.00g000040;original_ID=75803-Tt.00g000040;Alias=75803-Tt.00g000040;original_name=75803-Tt.00g000040;Notes=XP_803185.1 [BLAST protein vs protein (blastp) 2.12.0],XP_807877.1 [DIAMOND Functional 2.0.11],Pyridoxal phosphate-dependent decarboxylase (IPR002129) [InterProScan 5.53-87.0],PF00282.14 [Pfam 1.6]
contig_101  GenSAS_65ea36c7c3784-publish    mRNA    5050    6858    .   +   .   ID=Tt.00g000040.m01;Name=Tt.00g000040.m01;Parent=Tt.00g000040;original_ID=75803-Tt.00g000040.m01;Alias=75803-Tt.00g000040.m01;original_name=75803-Tt.00g000040
contig_101  GenSAS_65ea36c7c3784-publish    exon    5050    6858    .   +   .   ID=Tt.00g000040.m01.exon01;Name=Tt.00g000040.m01.exon01;Parent=Tt.00g000040.m01;original_ID=75803-Tt.00g000040.m01.exon1;Alias=75803-Tt.00g000040.m01.exon1
contig_101  GenSAS_65ea36c7c3784-publish    CDS 5050    6858    .   +   0   ID=Tt.00g000040.m01.CDS01;Name=Tt.00g000040.m01.CDS01;Parent=Tt.00g000040.m01;original_ID=cds.75803-Tt.00g000040.m01;Alias=cds.75803-Tt.00g000040.m01

How can I intersect the IDs with the GFF file to get the annotation for those IDs?. I have tried it using bedtools intersect but it didn't work because I don't have the annotation for those IDs.

Thank you very much for you help!

annotation GFF • 1.2k views
ADD COMMENT
2
Entering edit mode
8 months ago
Ram 44k

One simple way would be to use grep with a tiny bit of input modification:

grep -Ff <(tail -n +2 id_list.txt | sed -Ee 's/(.*)/ID=\1;/') file.gff
ADD COMMENT
0
Entering edit mode

Hey

Thanks for your answer. I tried it but I got an empty output. Could I make something wrong ?

ADD REPLY
0
Entering edit mode

Does your ID list overlap the ID= part of column 9 or a different part? Please provide a better example as there is no overlap between your ID list and any GFF entry.

What is the output to:

grep -m10 Tt.00g000720 file.gff
ADD REPLY
0
Entering edit mode

Hi, This is the result This is the results

ADD REPLY
0
Entering edit mode

You should not get empty output. Are you sure you're running the command right? What is the output to:

head id_list.txt | od -c
ADD REPLY
0
Entering edit mode

It seems like this

enter image description here

ADD REPLY
0
Entering edit mode

That file was created on Windows, correct? Please run dos2unix on it to make it a proper Linux file, then try the grep from the answer.

ADD REPLY
0
Entering edit mode

You were totally right, I change the format of txt created in windows and It worked. Thank you very much for your help!

ADD REPLY
1
Entering edit mode
8 months ago
Juke34 8.9k

With AGAT you can use agat_sp_filter_feature_from_keep_list.pl you can filter the records related to those IDs:

agat_sp_filter_feature_from_keep_list.pl --gff infile.gff --keep_list file.txt  [ --output outfile ]

Where your file.txt looks like that:

Tt.00g000720
Tt.00g000730
Tt.00g000780
Tt.00g000850
Tt.00g000950
Tt.00g001260
Tt.00g001550

Then I guess agat_convert_sp_gff2tsv.pl should give you a way to explore the annotations

ADD COMMENT
0
Entering edit mode
8 months ago

You can also try gffread, the usage instructions below suggest you need parameter --ids :

gffread v0.12.7. Usage:
gffread [-g <genomic_seqs_fasta> | <dir>] [-s <seq_info.fsize>] 
 [-o <outfile>] [-t <trackname>] [-r [<strand>]<chr>:<start>-<end> [-R]]
 [--jmatch <chr>:<start>-<end>] [--no-pseudo] 
 [-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>]
 [-j ][--ids <IDs.lst> | --nids <IDs.lst>] [--attrs <attr-list>] [-i <maxintron>]
 [--stream] [--bed | --gtf | --tlf] [--table <attrlist>] [--sort-by <ref.lst>]
 [<input_gff>] 

 Filter, convert or cluster GFF/GTF/BED records, extract the sequence of
 transcripts (exon or CDS) and more.
 By default (i.e. without -O) only transcripts are processed, discarding any
 other non-transcript features. Default output is a simplified GFF3 with only
 the basic attributes.

Options:
 --ids discard records/transcripts if their IDs are not listed in <IDs.lst>
 --nids discard records/transcripts if their IDs are listed in <IDs.lst>
 ...
ADD COMMENT

Login before adding your answer.

Traffic: 1588 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6