Protein filtering for annotation
3.2 years ago
Ric ▴ 440

Hi, I downloaded 1009 proteins from GenBank. After the filtering below, I end up with 663 amino acid sequences:

$ grep ">" NbenthamianaGenbankAA.fasta | grep -v partial | grep -v like | grep -v unnamed | wc -l
663

However, I noticed many identical descriptions but with different IDs and sequence lengths.

How do I choose which protein to keep from the above example, and are there any better filtering steps?

Thank you in advance,

genbank annotation gene • 1.1k views
3.2 years ago
Mensur Dlakic ★ 28k

Sequences with identical descriptions but different lengths are likely orthologs from different species.

You can filter them by sequence identity using CD-HIT. If you select a 90% identity cutoff, it will cluster together all the sequences that share 90% or more identity and keep a single representative (the longest sequence) from each cluster. That will likely take care of your partial sequences without having to grep them out.

It is up to you to select whatever identity threshold makes most sense. More sequences will be retained at higher cutoffs.
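
As a rough sketch of the clustering step described above, assuming `cd-hit` is installed and on your PATH: the function name `cluster_proteins` and the output file name are made up for illustration. `-c` sets the identity threshold, and `-n 5` is the word size the CD-HIT documentation suggests for protein thresholds between 0.7 and 1.0.

```shell
# Sketch only: cluster proteins at a chosen identity and keep one
# representative (the longest sequence) per cluster.
# cluster_proteins is a hypothetical helper, not part of CD-HIT.
cluster_proteins() {
    in_fasta=$1            # input protein FASTA
    out_fasta=$2           # representatives are written here
    identity=${3:-0.9}     # identity threshold, default 90%

    if ! command -v cd-hit >/dev/null 2>&1; then
        echo "cd-hit not found on PATH" >&2
        return 1
    fi

    # -c: sequence identity threshold
    # -n: word size (5 is suggested for thresholds 0.7 to 1.0)
    cd-hit -i "$in_fasta" -o "$out_fasta" -c "$identity" -n 5
}
```

It could be called as `cluster_proteins NbenthamianaGenbankAA.fasta Nbenthamiana_cd90.fasta 0.9`, after which `grep -c ">" Nbenthamiana_cd90.fasta` reports how many representatives were kept. CD-HIT also writes a `.clstr` file next to the output that lists the members of each cluster.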

3.2 years ago

Different proteins (IDs) sharing the same description are not uncommon; if anything, they are the norm. They are often part of the same gene family. In eukaryotes, especially higher eukaryotes, this is a very likely scenario: gene duplication and similar events increase the number of genes, so these genomes contain many paralogs (genes with often similar or identical function).

So there is no choosing possible. They are all real genes with that function, and none of them is more likely than the others to perform that biological function.
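
To get a feel for how many descriptions are shared by several IDs, something like the following could be used. It is a minimal sketch that assumes GenBank-style headers (`>ID description`); the file `sample_proteins.fasta` and its records are invented purely for illustration.

```shell
# Tiny made-up FASTA so the example is self-contained.
cat > sample_proteins.fasta <<'EOF'
>AAA00001.1 heat shock protein 90 [Nicotiana benthamiana]
MSTE
>AAA00002.1 heat shock protein 90 [Nicotiana benthamiana]
MSTEAA
>AAA00003.1 actin [Nicotiana benthamiana]
MADE
EOF

# Strip ">" and the ID, count each description, and keep only
# descriptions that occur more than once (count, tab, description).
dup_descriptions=$(grep '^>' sample_proteins.fasta \
    | sed 's/^>[^ ]* //' \
    | sort | uniq -c | sort -rn \
    | awk '$1 > 1 { n = $1; $1 = ""; sub(/^ /, ""); print n "\t" $0 }')

echo "$dup_descriptions"
```

Replacing `sample_proteins.fasta` with the real download lists every description that appears under more than one ID, most frequent first.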
