Question

select longest isoform from RefSeq proteome fasta

0

Entering edit mode

5.4 years ago

wd • 0

Hi

I would like to extract the longest isoform from a RefSeq proteome fasta. I looked at suggested solutions on this and other fora but I think none of these would work with a RefSeq proteome fasta.

For example, I would like to select longest isoforms from this fasta: ftp://ftp.ncbi.nih.gov/genomes/Hyalella_azteca/protein/

Headers for this RefSeq proteome look as follows:

gi|1067085875|ref|XP_018026876.1| PREDICTED: calcineurin subunit B type 2 isoform X3 [Hyalella azteca]
gi|1067085873|ref|XP_018026875.1| PREDICTED: calcineurin subunit B type 2 isoform X2 [Hyalella azteca]
gi|1067085871|ref|XP_018026874.1| PREDICTED: calcineurin subunit B type 2 isoform X1 [Hyalella azteca]
gi|1067085879|ref|XP_018026878.1| PREDICTED: calcineurin subunit B type 2 isoform X5 [Hyalella azteca]
gi|1067085877|ref|XP_018026877.1| PREDICTED: calcineurin subunit B type 2 isoform X4 [Hyalella azteca]

Anyone an idea? Any suggestions is much appreciated!

RefSeq fasta proteome isoform • 3.8k views

ADD COMMENT • link updated 5.1 years ago by vkkodali_ncbi ★ 3.8k • written 5.4 years ago by wd • 0

score 1 · Answer 1 · 2019-07-09

Hi wd,

The idea is to first find a relationship mapping between each protein and its parent gene, then combine the mapping with protein length, followed by selecting the longest protein id based on length grouped by gene id.

Try (make sure to have seqkit installed):

curl -s ftp://ftp.ncbi.nih.gov/genomes/Hyalella_azteca/GFF/ref_Hazt_2.0_top_level.gff3.gz \
  | zcat - \
  | awk '$3=="CDS"{gsub(".+;Dbxref=GeneID:", "", $9); gsub(";Name=.+", "", $9); gsub(",Genbank:", "\t", $9); print $9}' \
  | sort -k2,2 \
  | uniq \
  > gene_protein.tsv

seqkit fx2tab -n -l protein.fa \
  | awk 'BEGIN{FS="\t"}{gsub(".+ref\\|", "", $1); gsub("\\|.*$", "", $1); print}' \
  | sort \
  > protein_len.tsv

join -1 2 gene_protein.tsv -2 1 protein_len.tsv \
  | awk 'len[$2]<$3 {len[$2]=$3; id[$2]=$1} END {for (i in id) {print id[i]}}' \
  > protein_longest.txt

seqkit grep -r -n -f protein_longest.txt protein.fa \
  > protein_longest.fa

In your example, gene_protein.tsv, protein_len.tsv, and protein_longest.txt would be:

$ grep 'XP_018026874.1\|XP_018026875.1\|XP_018026876.1\|XP_018026877.1\|XP_018026878.1' gene_protein.tsv protein_len.tsv protein_longest.txt
gene_protein.tsv:108682253  XP_018026874.1
gene_protein.tsv:108682253  XP_018026875.1
gene_protein.tsv:108682253  XP_018026876.1
gene_protein.tsv:108682253  XP_018026877.1
gene_protein.tsv:108682253  XP_018026878.1
protein_len.tsv:XP_018026874.1   213
protein_len.tsv:XP_018026875.1   207
protein_len.tsv:XP_018026876.1   204
protein_len.tsv:XP_018026877.1   179
protein_len.tsv:XP_018026878.1   173
protein_longest.txt:XP_018026874.1

score 0 · Answer 2 · 2019-10-23

I suggest you use the feature_table.txt.gz file for your needs. This file is located in the NCBI genomes FTP path: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/764/305/GCF_000764305.1_Hazt_2.0/

You can use awk to do this as shown below:

cat GCF_000764305.1_Hazt_2.0_feature_table.txt \
  | awk 'BEGIN{FS="\t";OFS="\t"}($1=="CDS"){print $11,$13,$16,$19}' \
  | sort -k3,3 -k4,4nr \
  | awk 'BEGIN{FS="\t";OFS="\t";l=0;g=0}{if($3==g) {if ($4>=l) {print $0; g=$3; l=$4}} else {print $0; g=$3; l=$4} }'

Note, it is not unusual to have two isoforms encoding proteins of the same size. In such instance, my code returns all of those proteins.