Question

How To Retain One Cds Out Of Multiple Isoforms In Human Genomic Data

0

Entering edit mode

11.8 years ago

qiyunzhu ▴ 430

Dear all,

Wish you all have a great weekend. I am trying to generate a non-redundant collection of human protein-coding genes from the human reference genome (e.g., ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/CHR_Y/hs_alt_HuRef_chrY.gbk.gz). As we know there can be several protein isoforms associated with one gene. This resulted in a total number of 60,000 proteins. I don't want this many. I just need any one (ideally the longest one) out of the several isoforms per gene. I need the NCBI accession numbers (NP_xxxxxx) of them. Is there anyway I can achieve that? Thank you!

Best,

genomics cds • 4.1k views

ADD COMMENT • link 11.8 years ago by qiyunzhu ▴ 430

Ram · Answer 1 · 2013-10-19

So what is your question? "Is there anyway I can achieve that?" If so, the answer is: "Yes, you can" :)

Ok, but now serious stuff. As your link was only an example I wrote this "Extracting longest CDS" for the Gencode v18 data.

curl -s "ftp://ftp.sanger.ac.uk/pub/gencode/release_18/gencode.v18.annotation.gtf.gz" |
   gunzip -c |
   awk 'BEGIN {OFS="\t"}; ($3 ~/CDS/ && $14 ~ /protein_coding/) {print $10,$12,$5-$4}' |
   tee Transcript_CDS |          # gene_id transcript_id CDS_length
   cut -f 1 | 
   sort -u \
> Gene_id_list                   # Total gene_id list

while read GENE; do
   grep -w "$GENE" Transcript_CDS |  
      tee Gene_data |            # For every gene_id extract transcripts & CDS (makes it easier in the following steps) 
      cut -f 2 | 
      sort -u \
   > Transcript_id_list          # Total transcript_id list (for $GENE)
   while read TRANSCRIPT; do
      awk -v TRANSCRIPT="$TRANSCRIPT" '($2==TRANSCRIPT) {LENGTH+=$3} END {print LENGTH"\t"TRANSCRIPT}'  Gene_data          # For every $TRANSCRIPT sum CDS length 
   done < Transcript_id_list |
      sort -nr -k1 |
      head -n1 |
      cut -f 2 \
   >> Longest_CDS_list           # Transcript_id's that have longest CDS
done < Gene_id_list

This will only get you transcript_id that has longest CDS (compared to other transcripts). It's not perfect (but you only have to run it once). I also want some feedback from the community on it as I am only starting writing this kind of things. What you can do with this "Longest_CDS_list"?:

Extract transcript info from the original gtf;
Extract translated sequence from here;
Convert them to NCBI accesion number (my answer doesn't cover this part).

Wish you have a great weekend too! :)

Ram · Answer 2 · 2013-10-26

Later I worked out a Perl script to do the job. It's a quite verbose piece of script. It works on the human chromosome Y as I posted in this thread. I recommend Pgibas's answer which is definitely easier. Here is mine:

#!/usr/bin/perl

use warnings;
use strict;

my $current_gene = "";
my $current_max_size;
my $current_max_protein;

while (<>){
    chomp;
    if (/^     gene            /){
        if ($current_gene and $current_max_size){
            print "$current_max_protein\t$current_max_size\t$current_gene\n";
        }
        my $s = <>;
        chomp $s;
        if ($s =~ /^\s+\/gene="(.+)"$/){
            $current_gene = $1;
            $current_max_size = 0;
            $current_max_protein = "";
            next;
        }
    }
    if (/^     CDS             (.+)$/){
        my $region = $1;
        while (1){
            my $s = <>;
            last if $s =~ /\s+\//;
            chomp $s;
            $region .= $s;
        }
        $region =~ s/[^0-9\.,]//g;
        my $size = 0;
        my @a = split (/,/, $region);
        foreach (@a){
            if (/(\d+)\.\.(\d+)/){
                $size += abs($1-$2)+1;
            }
        }
        my $protein = "";
        while (1){
            my $s = <>;
            chomp $s;
            if ($s =~ /^\s+\/protein_id="(.+)"$/){
                $protein = $1;
                last;
            }
        }
        print "  $protein: $size\n";
        if ($size > $current_max_size){
            $current_max_size = $size;
            $current_max_protein = $protein;
        }
        next;
    }
}
if ($current_gene and $current_max_size){
    print "$current_max_protein\t$current_max_size\t$current_gene\n";
}