Wish you all have a great weekend. I am trying to generate a non-redundant collection of human protein-coding genes from the human reference genome (e.g., ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/CHR_Y/hs_alt_HuRef_chrY.gbk.gz). As we know there can be several protein isoforms associated with one gene. This resulted in a total number of 60,000 proteins. I don't want this many. I just need any one (ideally the longest one) out of the several isoforms per gene. I need the NCBI accession numbers (NP_xxxxxx) of them. Is there anyway I can achieve that? Thank you!
So what is your question? "Is there anyway I can achieve that?" If so, the answer is: "Yes, you can" :)
Ok, but now serious stuff. As your link was only an example I wrote this "Extracting longest CDS" for the Gencode v18 data.
curl -s "ftp://ftp.sanger.ac.uk/pub/gencode/release_18/gencode.v18.annotation.gtf.gz"|
gunzip -c |awk'BEGIN {OFS="\t"}; ($3 ~/CDS/ && $14 ~ /protein_coding/) {print $10,$12,$5-$4}'|tee Transcript_CDS |# gene_id transcript_id CDS_lengthcut -f 1 |sort -u \
> Gene_id_list # Total gene_id listwhile read GENE;dogrep -w "$GENE" Transcript_CDS |tee Gene_data |# For every gene_id extract transcripts & CDS (makes it easier in the following steps) cut -f 2 |sort -u \
> Transcript_id_list # Total transcript_id list (for $GENE)while read TRANSCRIPT;doawk -v TRANSCRIPT="$TRANSCRIPT"'($2==TRANSCRIPT) {LENGTH+=$3} END {print LENGTH"\t"TRANSCRIPT}' Gene_data # For every $TRANSCRIPT sum CDS length done< Transcript_id_list |sort -nr -k1 |head -n1 |cut -f 2 \
>> Longest_CDS_list # Transcript_id's that have longest CDSdone< Gene_id_list
This will only get you transcript_id that has longest CDS (compared to other transcripts). It's not perfect (but you only have to run it once). I also want some feedback from the community on it as I am only starting writing this kind of things.
What you can do with this "Longest_CDS_list"?:
Extract transcript info from the original gtf;
Extract translated sequence from here;
Convert them to NCBI accesion number (my answer doesn't cover this part).
Dear Pgibas: Sorry for my late reply! I have had a very busy week. Your code works on my laptop without any problem. I attempted to understand your idea. Although I haven't used AWK... The code definitely looks great. Thanks for your effort!
Later I worked out a Perl script to do the job. It's a quite verbose piece of script. It works on the human chromosome Y as I posted in this thread. I recommend Pgibas's answer which is definitely easier. Here is mine:
#!/usr/bin/perl
use warnings;
use strict;
my $current_gene="";
my $current_max_size;
my $current_max_protein;while(<>){
chomp;if(/^ gene /){if($current_gene and $current_max_size){
print "$current_max_protein\t$current_max_size\t$current_gene\n";}
my $s=<>;
chomp $s;if($s=~ /^\s+\/gene="(.+)"$/){$current_gene=$1;$current_max_size= 0;$current_max_protein="";
next;}}if(/^ CDS (.+)$/){
my $region=$1;while(1){
my $s=<>;
last if$s=~ /\s+\//;
chomp $s;$region .=$s;}$region=~ s/[^0-9\.,]//g;
my $size= 0;
my @a =split(/,/, $region);
foreach (@a){if(/(\d+)\.\.(\d+)/){$size += abs($1-$2)+1;}}
my $protein="";while(1){
my $s=<>;
chomp $s;if($s=~ /^\s+\/protein_id="(.+)"$/){$protein=$1;
last;}}
print " $protein: $size\n";if($size>$current_max_size){$current_max_size=$size;$current_max_protein=$protein;}
next;}}if($current_gene and $current_max_size){
print "$current_max_protein\t$current_max_size\t$current_gene\n";}
Dear Pgibas: Sorry for my late reply! I have had a very busy week. Your code works on my laptop without any problem. I attempted to understand your idea. Although I haven't used AWK... The code definitely looks great. Thanks for your effort!