Wish you all have a great weekend. I am trying to generate a non-redundant collection of human protein-coding genes from the human reference genome (e.g., ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/CHR_Y/hs_alt_HuRef_chrY.gbk.gz). As we know there can be several protein isoforms associated with one gene. This resulted in a total number of 60,000 proteins. I don't want this many. I just need any one (ideally the longest one) out of the several isoforms per gene. I need the NCBI accession numbers (NP_xxxxxx) of them. Is there anyway I can achieve that? Thank you!
So what is your question? "Is there anyway I can achieve that?" If so, the answer is: "Yes, you can" :)
Ok, but now serious stuff. As your link was only an example I wrote this "Extracting longest CDS" for the Gencode v18 data.
curl -s "ftp://ftp.sanger.ac.uk/pub/gencode/release_18/gencode.v18.annotation.gtf.gz" |
gunzip -c |
awk 'BEGIN {OFS="\t"}; ($3 ~/CDS/ && $14 ~ /protein_coding/) {print $10,$12,$5-$4}' |
tee Transcript_CDS | # gene_id transcript_id CDS_length
cut -f 1 |
sort -u \
> Gene_id_list # Total gene_id list
while read GENE; do
grep -w "$GENE" Transcript_CDS |
tee Gene_data | # For every gene_id extract transcripts & CDS (makes it easier in the following steps)
cut -f 2 |
sort -u \
> Transcript_id_list # Total transcript_id list (for $GENE)
while read TRANSCRIPT; do
awk -v TRANSCRIPT="$TRANSCRIPT" '($2==TRANSCRIPT) {LENGTH+=$3} END {print LENGTH"\t"TRANSCRIPT}' Gene_data # For every $TRANSCRIPT sum CDS length
done < Transcript_id_list |
sort -nr -k1 |
head -n1 |
cut -f 2 \
>> Longest_CDS_list # Transcript_id's that have longest CDS
done < Gene_id_list
This will only get you transcript_id that has longest CDS (compared to other transcripts). It's not perfect (but you only have to run it once). I also want some feedback from the community on it as I am only starting writing this kind of things.
What you can do with this "Longest_CDS_list"?:
Extract transcript info from the original gtf;
Extract translated sequence from here;
Convert them to NCBI accesion number (my answer doesn't cover this part).
Dear Pgibas: Sorry for my late reply! I have had a very busy week. Your code works on my laptop without any problem. I attempted to understand your idea. Although I haven't used AWK... The code definitely looks great. Thanks for your effort!
Later I worked out a Perl script to do the job. It's a quite verbose piece of script. It works on the human chromosome Y as I posted in this thread. I recommend Pgibas's answer which is definitely easier. Here is mine:
#!/usr/bin/perl
use warnings;
use strict;
my $current_gene = "";
my $current_max_size;
my $current_max_protein;
while (<>){
chomp;
if (/^ gene /){
if ($current_gene and $current_max_size){
print "$current_max_protein\t$current_max_size\t$current_gene\n";
}
my $s = <>;
chomp $s;
if ($s =~ /^\s+\/gene="(.+)"$/){
$current_gene = $1;
$current_max_size = 0;
$current_max_protein = "";
next;
}
}
if (/^ CDS (.+)$/){
my $region = $1;
while (1){
my $s = <>;
last if $s =~ /\s+\//;
chomp $s;
$region .= $s;
}
$region =~ s/[^0-9\.,]//g;
my $size = 0;
my @a = split (/,/, $region);
foreach (@a){
if (/(\d+)\.\.(\d+)/){
$size += abs($1-$2)+1;
}
}
my $protein = "";
while (1){
my $s = <>;
chomp $s;
if ($s =~ /^\s+\/protein_id="(.+)"$/){
$protein = $1;
last;
}
}
print " $protein: $size\n";
if ($size > $current_max_size){
$current_max_size = $size;
$current_max_protein = $protein;
}
next;
}
}
if ($current_gene and $current_max_size){
print "$current_max_protein\t$current_max_size\t$current_gene\n";
}
Dear Pgibas: Sorry for my late reply! I have had a very busy week. Your code works on my laptop without any problem. I attempted to understand your idea. Although I haven't used AWK... The code definitely looks great. Thanks for your effort!