Question

How to extract CDS regions and write them to a secondary text file in Perl

1

4.5 years ago

matache.razvan911 • 0

Hello, I have to extract the CDS regions from a chromosome file into a secondary text file. How can I extract all of them? I've read that these regions are treated as tags, and their join/complement locations as secondary tags. I also have to list the ORIGIN section: for each CDS I need to find the nucleotide string in ORIGIN that matches the CDS coordinates and write it to the secondary text file below that CDS. Can somebody help me with this task? I'm a newbie in Perl.

CDS             complement(9413275..9414234)
                     /gene="ZNF266"
                     /gene_synonym="HZF1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Gnomon."
                     /codon_start=1
                     /product="zinc finger protein 266 isoform X3"
                     /protein_id="XP_016881659.1"
                     /db_xref="GeneID:10781"
                     /db_xref="HGNC:HGNC:13059"
                     /db_xref="MIM:604751"
                     /translation="MGTHTGDNPYECKECGKAFTRSCQLTQHRKTHTGEKPYKCKDCG
                     RAFTVSSCLSQHMKIHVGEKPYECKECGIAFTRSSQLTEHLKTHTAKDPFECKICGKS
                     FRNSSCLSDHFRIHTGIKPYKCKDCGKAFTQNSDLTKHARTHSGERPYECKECGKAFA
                     RSSRLSEHTRTHTGEKPFECVKCGKAFAISSNLSGHLRIHTGEKPFECLECGKAFTHS
                     SSLNNHMRTHSAKKPFTCMECGKAFKFPTCVNLHMRIHTGEKPYKCKQCGKSFSYSNS
                     FQLHERTHTGEKPYECKECGKAFSSSSSFRNHERRHADERLSA"
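The location on the CDS line (here `complement(9413275..9414234)`, or a `join(...)` of several ranges) has to be turned into coordinates before anything can be cut from ORIGIN. Below is a minimal sketch of such a parser. The helper name `parse_location` is hypothetical, and the sketch assumes the whole feature lies on one strand and only contains `start..end` ranges; single-base positions and mixed-strand joins are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: parse a GenBank location string into a strand
# ('+' or '-') followed by a list of [start, end] pairs (1-based, inclusive).
sub parse_location {
    my ($loc) = @_;
    $loc =~ s/\s+//g;    # long locations may wrap across lines
    # Assumption: "complement(" anywhere means the whole feature is on the minus strand.
    my $strand = $loc =~ /complement\(/ ? '-' : '+';
    my @ranges;
    # "<" and ">" mark partial ends; they are skipped here.
    while ($loc =~ /[<>]?(\d+)\.\.[<>]?(\d+)/g) {
        push @ranges, [$1, $2];
    }
    return ($strand, @ranges);
}

my ($strand, @ranges) = parse_location('complement(9413275..9414234)');
print "$strand ", join(',', map { "$_->[0]-$_->[1]" } @ranges), "\n";
# prints "- 9413275-9414234"
```

Returning the ranges in file order keeps the later sequence extraction simple: concatenate the pieces first, then reverse-complement once if the strand is `-`.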

perl genome chromosome CDS ORIGIN • 2.2k views

ADD COMMENT • link updated 4.4 years ago by emi_14_ar • 0 • written 4.5 years ago by matache.razvan911 • 0

0

If I understand correctly, you are looking for CDS ranges. If that is all you need, you don't have to parse the GenBank flatfiles. You can get that information from GFF3 or GTF files. It appears that you are interested in human annotation. You can download the GFF3 file for the latest annotation from this FTP path: ftp://ftp.ncbi.nlm.nih.gov//genomes/all/annotation_releases/9606/109.20200228/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz

Then, for example, you can extract the CDS range for XP_016881659.1 using a simple grep; you are likely interested in columns 1, 4, 5, and 7.

zgrep 'XP_016881659' GCF_000001405.39_GRCh38.p13_genomic.gff.gz
NC_000019.10    Gnomon  CDS     9413275 9414234 .       -       0       ID=...<truncated>
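If you would rather pull those four columns out in Perl than read them off by eye, something along these lines should work. It is only a sketch: `cds_range` is a made-up helper name, the input is assumed to be tab-separated GFF3, and the attribute column in the demo line is a placeholder rather than the real (truncated) value.

```perl
use strict;
use warnings;

# Hypothetical helper: return (seqid, start, end, strand) for a GFF3 CDS line,
# or an empty list for comments and non-CDS rows.
sub cds_range {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^#/;              # skip comment/pragma lines
    my @col = split /\t/, $line;
    return unless @col >= 9 && $col[2] eq 'CDS';
    return @col[0, 3, 4, 6];              # columns 1, 4, 5 and 7 (1-based)
}

# Demo line: coordinates taken from the grep output above;
# the ninth column is a placeholder, not the real attribute string.
my $demo = join "\t", 'NC_000019.10', 'Gnomon', 'CDS',
                      9413275, 9414234, '.', '-', 0, 'Name=placeholder';
print join("\t", cds_range($demo)), "\n";
```

To run it over real data, replace the demo with a `while (my $line = <STDIN>)` loop and pipe the `zgrep` output into the script.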

ADD REPLY • link 4.5 years ago by vkkodali_ncbi ★ 3.8k

0

Not quite. This is my task: You must design a program capable of extracting only the CDS sections of such a file, which are described in the FEATURES section, and adding to each the corresponding nucleotide sequence from the ORIGIN section, thus creating a new .txt file with a much simpler structure. The program must extract from the original file all CDS entries together with their descriptions and, by selective extraction from the ORIGIN section, append the corresponding nucleotide sequence to each, so that the new .txt file contains, in order, only the CDS descriptions followed by their nucleotide sequences.

I have written this code, but I don't know if it is any good. I think it needs more improvement, but I'm still stuck here.

#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl extract_cds.pl chromosome.gb
my $file_name = $ARGV[0] or die "Usage: $0 <GenBank file>\n";
print "Your file is '$file_name'\n";

# Read the whole file into memory.
open(my $in, '<', $file_name) or die "Sorry, we can't open your file $file_name: $!";
my @content = <$in>;
close $in;
chomp @content;

my @cds;             # one entry per CDS feature: location line plus qualifier lines
my $seq       = '';  # nucleotide sequence collected from the ORIGIN section
my $in_cds    = 0;
my $in_origin = 0;

foreach my $line (@content) {
    if ($in_origin) {
        last if $line =~ m{^//};    # end of record
        # Sequence lines look like "   61 acgt acgt ..."; keep only the letters.
        $seq .= join '', $line =~ /[ACGTURYKMSWBDHVN]/ig;
    }
    elsif ($line =~ /^ORIGIN/) {
        print "There is ORIGIN !!\n";
        $in_cds    = 0;
        $in_origin = 1;
    }
    elsif ($line =~ /^ {5}CDS\s+\S/) {
        # In a GenBank flat file the feature key starts in column 6.
        print "There is CDS\n";
        push @cds, [$line];
        $in_cds = 1;
    }
    elsif ($in_cds && $line =~ /^ {21}\S/) {
        # Qualifier and continuation lines are indented to column 22.
        push @{ $cds[-1] }, $line;
    }
    else {
        $in_cds = 0;    # any other line ends the current CDS block
    }
}

print "The sequence has ", length($seq), " nucleotides\n";
print "There are ", scalar(@cds), " CDS records\n";

# Write the CDS descriptions to the output file (overwritten on each run);
# fall back to STDOUT if the file cannot be opened.
my $file = "concatenare.txt";
my $succ = open(my $fh, '>', $file);
$fh = *STDOUT unless $succ;

my $n = 0;
foreach my $block (@cds) {
    $n++;
    print $fh "CDS$n\n";
    print $fh "$_\n" for @$block;
    # Placeholder: the matching range still has to be cut out of $seq.
    print $fh "Nucleotide sequence\n";
}

close $fh if $succ;    # don't close STDOUT

ADD REPLY • link 4.5 years ago by matache.razvan911 • 0

0

Could you please provide the expected output for the protein (XP_016881659.1) in the original post? So you are starting with a GenBank flat file as input?

ADD REPLY • link 4.5 years ago by vkkodali_ncbi ★ 3.8k

0

I'm working with a whole chromosome file as input, Chromosome 19. It contains many CDS tags in the FEATURES region. I listed that CDS as an example.

An example of the expected output would be something like this:

  CDS 110679..111596
     /gene="ENSG00000176695.8"
     /protein_id="ENSP00000467301.1"
     /note="transcript_id=ENST00000585993.3"
     /db_xref="CCDS:CCDS32854"
     /db_xref="Uniprot/SWISSPROT:Q8NGA8"
     /db_xref="RefSeq_peptide:NP_001005240"
     /db_xref="RefSeq_mRNA:NM_001005240"
     /db_xref="Uniprot/SPTREMBL:A0A126GWN0"
     /db_xref="UCSC:ENST00000585993.3"
     /db_xref="EMBL:AB065917"
     /db_xref="EMBL:BC136848"
     /db_xref="EMBL:BC136867"
     /db_xref="EMBL:KP290649"
     /db_xref="GO:0004888"
     /db_xref="GO:0004930"
     /db_xref="GO:0004930"
     /db_xref="GO:0004930"
     /db_xref="GO:0004984"
     /db_xref="GO:0004984"
     /db_xref="GO:0005886"
     /db_xref="GO:0005886"
     /db_xref="GO:0007165"
     /db_xref="GO:0007186"
     /db_xref="GO:0007186"
     /db_xref="GO:0007186"
     /db_xref="GO:0007186"
     /db_xref="GO:0007608"
     /db_xref="GO:0016020"
     /db_xref="GO:0016021"
     /db_xref="GO:0016021"
     /db_xref="GO:0016021"
     /db_xref="GO:0050896"
     /db_xref="GO:0050911"
     /db_xref="HGNC_trans_name:OR4F17-202"
     /db_xref="protein_id:AAI36849"
     /db_xref="protein_id:AAI36868"
     /db_xref="protein_id:ALI87807"
     /db_xref="protein_id:BAC06132"
     /db_xref="Reactome:R-HSA-162582"
     /db_xref="Reactome:R-HSA-372790"
     /db_xref="Reactome:R-HSA-381753"
     /db_xref="Reactome:R-HSA-388396"
     /db_xref="Reactome:R-HSA-418555"
     /db_xref="UniParc:UPI0000041E2A"
     /translation="MVTEFIFLGLSDSQGLQTFLFMLFFVFYGGIVFGNLLIVITVVS
     DSHLHSPMYFLLANLSLIDLSLSSVTAPKMITDFFSQRKVISFKGCLVQIFLLHFFGG
     SEMVILIAMGFDRYIAICKPLHYTTIMCGNACVGIMAVAWGIGFLHSVSQLAFAVHLP
     FCGPNEVDSFYCDLPRVIKLACTDTYRLDIMVIANSGVLTVCSFVLLIISYTIILMTI
     QHRPLDKSSKALSTLTAHITVVLLFFGPCVFIYAWPFPIKSLDKFLAVFYSVITPLLN
     PIIYTLRNKDMKTAIRQLRKWDAHSSVKF"
    110679..111596
    ATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGGACTCCAGACCTTCCTATTTATGTTGTTTTTTGTATTCTATGGAGGAAT CGTGTTTGGAAACCTTCTTATTGTCATAACAGTGGTATCTGACTCCCACCTTCACTCTCCCATGTACTTCCTGCTAGCCAACCTCTCACTCA
TTGATCTGTCTCTGTCTTCAGTCACAGCCCCCAAGATGATTACTGACTTTTTCAGCCAGCGCAAAGTCATCTCTTTCAAGGGCTGCCTTGT
TCAGATATTTCTCCTTCACTTCTTTGGTGGGAGTGAGATGGTGATCCTCATAGCCATGGGCTTTGACAGATATATAGCAATATGCAAACC  CCTACACTACACTACAATTATGTGTGGCAACGCATGTGTCGGCATTATGGCTGTCGCATGGGGAATTGGCTTTCTCCATTCGGTGAGCC
AGTTGGCCTTTGCCGTGCACTTACCCTTCTGTGGTCCCAATGAGGTCGATAGTTTTTATTGTGACCTTCCTAGGGTAATCAAACTTGCCTG   TACAGATACCTACAGGCTAGATATTATGGTCATTGCTAACAGTGGTGTGCTCACTGTGTGTTCTTTTGTTCTTCTAATCATCTCATACACT  ATCATCCTAATGACCATCCAGCATCGCCCTTTAGATAAGTCGTCCAAAGCTCTGTCCACTTTGACTGCTCACATTACAGTAGTTCTTTTGT TCTTTGGACCATGTGTCTTTATTTATGCCTGGCCATTCCCCATCAAGTCATTAGATAAATTCCTTGCTGTATTTTATTCTGTGATCACCCCT
CTCTTGAACCCAATTATATACACACTGAGGAACAAAGACATGAAGACGGCAATAAGACAGCTGAGAAAATGGGATGCACATTCTAGT TAAAGTTTTAG

ADD REPLY • link 4.5 years ago by matache.razvan911 • 0
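The step this thread leaves open, cutting the nucleotides for one CDS out of the ORIGIN sequence, could be sketched as follows. `extract_cds_seq` is a hypothetical name, and the sketch assumes that the ORIGIN section holds the full sequence the feature coordinates refer to, that coordinates are 1-based and inclusive, and that ranges are given in ascending order.

```perl
use strict;
use warnings;

# Hypothetical helper: given the full sequence, a strand ('+' or '-') and a list
# of [start, end] pairs, return the CDS nucleotides.
sub extract_cds_seq {
    my ($seq, $strand, @ranges) = @_;
    # Concatenate the pieces in file order (this covers join(...) locations).
    my $cds = join '',
        map { substr($seq, $_->[0] - 1, $_->[1] - $_->[0] + 1) } @ranges;
    if ($strand eq '-') {
        # complement(...): reverse the string, then swap each base for its partner.
        $cds = scalar reverse $cds;
        $cds =~ tr/ACGTacgt/TGCAtgca/;
    }
    return $cds;
}

my $toy = 'AAACCCGGGTTT';    # toy sequence, not real data
print extract_cds_seq($toy, '+', [4, 6]), "\n";             # prints "CCC"
print extract_cds_seq($toy, '-', [1, 6]), "\n";             # prints "GGGTTT"
print extract_cds_seq($toy, '+', [1, 3], [10, 12]), "\n";   # prints "AAATTT"
```

For a forward-strand CDS such as `110679..111596` in the expected output above, this amounts to a single `substr` call on the ORIGIN sequence; only `complement(...)` features need the reverse-complement step.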