4.5 years ago
Hello, I have to extract the CDS regions from a chromosome file into a second text file. How can I extract all of them? I've read that these regions are treated as tags, and their join and complement operators as secondary tags... I also have to go through the ORIGIN section, pick out the nucleotide string that matches each CDS, and write it to the second text file below that CDS. Can somebody help me with this task? I'm a newbie in Perl.
CDS             complement(9413275..9414234)
/gene="ZNF266"
/gene_synonym="HZF1"
/note="Derived by automated computational analysis using
gene prediction method: Gnomon."
/codon_start=1
/product="zinc finger protein 266 isoform X3"
/protein_id="XP_016881659.1"
/db_xref="GeneID:10781"
/db_xref="HGNC:HGNC:13059"
/db_xref="MIM:604751"
/translation="MGTHTGDNPYECKECGKAFTRSCQLTQHRKTHTGEKPYKCKDCG
RAFTVSSCLSQHMKIHVGEKPYECKECGIAFTRSSQLTEHLKTHTAKDPFECKICGKS
FRNSSCLSDHFRIHTGIKPYKCKDCGKAFTQNSDLTKHARTHSGERPYECKECGKAFA
RSSRLSEHTRTHTGEKPFECVKCGKAFAISSNLSGHLRIHTGEKPFECLECGKAFTHS
SSLNNHMRTHSAKKPFTCMECGKAFKFPTCVNLHMRIHTGEKPYKCKQCGKSFSYSNS
FQLHERTHTGEKPYECKECGKAFSSSSSFRNHERRHADERLSA"
If I understand correctly, you are looking for CDS ranges. If that is all you need, you don't have to parse the GenBank flatfiles. You can get that information from GFF3 or GTF files. It appears that you are interested in human annotation. You can download the GFF3 file for the latest annotation from this FTP path: ftp://ftp.ncbi.nlm.nih.gov//genomes/all/annotation_releases/9606/109.20200228/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz
Then, for example, you can extract the CDS range for XP_016881659.1 using a simple grep; you are likely interested in columns 1, 4, 5, and 7.
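As an alternative to grep, the same filter can be sketched in a few lines of Python (chosen here only for illustration). The function name `cds_ranges` and the sample row are hypothetical, and the sketch assumes that CDS rows carry a `protein_id=` attribute in column 9, which is worth checking against the downloaded file.

```python
def cds_ranges(gff_lines, protein_id):
    """Yield (seqid, start, end, strand) for CDS rows of one protein.

    Columns are 1-based in the GFF3 layout: 1 = seqid, 4 = start,
    5 = end, 7 = strand, 9 = attributes.
    """
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        # Attributes are "key=value" pairs separated by ";"
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if attrs.get("protein_id") == protein_id:
            yield cols[0], int(cols[3]), int(cols[4]), cols[6]


# Made-up sample row for illustration (seqid is a placeholder,
# coordinates are taken from the CDS shown in the question).
sample = [
    "##gff-version 3\n",
    "chr19_example\tGnomon\tCDS\t9413275\t9414234\t.\t-\t0\t"
    "ID=cds-XP_016881659.1;gene=ZNF266;protein_id=XP_016881659.1\n",
]
for row in cds_ranges(sample, "XP_016881659.1"):
    print(*row, sep="\t")
```

For the real file, the lines could be read with `gzip.open(path, "rt")` and passed to the same function.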
That is not quite it; this is my task: You must design a program capable of extracting only the CDS sections of such a file, which are described in the FEATURES section, and adding to each one its corresponding nucleotide sequence from the ORIGIN section, thus creating a new .txt file with a much simpler structure. The program must extract from the original file every CDS entry with its description and append, by selective extraction from the ORIGIN section, the corresponding nucleotide sequence. The resulting .txt file contains, in order, only the CDS descriptions, each followed by its nucleotide sequence.
I have written this code, but I don't know if it is really good. I think it needs more improvement, but I'm still stuck here.
Could you please provide the expected output for the protein (XP_016881659.1) in the original post? So you are starting with a GenBank flat file as input?
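For the task as described (each CDS description followed by its nucleotide sequence from ORIGIN), one possible approach is sketched below in Python rather than Perl. It is a minimal sketch, not a definitive implementation, and it rests on several assumptions: the input uses the standard GenBank flat-file layout (feature keys at column 6, locations and qualifiers at column 22), the ORIGIN section is present, and locations are plain ranges optionally wrapped in `join(...)` and/or one outer `complement(...)`. Single-base positions and `complement` nested inside `join` are not handled. All function names are hypothetical.

```python
import re

# Translation table for complementing lowercase/uppercase nucleotides
COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")


def parse_location(loc):
    """Return (is_complement, [(start, end), ...]), 1-based inclusive."""
    is_complement = loc.startswith("complement(")
    spans = [(int(a), int(b)) for a, b in re.findall(r"<?(\d+)\.\.>?(\d+)", loc)]
    return is_complement, spans


def read_origin(lines):
    """Concatenate sequence lines between ORIGIN and //, dropping digits and spaces."""
    seq, in_origin = [], False
    for line in lines:
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            seq.append(re.sub(r"[^a-zA-Z]", "", line))
    return "".join(seq)


def cds_blocks(lines):
    """Yield each CDS feature as a list of lines (location plus qualifiers)."""
    block = None
    for line in lines:
        if re.match(r" {5}CDS\s", line):      # a new CDS feature starts
            if block:
                yield block
            block = [line.rstrip()]
        elif block is not None:
            if re.match(r" {21}\S", line):    # qualifier or wrapped location
                block.append(line.rstrip())
            else:                             # another feature or section begins
                yield block
                block = None
    if block:
        yield block


def block_location(block):
    """Join the location text, which may wrap over several lines."""
    parts = [block[0].split(None, 1)[1]]
    for line in block[1:]:
        text = line.strip()
        if text.startswith("/"):
            break
        parts.append(text)
    return "".join(parts)


def cds_sequence(loc, genome):
    """Cut the spans out of the genome; reverse-complement if requested."""
    is_complement, spans = parse_location(loc)
    seq = "".join(genome[start - 1:end] for start, end in spans)
    return seq.translate(COMPLEMENT)[::-1] if is_complement else seq


def extract_cds(genbank_text):
    """Return the simplified text: each CDS block followed by its sequence."""
    lines = genbank_text.splitlines()
    genome = read_origin(lines)
    out = []
    for block in cds_blocks(lines):
        out.extend(block)
        out.append(cds_sequence(block_location(block), genome))
        out.append("")
    return "\n".join(out)
```

The result of `extract_cds(open(path).read())` could then be written to the new .txt file. The whole ORIGIN sequence is held in memory as one string, which should be manageable for a single chromosome. A dedicated parser such as BioPerl or Biopython would cover the location syntax more completely.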
I'm working with a whole chromosome file as input, Chromosome 19. It contains many CDS tags in the FEATURES section. I've listed that CDS as an example.
An example of the expected output would be something like this: