how to use list of gene id to get cds sequence(cds fasta file have many annotation, only gene id: is same to query id)
1
0
Entering edit mode
4.6 years ago
LZH289 • 0

hello, i have a question when i want to extract cds sequence using gene id. but cds file is not just start with >gene is, it has many other annotation. the only same is star with gene:

cds fasta:

>Zm002 cds gene:Zm1d035916 gene_biotype:protein_coding 
ATCGGCAT
>Zm001 cds RefGen_v4:9:153880862:153883850:-1 gene:Zm1d048 gene_biotype:protein_coding
ATGCGGCA

gene_list

Zm1d035916
Zm1d048

how to get result like

>Zm1d035916
ATCGGCAT
>Zm1d048 
ATGCGGCA
sequence gene • 1.7k views
ADD COMMENT
0
Entering edit mode

I think biopython might help : http://biopython.org/DIST/docs/tutorial/Tutorial.html Refer to section 2.4.1 :

from Bio import SeqIO

for seq_record in SeqIO.parse("file_name.fasta", "fasta"):
    printseq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

You should get something like this on your screen:

gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740

I think it is very straightforward, once you get the seq_record.id then you can slice the specific substring from the seq_record.id by using str.find (for example) in python

ADD REPLY
2
Entering edit mode
4.6 years ago
JC 13k

Nothing that a Perl script can do:

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use getSeqs.pl <File with IDs> <Input Fasta> <Output Fasta>\n";
my $list_file = shift @ARGV;
my $fasta_in = shift @ARGV;
my $fasta_out = shift @ARGV;

my %sel;
open (my $lh, "<", $list_file) or die;
while (<$lh>) {
    chomp;
    $sel{$_}++;
}
close $lh;

$/ = "\n>";
open (my $ih, "<", $fasta_in) or die;
open (my $oh, ">", $fasta_out) or die;
while (<$ih>) {
    s/>//g;
    my ($id_line, @seq) = split (/\n/, $_);
    if ($id_line =~ /gene:(\w+)/) {
        my $id = $1;
        if (defined $sel{$id}) {
            print $oh ">$id\n";
            print $oh join "\n", @seq;
            print $oh "\n";
        }
    }
}
close $ih;
close $oh;
ADD COMMENT
0
Entering edit mode

thank you for your reply. when I use perl in terminal, and run perl cds_sequence.pl gene_list.txt cds.fa cds_extract.fa it shows error like this: Can't use global $/ in "my" at cds_sequence.pl line 19, near "my $/ " Execution of cds_sequence.pl aborted due to compilation errors. how to solve this problem?

ADD REPLY
0
Entering edit mode

removing the "my" before the "$/", I edited the code

ADD REPLY
0
Entering edit mode

it's work. Thank you!

ADD REPLY
0
Entering edit mode

can u please tell me how to run this code, means where I should mention my files in this code, I am new in programming.

ADD REPLY

Login before adding your answer.

Traffic: 2123 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6