Question

Script for concatenating (multiple fasta sequences in a single file), creating reverse complement and counting the length of final sequence

0

Entering edit mode

8.1 years ago

adeena_hassan ▴ 50

Assalam-o-alaikum everyone,

I have downloaded whole genome of some vertebrates from NCBI and then then fetched coordinates and sequence of genes and CDS from whole genome subsequently.

Now I have CDS file like below:

$ cat input file:

>chr16:7161867-7161955
TTTAGTTTCCCCATCATCCCAGAAAAGTTCGCCTTTTGCTTCTTTGTTTTCATCCAGGGCAATGATAAGACCAAGGGGGTT
>chr16:7161148-7161197
TGGGTGACAGAAAACTCACATAAGAGATAAACTTGATTGGCCACAGTAT
>chr16:7160414-7160581
TGTAGGTTAGAATCATAAGTGACATTAGGAGAAGTTTGACTTGGGATACCATTGTGTTTCACTATAACATTGGTAGGTTCC
>chr16:7157321-7157473
TCCCAGGCACAGCCACGGGCAGTACAGTTTGCTGCNGAAGCACCGCTCTCATCAGGAAAACAGTCTATTTTCTCTTCATC

I want to concatenate these parts of CDS to a single sequence then making its reverse complement and counting length of the final sequence. P.S all parts of CDS in a single file

Expected output:

>Dog
TTTAGTTTCCCCATCATCCCAGAAAAGTTCGCCTTTTGCTTCTTTGTTTTCATCCAGGGCAATGATAAGACCAAGGGGGTTTCACATAAGAGATAAACTTGATTGGCCACAGTATTGTAGGTTAGAATCATAAGTGACATTAGGAGAAGTTTGACTTGGGAATAACATTGGTAGGTTCCTCCCAGGCACAGCCACGGGCAGTACAGTTTGCTGCNGAAGCACCGCTCTCATCAGGAAAACAGTCTATTTTCTCTTC

I have tried following script but its not give my desired output (its not give concatenating sequence against a single fasta headers).

open my $fh,"<","filename.text" or die"error opening $!";

$/ = ">";

<$fh>;

while (<$fh>)
{
    my ($header,@ar) = split("\n",$_);

    my $entry =join("\n",@ar);

    $entry = reverse $entry;

    $entry =~ tr/ACGUacgu/UGCAugca/;

    print ">$header\n$entry\n\n";
}

Any other way to do so ??

reverse-complement concatenate • 4.9k views

ADD COMMENT • link updated 8.1 years ago by Brian Bushnell 20k • written 8.1 years ago by adeena_hassan ▴ 50

2

Entering edit mode

little late but I want to show you this on Perl:

#!/usr/bin/perl

use strict;
use warnings;

open my $fh,"<","filename.text" or die"error opening $!";

$/ = "\n>"; # better split by the new line and ">"

my $seq = '';

while (<$fh>)
{
    s/>//g;
    my ($header, @seq) = split("\n", $_);
    my $entry = reverse(join("", @seq));
    $entry =~ tr/ACGUacgu/UGCAugca/;
    $seq .= $entry; # concat the new sequence
}

print ">DOG\n$seq\n";

ADD REPLY • link 8.1 years ago by JC 13k

0

Entering edit mode

8.1 years ago

Brian Bushnell 20k

With the BBMap package:

fuse.sh in=input.fa out=x.fa
reformat.sh in=x.fa out=result.fa rcomp

That will also print the number of bases to the screen.

ADD COMMENT • link 8.1 years ago by Brian Bushnell 20k

score 6 · Accepted Answer · 2017-09-20

6

Entering edit mode

8.1 years ago

st.ph.n ★ 2.7k

#!/usr/bin/env python

import sys
from Bio.Seq import Seq

with open(sys.argv[1], 'r') as f:
    seq = ''
    for line in f:
        if not line.startswith(">"):
            seq += line.strip()

rc = str(Seq(seq).reverse_complement())

print '>Dog ' + str(len(rc)) + '\n'  + rc

Output (single-line seq):

>Dog 291
GATGAAGAGAAAATAGACTGTTTTCCTGATGAGAGCGGTGCTTCNGCAGCAAACTGTACTGCCCGTGGCTGTGCCTGGGAGGAACCTACCAATGTTATAGTGAAACACAATGGTATCCCAAGTCAAACTTCTCCTAATGTCACTTATGATTCTAACCTACAATACTGTGGCCAATCAAGTTTATCTCTTATGTGAGTTTTCTGTCACCCAAACCCCCTTGGTCTTATCATTGCCCTGGATGAAAACAAAGAAGCAAAAGGCGAACTTTTCTGGGATGATGGGGAAACTAAA

ADD COMMENT • link 8.1 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Hi st.ph.n, I have used your code but it gives following error.

Error:

Traceback (most recent call last):
  File "./reverse_complement.py", line 6, in <module>
    with open(sys.argv[1], 'r') as f:Dog_MGAM_CDS.fa
IndexError: list index out of range

IndentationError: expected an indented block

I don't know much about python could you tell me what to do with it.

ADD REPLY • link 8.1 years ago by adeena_hassan ▴ 50

1

Entering edit mode

You need to provide proper command line arguments.

Python code.py fasta_file

ADD REPLY • link 8.1 years ago by Renesh ★ 2.2k

0

Entering edit mode

Thank u so much its working :)

ADD REPLY • link 8.1 years ago by adeena_hassan ▴ 50

0

Entering edit mode

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

ADD REPLY • link 8.1 years ago by WouterDeCoster 48k

1

Entering edit mode

I forgot to mention, sys.argv[1] is the name of the input file. So, if you saved the script as reverse_complement.py, run as python reverse_complement.py Dog_MGAM_CDS.fa

ADD REPLY • link 8.1 years ago by st.ph.n ★ 2.7k

score 2 · Accepted Answer · 2017-09-20

2

Entering edit mode

8.1 years ago

Renesh ★ 2.2k

You can use python,

from Bio import SeqIO
from Bio.Seq import Seq
import argparse

parser = argparse.ArgumentParser(description="Concatenate multiple sequences")
parser.add_argument('-f', action='store', dest='fasta_file', help='Input fasta file')
result = parser.parse_args()
con_seq = ""

for rec in SeqIO.parse(result.fasta_file, "fasta"):
    con_seq += rec.seq

rc_con_seq = Seq(str(con_seq)).reverse_complement()
print rc_con_seq, len(rc_con_seq)

Save above code in code.py file and run as,

python code.py -f fasta_file

Note: You need to have biopython for running this code

ADD COMMENT • link 8.1 years ago by Renesh ★ 2.2k

1

Entering edit mode

Note: there is no need for manual opening of the fasta file, you can just pass the filename to SeqIO.parse()