Question

Extraction Of Fasta Sequence By Nucleotide Position

2

Entering edit mode

11.6 years ago

rosarylimyt ▴ 70

Hi all,

I'm relatively new in Python and I need help in extraction of FASTA sequences using separate lists of start and end coordinates. Here are my codes below:

import sys
import os
from Bio import SeqIO

infile1 = sys.argv[1] # your original FASTA file to extract sequences from                                                                                                                               
fasta_file = open(infile1)

### To extract your set of start and end coordinates                                                                                                                                                       
start_coordinates = os.system("cat indices_trial | awk '{print $3}'")
end_coordinates = os.system("cat indices_trial | awk '{print $5}'")

results_file = raw_input('Enter a new filename to save your results:')
outfile = open(results_file, 'w')

### Now to tell Python to extract those sequences whose coordinates were not specified in 'start_coordinates' & 'end_coordinates' list                                                                     
for line in SeqIO.parse(fasta_file, "fasta"):
    if line.seq != line.seq["%i-1:%i " % start_coordinates, end_coordinates]:
        print 'Unpredicted gene:' , line.seq , ',' , 'Sequence length =' , len(line.seq)
    else:
        print 'Not found.'

Can some python programmer out there help me?

Rosary

fasta python nucleotide position • 8.5k views

ADD COMMENT • link updated 10.8 years ago by Biostar 20 • written 11.6 years ago by rosarylimyt ▴ 70

0

Entering edit mode

if it were any help, here's the execution and the error message that follows:

1 5

4 10

Enter a new filename to save your results:trial_results Traceback (most recent call last): File "extract_UNpredicted_genes.py", line 17, in <module> if line.seq != line.seq["%i:%i + 1" % start_coordinates, end_coordinates]: TypeError: not enough arguments for format string

ADD REPLY • link 11.6 years ago by rosarylimyt ▴ 70

0

Entering edit mode

Hi, please show us a few sample lines of infile1.

ADD REPLY • link 11.6 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Hi,

An example of infile1 would be;

Scaffold2 TAAAATGTGAGTTTTTAAATTACTGCGTGCCTCTGGGTAATAATCTCCCCAAAGTAATTC CGATAAATCTTCATAGCTAACCGTTTTATCATGGTTTTTCATGAGCATATAAATAATGCG ACCTTCACTTACGGTGAGATTTACCAGTGTGCCATTTATCCATAGTTCTCTAATTGATGA ACCAAAATGCATTTTACCACAAGTGATGGATAGGTCCTCACCGTTTATTTGCTTTTTAGT TAATAATCGCCTGACTCTGGCTAGTAATTCCATTTGGCGGAACGGTTTAATAATATAATC

followed by scaffold 3,4,5 and so on, they are basically fasta nucleotide sequences.

Thanks!

ADD REPLY • link 11.6 years ago by rosarylimyt ▴ 70

score 3 · Answer 1 · 2013-05-10

The error message is telling you that the %-style string formatting isn't working. I think if you wrap your coordinates in parentheses you won't get that error anymore.

BUT...

If I can grok what you are trying to do you'll get a bunch of new errors instead!

The % synatx is for creating strings with the arguements inserted. But you want to use the integers themselves to slice your sequence line.seq[start_coordinates:end_coordinates] would do the trick (be aware python uses 0-based indices). Also, if you are just learning python it's porbably a good idea to learn the newer .format()-style string formatting.

It's unusual to all awk to read your indices. You can open a file with open(filename), move through the lines with a for loop , and strip a tab-delimited line into it's elements with line.strip('\t'). Once you have a list of indices you'll need to iterate through the list to get each combination (not, as I think you are trying to do now, use the whole list to do your squence-slices).

Generally, the best way to get to grips with Python and Biopython is (a) read a good tutorial (b) open an interactive prompt and play around a bit. Once you are conformatable with slicing sequences and handling file I/O and the like you'll find building your script much easier.

score 2 · Answer 2 · 2013-05-10

A separate word of caution unrelated to the script itself (and apologies if you know this already - I just thought it was important to mention): some file formats have 0-based coordinates, while others have 1-based ones, and it's pretty random which file format uses which (e.g. BED files are 0-based while GGF files are 1-based). So in order to extract the correct coordinates, it's important to know which format the coordinates from your input file are in. I quote the following nice explalation from the beginning of the SAM file specifications (http://samtools.sourceforge.net/SAM1.pdf):

1-based coordinate system: A coordinate system where the first base of a sequence is one. In this coordinate system, a region is specified by a closed interval. For example, the region between the 3rd and the 7th bases inclusive is [3, 7]. The SAM, GFF and Wiggle formats are using the 1-based coordinate system.

0-based coordinate system: A coordinate system where the first base of a sequence is zero. In this coordinate system, a region is specified by a half-closed-half-open interval. For example, the region between the 3rd and the 7th bases inclusive is [2, 7). The BAM, BED, and PSL formats are using the 0-based coordinate system.

score 2 · Answer 3 · 2013-05-10

2

Entering edit mode

11.6 years ago

Fabio Marroni ★ 3.0k

I suggest you use pyfasta. Follow the link, it also has some very simple examples. It's the first python command I used and it was really easy. https://pypi.python.org/pypi/pyfasta/

ADD COMMENT • link 11.6 years ago by Fabio Marroni ★ 3.0k