How to know that required sequence has been extracted
2
0
Entering edit mode
7.1 years ago

Hi

I am new to python. I have a multi-fasta file containing , 21 merged fasta file of Mycosphaerella graminicola chr1-21 ,whole genome shotgun sequence.I have a python code that takes Id as input(>ENA|CM001216|CM001216.1) and extract respective sequence.

My problem is, I am not sure how to make sure that right sequences is being extracted. On linux , console, the ouput prints something like below, the huge chunk of sequence without any track of starting with Ids

ACATATTGCAGAGCATGGAGTTCTCCCGGTCTACAGTTAAGACCAGATCCGCTACATCGC
ATAGGAAAGACTCCTCGCTAGCAGCACTAGTGCACCGGACGATGACCGACCGGTCCTATC
GCTCGAGCGGTCCTATCTAACACACCAGCAGCGGCCCTTTACAATAACGCGCCGGTAGGA
CATTGATGCCCTATTCCTACCACTAACCTCACTTGCAGCCTATCGGCATAGCATCTAGTT
CTGTTACCAGCTACGGTTTACAAAGACTCTTTCTAGCAACTGGCACACACCTATCGGCCG
GTTAGGCATCCGCCCAACGATATGCAAGCACTTCCGGCTCGCCCGTAGCTTTAGTGCTAC
TAGCTTCGATACCTACATCTTCTTCCCAAACCTACCTTTGCTATCTACAAAGCGGTAACG
AGCGGTCGACAAACAGCGCTACTGCTACCTTAACCGGATCTAGCAAGCGCATTTTATTAA
CGAAATCTTTATACCTACCCTCGAAGACACCTACTAAGACGACATTCTCCAGCACCATCC
TCGCACTTTCGAACAGGCGTATAACCGGTCCTACGCTCGCTAGCGGCAGTCGATTCGGTC
GACCAACCGGTCGTTTAGCGAGCTATTCTACGACCTTCCAAATACATGGCCGAGCGATCA
GGCCCGGCGTCCTAGGGTACCACTGCTCGCGCGTCTCTAGACCCGTATTTGTGAACGCCT
AGCAGACGACGAAGAGGACTAGCAGGAAGCGGAACTGATTACAGTACGCTACGGCTACAA
GCTCGAGCTCCGTAAGACGTCTGTCGCCAGCCTTCGGCAAGTGGTCGACCGGTCCCTTAA
ACGACAATTCGACAAGGAGTACATCGACTTCGACACTGCCTTCGTAGATCTTGCGTTTAA
GGACCTAGCCCACCAACCTACAAACCGGTCGAGGGTTTATCGAGAACCCCTCGTTCTACT
CCGCTGTACGAACTGCAACGAATACCAAGCCAGCTAGCTTAATATAAAGCCAGAGCTCTA
CTAGTGGAGTAGG
sehlly@ShellysPC:~/Documents/Mycosphaerella_sequence1$

My code is:

#!/usr/bin/python
import re
import sys

Found = False
InputID = raw_input("Input Id :")

with open ('Mgraminicola.fasta','r') as f:
              seqs = f.readlines()

for seq in seqs:
          if re.search (InputID,seq):
                  Found = True
                  sys.stdout.write(seq)
                    continue
if Found == True:
                    if re.search ('>ENA|CM00',seq):      
                              Found = False
                    if Found:
                               sys.stdout.write(seq)

Any help would be highly appreciated.

Thanks

sequence • 1.5k views
ADD COMMENT
1
Entering edit mode

Have a look at: http://biopython.org/wiki/SeqIO

Also, you don't need to separate each code line with an empty line!

ADD REPLY
1
Entering edit mode
7.1 years ago

Bingo, I have my own code in Python


Usage

fetch_sequence_by_id.py -i your_fasta_file.fa -f file_containing_sequence_ids.txt -o output_fasta_file.fa

Program

from Bio import SeqIO
import getopt, sys, re
id_list=[]
sequence_list={}

def usage():
    print "Usage: fetch_sequence_by_id.py -i <input_fasta> -f <file_containing_sequence_ids> -o <output_fasta_file>"


try:
    options, remainder=getopt.getopt(sys.argv[1:], 'i:o:f:h')

except getopt.GetoptError as err:
    print str(err)
    usage()
    sys.exit()

for opt, arg in options:
    if opt in ('-i'):
        input_file=arg
    if opt in ('-f'):
        id_file=arg
    if opt in ('-h'):
        usage()
    sys.exit()
    elif opt in ('-o'):
        output_file=arg

out_fasta=open(output_file, 'w')
out_id_file=open(id_file)

for ids in out_id_file:
    ids=ids.rstrip()
    id_list.append(ids)

for record in SeqIO.parse(input_file, "fasta"):
    sequence_list[record.id]=record.seq


for ids in id_list:
    for record in SeqIO.parse(input_file, "fasta"):
        if ids in strrecord.id):
            out_fasta.write(">" + ids +  "\n")
            out_fasta.write(str(record.seq) + "\n")

out_fasta.close()
ADD COMMENT
1
Entering edit mode

Ultra fast solution using seqkit

cat Your_fasta.fa | seqkit grep -f your_ids.txt > required_fasta.fa
ADD REPLY
2
Entering edit mode

Ultra-fast solution has to be faSomeRecords from Jim Kent.

ADD REPLY
1
Entering edit mode

Thanks for updating me. I dint know that

ADD REPLY
1
Entering edit mode
ADD COMMENT

Login before adding your answer.

Traffic: 2857 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6