Hi,
I have some sequence and want to split the sequence by the N base using python or biopython, but I have not good idea how to split it. The sequence like this:
#!/usr/bin/python2.7
from Bio import SeqIO
import getopt,sys,re
def usage():
print "Usage: split.py -i <input_scaffold_fasta> -o <output_contig_fasta>"
try:
options, remainder=getopt.getopt(sys.argv[1:], 'i:o:h')
except getopt.GetoptError as err:
print str(err)
usage()
sys.exit()
for opt, arg in options:
if opt in ('-i'):
input_file=arg
if opt in ('-h'):
usage()
sys.exit()
elif opt in ('-o'):
output_file=arg
out=open(output_file, 'w')
sequence = ''.join([str(record.seq).strip() for record in SeqIO.parse(input_file, "fasta")])
m=re.sub('[nN]+','\n',sequence).split('\n')
for i in range(0,len(m)):
out.write('>contig_'+str(i)+'\n')
out.write(m[i]+'\n')
Remember, you can accept multiple answers, as long as they provide valid solution to your top level problem. Accepting an answer is your single most important contribution to the community, as it helps others that face the same problem you did.
Use bedtools to get compliment of Ns and bedtools getfasta to get fasta for the compliments (The genome file can be generated with samtools faidx and cut )
samtools faidx genome.fa; cut -f 1,2 genome.fa.fai > genome.genome
What do you want to happen when you reach 100 Ns in a row?
Sorry,I have forget it. if there exist many N base , we also want to split it util there have 'A','T','G'or 'C' base