Hi everyone!
I have big fasta files (about 1-2 GB ) received by bedtools getfasta command
each file has a header with information (chromosome, coordinates, and strand) and a sequence of 10 nucleotides.
I have to select only the sequences which have pyrimidines in 4-5 columns and generate fasta and bed files for them.
I wrote a python script for it, but the problem is it takes a lot of time.
here is my code:
#This script generates new files which contain only pyrimidines.
from Bio import SeqIO
import sys
import traceback
import warnings
import argparse
PYRIMIDINES = ["TT","TC","CT","CC"]
def bedFormat(Id):
"""This function manipulates the fasta file header for getting appropriate bed file """
chromosome = Id[Id.find(':') +2:Id.rfind(':')]
start = Id[Id.rfind(':') +1:Id.find('-')]
end = Id[Id.find('-') +1:Id.find('()')]
strand = Id[Id.find('>') +1:Id.find(':')]
lineBed = (chromosome,start,end,strand)
line = "\t".join(lineBed) + "\n"
return line
def writeFiles(path_to_directory,path_to_file,name):
""" This function gets fasta file and writes new filtered bed and fasta files according to the original file"""
bed_file = (open(path_to_directory + "/filtered_bed/" + name + ".bed", "wb+")) #filtered bed file.
fasta_file = (open(path_to_directory + "/filtered_fasta/" + name, "wb+")) #filtered fasta file.
with open(path_to_file, mode='r') as handle:
for record in SeqIO.parse(handle, 'fasta'):
seq = record.seq.upper()
if seq[3:5] in PYRIMIDINES: #If there are pyrimidines on the 4 and 5 columns.
bed_file.writebedFormatrecord.id))
SeqIO.write(record, fasta_file, "fasta")
bed_file.close()
fasta_file.close()
def main():
path_to_directory = sys.argv[1] #The directory path
path_to_file = sys.argv[2] #The file's path
name = sys.argv[3] #The file's name
writeFiles(path_to_directory,path_to_file,name)
if __name__ == '__main__':
main()
For example, if this is the input file:
>+::chrY:59351341-59351351()
tttcctagcc
>+::chrY:59352450-59352460()
agctttagta
>+::chrY:59352832-59352842()
ggttttaggg
>-::chrY:59358570-59358580()
GCTGAGAGCC
>-::chrY:59363409-59363419()
ggttaggggt
the desired output fasta file is:
>+::chrY:59351341-59351351()
tttcctagcc
>+::chrY:59352450-59352460()
agctttagta
>+::chrY:59352832-59352842()
ggttttaggg
without the sequences:
>-::chrY:59358570-59358580()
GCTGAGAGCC
>-::chrY:59363409-59363419()
ggttaggggt
which have GA or TA in 4,5 columns.
and the matching desired output bed file - contains only the sequences in the fasta file is:
chrY 59351341 59351351 +
chrY 59352450 59352460 +
chrY 59352832 59352842 +
Can anyone suggest a better solution for this goal?
please, comment or validate your previous questions:
If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
show us an example of input and of desired output.
I don't understand why you want to convert a fasta to a bed file ? Fasta files contain only the primary DNA sequence not genomic coordinates ? Could you explain a little bit more ? Are the genomic informations in the fasta header ?
what's the difference between input.fasta and output.fasta ? you're not explaining us what you're trying to do, we're not medium.