The efficient way to generate fasta and bed files
2
0
Entering edit mode
6.7 years ago
elisheva ▴ 120

Hi everyone!
I have big fasta files (about 1-2 GB ) received by bedtools getfasta command each file has a header with information (chromosome, coordinates, and strand) and a sequence of 10 nucleotides.
I have to select only the sequences which have pyrimidines in 4-5 columns and generate fasta and bed files for them.
I wrote a python script for it, but the problem is it takes a lot of time. here is my code:

#This script generates new files which contain only pyrimidines.
from Bio import SeqIO
import sys
import traceback
import warnings
import argparse
PYRIMIDINES = ["TT","TC","CT","CC"]

def bedFormat(Id):
    """This function manipulates the fasta file header for getting appropriate bed file """
    chromosome = Id[Id.find(':') +2:Id.rfind(':')]
    start = Id[Id.rfind(':') +1:Id.find('-')]
    end = Id[Id.find('-') +1:Id.find('()')]
    strand = Id[Id.find('>') +1:Id.find(':')]
    lineBed = (chromosome,start,end,strand)
    line = "\t".join(lineBed) + "\n"
    return line


def writeFiles(path_to_directory,path_to_file,name):
""" This function gets fasta file and writes new filtered bed and fasta files according to the original file"""
    bed_file = (open(path_to_directory + "/filtered_bed/" + name + ".bed", "wb+")) #filtered bed file.
    fasta_file = (open(path_to_directory + "/filtered_fasta/" + name, "wb+")) #filtered fasta file.
    with open(path_to_file, mode='r') as handle:
        for record in SeqIO.parse(handle, 'fasta'):
            seq = record.seq.upper()
            if seq[3:5] in PYRIMIDINES: #If there are pyrimidines on the 4 and 5 columns. 
                bed_file.writebedFormatrecord.id))
                SeqIO.write(record, fasta_file, "fasta")
    bed_file.close()
    fasta_file.close()


def main():
    path_to_directory = sys.argv[1] #The directory path
    path_to_file = sys.argv[2] #The file's path
    name = sys.argv[3] #The file's name
    writeFiles(path_to_directory,path_to_file,name)

if __name__ == '__main__':
    main()

For example, if this is the input file:

>+::chrY:59351341-59351351()
tttcctagcc
>+::chrY:59352450-59352460()
agctttagta
>+::chrY:59352832-59352842()
ggttttaggg
>-::chrY:59358570-59358580()
GCTGAGAGCC
>-::chrY:59363409-59363419()
ggttaggggt

the desired output fasta file is:

>+::chrY:59351341-59351351()
tttcctagcc
>+::chrY:59352450-59352460()
agctttagta
>+::chrY:59352832-59352842()
ggttttaggg

without the sequences:

>-::chrY:59358570-59358580()
GCTGAGAGCC
>-::chrY:59363409-59363419()
ggttaggggt

which have GA or TA in 4,5 columns.

and the matching desired output bed file - contains only the sequences in the fasta file is:

chrY    59351341        59351351        +
chrY    59352450        59352460        +
chrY    59352832        59352842        +

Can anyone suggest a better solution for this goal?

python fasta bed • 2.6k views
ADD COMMENT
1
Entering edit mode

please, comment or validate your previous questions:

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

Upvote|Bookmark|Accept

ADD REPLY
0
Entering edit mode

show us an example of input and of desired output.

ADD REPLY
0
Entering edit mode

I don't understand why you want to convert a fasta to a bed file ? Fasta files contain only the primary DNA sequence not genomic coordinates ? Could you explain a little bit more ? Are the genomic informations in the fasta header ?

ADD REPLY
0
Entering edit mode

what's the difference between input.fasta and output.fasta ? you're not explaining us what you're trying to do, we're not medium.

ADD REPLY
3
Entering edit mode
6.7 years ago

assuming two lines per fata record:

cat input.fa  |\
paste - -  |\
awk 'toupper(substr($2,4,2)) ~ /^(TT|TC|CT|CC)$/' |\
tr "\t" "\n" |\
tee output.fa |\
awk -F '[>(:-]' '/^>/ {printf("%s\t%s\t%s\t%s\n",$4,$5,$6,$2);}'


chrY    59351341    59351351    +
chrY    59352450    59352460    +
chrY    59352832    59352842    +

$ cat output.fa

>+::chrY:59351341-59351351()
tttcctagcc
>+::chrY:59352450-59352460()
agctttagta
>+::chrY:59352832-59352842()
ggttttaggg
ADD COMMENT
0
Entering edit mode
6.7 years ago

As said in comment, it's hard to unsterstand what you want to achieve, but given your examples :

I would say that your writeFiles is well wrote, except this line bed_file.writebedFormatrecord.id)) and the path finding

I assume that you want to pass your record.id to your bedFormat function, then to write the result in a file :

from Bio import SeqIO
import sys
import traceback
import warnings
import argparse
import re
PYRIMIDINES = ["TT","TC","CT","CC"]

def bedFormat(Id):
    """This function manipulates the fasta file header for getting appropriate bed file """
    splitted_line = re.split(':|-',Id[:-2])
    splitted_line.remove('')
    return "\t".join([splitted_line[i] for i in [1,2,3,0]]) + "\n"

def writeFiles(path_to_directory,path_to_file,name):
    """ This function gets fasta file and writes new filtered bed and fasta files according to the original file"""
    bed_file = open(path_to_directory + "/filtered_bed/" + name + ".bed", "a") #filtered bed file.
    fasta_file = open(path_to_directory + "/filtered_fasta/" + name + ".fasta", "wb+") #filtered fasta file.

    with open(path_to_file+"/"+name+ ".fasta", mode='r') as handle:

        for record in SeqIO.parse(handle, 'fasta'):
            seq = record.seq.upper()

            if seq[3:5] in PYRIMIDINES: #If there are pyrimidines on the 4 and 5 columns. 
                bed_line = bedFormat( record.id) ###There is no space between "(" and "r" but biostars eat my "(" without adding a space
                bed_file.write(bed_line)
                SeqIO.write(record, fasta_file, "fasta")
    bed_file.close()
    fasta_file.close()

def main():
    path_to_directory = sys.argv[1] #The directory path
    path_to_file = sys.argv[2] #The file's path
    name = sys.argv[3] #The file's name
    writeFiles(path_to_directory,path_to_file,name)

if __name__ == '__main__':
    main()

I guess that all these rfind and find take to much time. Do you win time with my bedFormat function ?

How many reads do you have ?

ADD COMMENT
0
Entering edit mode

Thank a lot for the solution, but for some reason, it took even much time

ADD REPLY
0
Entering edit mode

what is the size of your input file ?

ADD REPLY
0
Entering edit mode

Try to supply more informations about your input files and try to investigate the running time of each function in the script above.

ADD REPLY

Login before adding your answer.

Traffic: 1665 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6