Question

The efficient way to generate fasta and bed files

0

Entering edit mode

6.7 years ago

elisheva ▴ 120

Hi everyone!
I have big fasta files (about 1-2 GB ) received by bedtools getfasta command each file has a header with information (chromosome, coordinates, and strand) and a sequence of 10 nucleotides.
I have to select only the sequences which have pyrimidines in 4-5 columns and generate fasta and bed files for them.
I wrote a python script for it, but the problem is it takes a lot of time. here is my code:

#This script generates new files which contain only pyrimidines.
from Bio import SeqIO
import sys
import traceback
import warnings
import argparse
PYRIMIDINES = ["TT","TC","CT","CC"]

def bedFormat(Id):
    """This function manipulates the fasta file header for getting appropriate bed file """
    chromosome = Id[Id.find(':') +2:Id.rfind(':')]
    start = Id[Id.rfind(':') +1:Id.find('-')]
    end = Id[Id.find('-') +1:Id.find('()')]
    strand = Id[Id.find('>') +1:Id.find(':')]
    lineBed = (chromosome,start,end,strand)
    line = "\t".join(lineBed) + "\n"
    return line


def writeFiles(path_to_directory,path_to_file,name):
""" This function gets fasta file and writes new filtered bed and fasta files according to the original file"""
    bed_file = (open(path_to_directory + "/filtered_bed/" + name + ".bed", "wb+")) #filtered bed file.
    fasta_file = (open(path_to_directory + "/filtered_fasta/" + name, "wb+")) #filtered fasta file.
    with open(path_to_file, mode='r') as handle:
        for record in SeqIO.parse(handle, 'fasta'):
            seq = record.seq.upper()
            if seq[3:5] in PYRIMIDINES: #If there are pyrimidines on the 4 and 5 columns. 
                bed_file.writebedFormatrecord.id))
                SeqIO.write(record, fasta_file, "fasta")
    bed_file.close()
    fasta_file.close()


def main():
    path_to_directory = sys.argv[1] #The directory path
    path_to_file = sys.argv[2] #The file's path
    name = sys.argv[3] #The file's name
    writeFiles(path_to_directory,path_to_file,name)

if __name__ == '__main__':
    main()

For example, if this is the input file:

>+::chrY:59351341-59351351()
tttcctagcc
>+::chrY:59352450-59352460()
agctttagta
>+::chrY:59352832-59352842()
ggttttaggg
>-::chrY:59358570-59358580()
GCTGAGAGCC
>-::chrY:59363409-59363419()
ggttaggggt

the desired output fasta file is:

>+::chrY:59351341-59351351()
tttcctagcc
>+::chrY:59352450-59352460()
agctttagta
>+::chrY:59352832-59352842()
ggttttaggg

without the sequences:

>-::chrY:59358570-59358580()
GCTGAGAGCC
>-::chrY:59363409-59363419()
ggttaggggt

which have GA or TA in 4,5 columns.

and the matching desired output bed file - contains only the sequences in the fasta file is:

chrY    59351341        59351351        +
chrY    59352450        59352460        +
chrY    59352832        59352842        +

Can anyone suggest a better solution for this goal?

python fasta bed • 2.6k views

ADD COMMENT • link 6.7 years ago by elisheva ▴ 120

1

Entering edit mode

please, comment or validate your previous questions:

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

Upvote|Bookmark|Accept

ADD REPLY • link 6.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

show us an example of input and of desired output.

ADD REPLY • link 6.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I don't understand why you want to convert a fasta to a bed file ? Fasta files contain only the primary DNA sequence not genomic coordinates ? Could you explain a little bit more ? Are the genomic informations in the fasta header ?

ADD REPLY • link 6.7 years ago by Nicolas Rosewick 11k

0

Entering edit mode

what's the difference between input.fasta and output.fasta ? you're not explaining us what you're trying to do, we're not medium.

ADD REPLY • link 6.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

6.7 years ago

Bastien Hervé 5.9k

As said in comment, it's hard to unsterstand what you want to achieve, but given your examples :

I would say that your writeFiles is well wrote, except this line bed_file.writebedFormatrecord.id)) and the path finding

I assume that you want to pass your record.id to your bedFormat function, then to write the result in a file :

from Bio import SeqIO
import sys
import traceback
import warnings
import argparse
import re
PYRIMIDINES = ["TT","TC","CT","CC"]

def bedFormat(Id):
    """This function manipulates the fasta file header for getting appropriate bed file """
    splitted_line = re.split(':|-',Id[:-2])
    splitted_line.remove('')
    return "\t".join([splitted_line[i] for i in [1,2,3,0]]) + "\n"

def writeFiles(path_to_directory,path_to_file,name):
    """ This function gets fasta file and writes new filtered bed and fasta files according to the original file"""
    bed_file = open(path_to_directory + "/filtered_bed/" + name + ".bed", "a") #filtered bed file.
    fasta_file = open(path_to_directory + "/filtered_fasta/" + name + ".fasta", "wb+") #filtered fasta file.

    with open(path_to_file+"/"+name+ ".fasta", mode='r') as handle:

        for record in SeqIO.parse(handle, 'fasta'):
            seq = record.seq.upper()

            if seq[3:5] in PYRIMIDINES: #If there are pyrimidines on the 4 and 5 columns. 
                bed_line = bedFormat( record.id) ###There is no space between "(" and "r" but biostars eat my "(" without adding a space
                bed_file.write(bed_line)
                SeqIO.write(record, fasta_file, "fasta")
    bed_file.close()
    fasta_file.close()

def main():
    path_to_directory = sys.argv[1] #The directory path
    path_to_file = sys.argv[2] #The file's path
    name = sys.argv[3] #The file's name
    writeFiles(path_to_directory,path_to_file,name)

if __name__ == '__main__':
    main()

I guess that all these rfind and find take to much time. Do you win time with my bedFormat function ?

How many reads do you have ?

ADD COMMENT • link 6.7 years ago by Bastien Hervé 5.9k

0

Entering edit mode

Thank a lot for the solution, but for some reason, it took even much time

ADD REPLY • link 6.7 years ago by elisheva ▴ 120

0

Entering edit mode

what is the size of your input file ?

ADD REPLY • link 6.7 years ago by Nicolas Rosewick 11k

0

Entering edit mode

Try to supply more informations about your input files and try to investigate the running time of each function in the script above.

ADD REPLY • link 6.7 years ago by Bastien Hervé 5.9k

score 3 · Accepted Answer · 2018-04-12

assuming two lines per fata record:

cat input.fa  |\
paste - -  |\
awk 'toupper(substr($2,4,2)) ~ /^(TT|TC|CT|CC)$/' |\
tr "\t" "\n" |\
tee output.fa |\
awk -F '[>(:-]' '/^>/ {printf("%s\t%s\t%s\t%s\n",$4,$5,$6,$2);}'


chrY    59351341    59351351    +
chrY    59352450    59352460    +
chrY    59352832    59352842    +

$ cat output.fa

>+::chrY:59351341-59351351()
tttcctagcc
>+::chrY:59352450-59352460()
agctttagta
>+::chrY:59352832-59352842()
ggttttaggg