How can I make a .bed file from my fasta?
1
0
Entering edit mode
6.1 years ago
jaqx008 ▴ 110

Hello All, I have a fasta file that I made. It contains nucleotide sequences for specific proteins, like Ago, Dicer, Piwi etc. in the usual fasta format:

> Ago1
CATGCATGTAGGTTAACATGCATGTAGGTTAAGGTGGT
CATGCATGCATGCATGTAGGTTAAGGTTAGGTTAAGGT
GCATGTAGGTTAAGCATGCATGTAGGAGGTGT
>Piwi2
CATGCATGTAGGTTAACATGCATGTAGGTTAAGGTGGT
CATGCATGCATGCATGTAGGTTAAGGTTAGGTTAAGGT
GCATGTAGGTTAAGCATGCATGTAGGAGGTGT

I want to get the bed file from this. I have tried the following; made index, map with bowtie2 to genome of the organism, obtained .sam , convert sam to bam, do bamtobed. but at the end of the day I get nothing in the out put of bed (zero byte file) Am i doing something wrong and what should I do? Thanks

fasta bed bowtie2 mapping • 6.4k views
ADD COMMENT
0
Entering edit mode

Do you have some sort of coordinates of these proteins in a genome? Why do you want to save this data in the BED format?

ADD REPLY
0
Entering edit mode

are your sequences located on the genbank? if so, you can retrieve genebank files and execute a script such as :

#!/usr/bin/env python
# encoding: utf-8

"""
bed_from_genbank.py

grab the gene records from a genbank file (edit for other record types).

- requires:  biopython

"""

from Bio import SeqIO

import pdb

def main():
    outf = open('/../../data.bed', 'w')
    header = """track name=---="genes" itemRgb=On\n"""
    outf.write(header)
    for record in SeqIO.parse(open("/../../sequence.gb", "rU"), "genbank") :
        for feature in record.features:
            if feature.type == 'gene':
                start = feature.location.start.position
                stop = feature.location.end.position
                try:
                    name = feature.qualifiers['gene'][0]
                except:
                    # some features only have a locus tag
                    name = feature.qualifiers['locus_tag'][0]
                if feature.strand < 0:
                    strand = "-"
                else:
                    strand = "+"
                bed_line = "sequence\t{0}\t{1}\t{2}\t1000\t{3}\t{0}\t{1}\t65,105,225\n".format(start, stop, name, strand)
                outf.write(bed_line)
    outf.close()


if __name__ == '__main__':
    main()
ADD REPLY
0
Entering edit mode

I can try but I think for this organism some of the genes have not been annotated.

ADD REPLY
0
Entering edit mode

I have a fasta file that I made.

Was it made de novo or it came from the output of some analysis/manipulation you did? It would be useful to know how you got this file.

ADD REPLY
0
Entering edit mode

basically obtained nucleotide sequence of the proteins from NCBI then add the >header and pasted all sequences in the same text file. renamed it to .fasta

ADD REPLY
0
Entering edit mode

Since the BED formalt minimally requires a chr identifier and a start and stop to create a valid BED record you are not going to be able to use this fasta file. You would need to get that information from the original records. It is unclear what you are planning to do using the file. If you want to comment on that then we can look at options.

How did you get all those D's in nucleotide sequence?

ADD REPLY
0
Entering edit mode

That is not my sequence. its just randomly typed to give a glimpse of what my fasta looks like. I will manually make the bed by specifying gene name and start stop position and use this for my down strea,m application

ADD REPLY
0
Entering edit mode

Did you mix up D and G? This might be an issue for bowtie. Are your genes spliced? If so, you'll need a splice-aware aligner (STAR, HISAT2, BBmap )

ADD REPLY
0
Entering edit mode

It sounds like your trouble is in the bamtobed step. I would focus on troubleshooting that. Might want to check the .sam output to make sure it actually has content and that the alignment worked.

ADD REPLY
1
Entering edit mode
6.1 years ago

A .bed file describe a position or interval on the genome. For example:

chromosome1    40    100    feature X

feature X is located on chromosome 1 from position 40 to 100.

A .fasta file is the nucleotide or protein sequence of the feature.

To generate a .bed file from sequence, you would need to map the sequence back onto the reference genome and extract the coordinates.

ADD COMMENT
0
Entering edit mode

How can this be done? I have a fasta file and I need to get the bed cordinates that will have start and stop positions as mapped to the genome.

ADD REPLY
1
Entering edit mode
  1. Align fasta to genome to create bam
  2. Use bamtobed
ADD REPLY
0
Entering edit mode

This worked fine for what I wanted. Thanks

ADD REPLY
1
Entering edit mode

This smacks of an XY problem. Can you tell us what the ultimate aim of your analysis is? What do you need a bed file for?

Your fasta doesn’t contain enough information to create a bed file so you may need to rethink your data/strategy.

ADD REPLY

Login before adding your answer.

Traffic: 2601 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6