Python: How to find coordinates of short sequences in a FASTA file?
4
0
Entering edit mode
9.5 years ago

I have a list of short sequences that I want to obtain its coordinate or in another word to get its bed file after compare with a fasta file which contains original sequences.

Fasta file:

>PGH2
CGTAGCGGCTGAGTGCGCGGATAGCGCGTA

Short sequence fasta file:

>PGH2
CGGCTGAGT

Is there any ways to obtain its coordinate? Bedtools can't help much.

Desired output:

PGH2 6 14
fasta bed biopython sequence • 5.1k views
ADD COMMENT
0
Entering edit mode

@allyson1115ar: I have a feeling you are somewhat confused or you are confusing me with your question. The file format that you have shown above is a FASTA format. FASTA format hold sequence information, where string after > explain what sort of sequence that is e.g it can be a gene sequence, a sequenced read sequence, contig sequence. The gene (or any other features such as CDS, UTR's etc) coordinates, that previously have been annotated usually kept in gff/gtf or bed files.

It is a good practice to give a bit more background information on forums to what you are doing. At least provide species name and what is PGH2, a gene or contig or something else?

If those are the reads you got from sequencing and you are interested in identifying whereabouts those read fall on the reference genome than your first step will be read alignment to the reference genome. This will give a bam file with start and end coordinates of where about your reads land in the reference genome.

ADD REPLY
0
Entering edit mode
9.5 years ago
PoGibas 5.1k

I had the same problem long time ago and Farhat solution was what helped me: 7N Motif Search Over The Genome

You can edit script a little or just easily parse output for wanted bed format.

ADD COMMENT
0
Entering edit mode
9.5 years ago
Ram 44k

Either Pgibas's solution, or you can create rigid motif matrices out of your sequences and use MEME's FIMO.

By "rigid", I mean where the probability for one base is 1.0 and the others is 0.0, as opposed to the flexibility offered by most motif position weight matrices.

ADD COMMENT
0
Entering edit mode
9.5 years ago
alec_djinn ▴ 390

It is not a difficult problem to solve, all you need is to use string.find().

The basic algorithm is something like that in python.

fasta = 'CGTAGCGGCTGAGTGCGCGGATAGCGCGTA'
target = 'CGGCTGAGT'

start = fasta.find(target)+1
end = start + len(target) -1

print(start, end)
ADD COMMENT

Login before adding your answer.

Traffic: 1820 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6