Extracting subsequences from a FASTA file using Python
6.7 years ago
shawnt1234 • 0

Hi, I would like to extract subsequences from a large FASTA file and write the extracted sequences to a new FASTA file, preferably using Python.

I have a csv file with the following format:

id, start, stop, header
id1, 3, 10, Contig0
id2, 12, 25, Contig1
id3, 19, 40, Contig2

The input FASTA file has the following format:

>Contig0
(Contig0 sequence)
>Contig1
(Contig1 sequence)
>Contig2
(Contig2 sequence)

I would like a FASTA output file with the following format:

>id1
(Contig0 sequence from bp 3-10)
>id2
(Contig1 sequence from bp 12-25)
>id3
(Contig2 sequence from bp 19-40)

If anyone has any suggestions or a script that can do this, any help would be greatly appreciated.

fasta sequence python • 3.1k views
6.7 years ago

This is possible with Biopython:

1) Create a dataframe from your csv file (use the header column as the index, since that is the column that matches record.id)

2) Iterate over your fasta file using SeqIO

3) For each record you get from the iteration, find the corresponding row in your dataframe (something like: df.loc[[record.id]])

4) Once you have the right row, replace the record header with the id from that row

5) Slice the sequence using start and stop and replace the record's sequence (record.seq)

6) Write the record to a new file

7) Go back to step 3 for the next record

I'll let you try this on your own; if you want some help, comment below :)
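
For reference, the steps above could be sketched roughly as follows. This is a minimal sketch, not a tested pipeline: it assumes start and stop are 1-based and inclusive (as "bp 3-10" suggests), that the csv columns are named exactly as in the question, and it uses a plain dictionary in place of the dataframe lookup. The function names load_regions and extract_subsequences are made up for illustration.

```python
import csv
from collections import defaultdict

from Bio import SeqIO


def load_regions(csv_path):
    """Map each contig name to a list of (new_id, start, stop) tuples."""
    regions = defaultdict(list)
    with open(csv_path, newline="") as handle:
        # skipinitialspace copes with the "id, start, stop, header" spacing
        for row in csv.DictReader(handle, skipinitialspace=True):
            regions[row["header"]].append(
                (row["id"], int(row["start"]), int(row["stop"]))
            )
    return regions


def extract_subsequences(fasta_path, csv_path, out_path):
    """Write one FASTA record per csv row, named after the csv id."""
    regions = load_regions(csv_path)
    with open(out_path, "w") as out:
        # Stream the FASTA once; only one contig is held in memory at a time
        for record in SeqIO.parse(fasta_path, "fasta"):
            for new_id, start, stop in regions.get(record.id, []):
                # Assumption: 1-based inclusive coordinates -> Python slice
                sub = record[start - 1:stop]
                sub.id = new_id
                sub.description = ""  # keep the header to just the new id
                SeqIO.write(sub, out, "fasta")
```

Calling extract_subsequences("contigs.fa", "regions.csv", "extracted.fa") would then produce the new file; those file names are placeholders.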


Thanks for the help! I wrote a script, but it was not very efficient and ran very slowly, so I did some more research and found bedtools getfasta, which worked for me.
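
For anyone taking the bedtools getfasta route, the csv from the question first has to be turned into a BED file. A minimal sketch of that conversion is below, assuming the csv coordinates are 1-based and inclusive (BED starts are 0-based, ends are exclusive) and that the column names match the question. The function name csv_to_bed is hypothetical.

```python
import csv


def csv_to_bed(csv_path, bed_path):
    """Convert 'id, start, stop, header' rows to 4-column BED lines."""
    with open(csv_path, newline="") as src, open(bed_path, "w") as dst:
        for row in csv.DictReader(src, skipinitialspace=True):
            # Assumption: 1-based inclusive input, so only the start shifts
            start0 = int(row["start"]) - 1
            stop = int(row["stop"])
            # BED columns: chrom, start, end, name
            dst.write(f"{row['header']}\t{start0}\t{stop}\t{row['id']}\n")
```

The resulting file could then be passed to something like `bedtools getfasta -fi contigs.fa -bed regions.bed -name -fo extracted.fa`. The file names are placeholders, and how `-name` formats the output header differs between bedtools versions, so check the result.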

6.7 years ago
GenoMax 147k

pyfaidx by Matt Shirley.
