How to convert a FASTA file into a pandas DataFrame with a Python script running in parallel?
3.8 years ago
sunyeping ▴ 110

I want to use Python to read a FASTA sequence file and convert it into a pandas DataFrame. I use the following script:

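import os

import pandas as pd
from Bio import SeqIO


def fasta2df(infile):
    """Read a FASTA file into a DataFrame: one row per record, one column per residue."""
    seq_list = []
    for record in SeqIO.parse(infile, 'fasta'):
        desp = record.description
        # str(record.seq) avoids the private _data attribute, which may hold bytes in newer Biopython
        seq = list(str(record.seq).upper())
        seq_list.append([desp] + seq)
    # Build the DataFrame once, after the loop, rather than once per record
    seq_df = pd.DataFrame(seq_list)
    print(seq_df.shape)
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df


if __name__ == "__main__":
    path = 'path/to/the/fasta/file'
    # os.path.join inserts the path separator; 'infile' avoids shadowing the built-in input()
    infile = os.path.join(path, 'GISAIDspikeprot0119.selection.fasta')
    df = fasta2df(infile)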
The 'GISAIDspikeprot0119.selection.fasta' file can be found at https://drive.google.com/file/d/1DYwhzUDH0LNgZXFuY2ud0CWkWLL9SBid/view?usp=sharing

On my Linux workstation the script uses only one CPU core. Is it possible to run it on several cores (multiple processes) so that it finishes much faster? What would the code for that look like?

Many thanks!

alignment • 3.9k views

I think whether the computation runs in parallel may depend on how the workstation is configured rather than on the code itself.
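For what it's worth, a plain Python script will not spread itself over several cores on its own; the work has to be handed to worker processes explicitly. Below is a minimal sketch of one way to do that with the standard-library concurrent.futures module. Several things here are assumptions rather than part of the original script: the helper names (read_fasta, batched, batch_to_rows, fasta_to_rows, rows_to_df) are made up for illustration, the small hand-written FASTA reader stands in for SeqIO.parse so the sketch only depends on pandas, and the defaults for workers and batch_size are arbitrary. Also note that the per-record work is very light, so the gain from multiple processes may be small; building the DataFrame only once, after all rows are collected, is likely to matter more.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice


def read_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    desc, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if desc is not None:
                yield desc, "".join(chunks)
            desc, chunks = line[1:], []
        else:
            chunks.append(line)
    if desc is not None:
        yield desc, "".join(chunks)


def batched(iterable, size):
    """Group items into lists of at most `size` elements."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def batch_to_rows(batch):
    """Worker task: turn each record into [description, residue1, residue2, ...]."""
    return [[desc] + list(seq.upper()) for desc, seq in batch]


def fasta_to_rows(path, workers=4, batch_size=2000):
    """Parse the file in the parent process, then split the records in parallel."""
    with open(path) as handle:
        batches = list(batched(read_fasta(handle), batch_size))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input batches
        return [row for part in pool.map(batch_to_rows, batches) for row in part]


def rows_to_df(rows):
    """Build the DataFrame once, from all rows (assumes at least one record)."""
    import pandas as pd

    df = pd.DataFrame(rows)
    df.columns = ["strainName"] + list(range(1, df.shape[1]))
    return df
```

To use it, call rows_to_df(fasta_to_rows(infile)) from inside an if __name__ == "__main__": block, which is needed on platforms that start worker processes by spawning. Records are sent to the workers in batches because passing them one at a time would spend more time on inter-process communication than on the actual work.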
