3.8 years ago
sunyeping
I wish to use Python to read in a FASTA sequence file and convert it into a pandas DataFrame. I use the following script:
from Bio import SeqIO
import pandas as pd

def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        # one row per sequence: the description followed by one residue per column
        seq = list(str(record.seq).upper())
        seqList.append([desp] + seq)
    seq_df = pd.DataFrame(seqList)
    print(seq_df.shape)
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df

if __name__ == "__main__":
    path = 'path/to/the/fasta/file/'
    infile = path + 'GISAIDspikeprot0119.selection.fasta'
    df = fasta2df(infile)
The 'GISAIDspikeprot0119.selection.fasta' file can be found at https://drive.google.com/file/d/1DYwhzUDH0LNgZXFuY2ud0CWkWLL9SBid/view?usp=sharing
The script runs on my Linux workstation with only one CPU core. Is it possible to run it with more cores (multiple processes) so that it runs much faster? What would the code for that look like?
With many thanks!
I think that the parallel computation may come not from the code itself but from the configuration of the workstation.
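For reference, below is a minimal sketch of one way the per-record work could be spread over several processes with the standard library's multiprocessing.Pool; it is an illustration, not the original script, and the function name fasta2df_parallel, the process count, and the chunksize are assumptions chosen for the example.

from multiprocessing import Pool

import pandas as pd
from Bio import SeqIO


def record_to_row(record):
    # one row per sequence: the description followed by one residue per column
    return [record.description] + list(str(record.seq).upper())


def fasta2df_parallel(infile, processes=4, chunksize=100):
    # records are parsed in the parent process and shipped to workers in chunks
    records = SeqIO.parse(infile, 'fasta')
    with Pool(processes=processes) as pool:
        rows = pool.map(record_to_row, records, chunksize=chunksize)
    seq_df = pd.DataFrame(rows)
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df


if __name__ == "__main__":
    df = fasta2df_parallel('GISAIDspikeprot0119.selection.fasta', processes=4)
    print(df.shape)

Note that the per-record work here is mostly string handling, so the parsing and inter-process pickling overhead may dominate and the speed-up can be modest; timing it on a small subset of the file first would be prudent.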