Hi all, I need some help. In the context of my work, I have a dataframe of paired sequence names: each row has one column for the seq_id from sp1 and one for the seq_id from sp2. I also have two FASTA files which contain all of these sequences (same seq IDs) in FASTA format. But in these files the sequences are totally mixed, and what I need to do is write two new, reorganized FASTA files by parsing my dataframe: for each row, put seqx_A in FASTA file 1 and seqx_B in FASTA file 2, keeping the order of the dataframe. Here is an example:
I currently have one dataframe with the sequences in order, such as:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
seq3_A Seq10_B
seq4_A Seq11_B
and two FASTA files such as:
first one:
>Seq11_B
ACTG
>seq8_B
ATGC
>seq3_A
ACTG
>seq2_A
ATGC
second one:
>seq4_A
ACTG
>seq1_A
ACTG
>Seq10_B
ATGC
>Seq9_B
ATCG
As you can see, _A and _B are mixed in both FASTA files, but I would like to order them by creating new files, putting all the A sequences in one file and all the B sequences in another, in the same order as in the dataframe (paired sequences always have to be added to the files at the same time). Here is the expected output for the example:
fasta1:
>seq1_A
ACTG
>seq2_A
ATGC
>seq3_A
ACTG
>seq4_A
ACTG
and fasta2:
>seq8_B
ATGC
>Seq9_B
ATCG
>Seq10_B
ATGC
>Seq11_B
ACTG
Here is the code I have so far, with the names of the files:
import pandas as pd
from Bio import SeqIO

candidate_df = pd.read_csv("dn_ds.out_test", sep='\t')
#--------------------------------------
# Load the sequences coming from the cluster filtering and arrange them into ordered files per species
# Here are the two columns of the dataframe
seq1_id = candidate_df["seq1_id"]
seq2_id = candidate_df["seq2_id"]
# Here are the desired output files:
output_aa_sp1 = open('candidates_aa_0042.fasta', 'w')
output_aa_sp2 = open('candidates_aa_0035.fasta', 'w')
# Here are the 2 FASTA files to be reordered
record_dict_sp1_aa = SeqIO.to_dict(SeqIO.parse("result1_aa.fasta", "fasta"))
record_dict_sp2_aa = SeqIO.to_dict(SeqIO.parse("result2_aa.fasta", "fasta"))
Does anyone have an idea?
Thank you :)
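One possible approach, as a minimal sketch rather than a definitive answer: since the _A and _B sequences are mixed across both input files, merge the two record dictionaries into a single lookup, then walk the dataframe row by row so that each pair is handled together. The function names `split_pairs` and `write_paired_fastas` are hypothetical; the column names `seq1_id` and `seq2_id` and the file names come from the code in the question. The sketch assumes every ID in the dataframe appears, with identical spelling and case, in one of the two FASTA files; a missing ID raises `KeyError`.

```python
def split_pairs(pairs, lookup):
    """Return (records_sp1, records_sp2) in the row order of `pairs`.

    `pairs` is an iterable of (seq1_id, seq2_id) tuples and `lookup` maps an
    ID to its record. A missing ID raises KeyError before either member of
    that pair is appended, so the two lists always stay the same length.
    """
    sp1, sp2 = [], []
    for id1, id2 in pairs:
        rec1, rec2 = lookup[id1], lookup[id2]  # look up both before appending
        sp1.append(rec1)
        sp2.append(rec2)
    return sp1, sp2


def write_paired_fastas(table, fasta_in_1, fasta_in_2, fasta_out_1, fasta_out_2):
    """Hypothetical driver: read the table and both FASTA files, write two ordered files."""
    # Imported here so split_pairs can be reused without these libraries.
    import pandas as pd
    from Bio import SeqIO

    candidate_df = pd.read_csv(table, sep="\t")

    # One ID -> SeqRecord lookup covering both inputs, because the A and B
    # sequences are mixed across the two files.
    lookup = SeqIO.to_dict(SeqIO.parse(fasta_in_1, "fasta"))
    lookup.update(SeqIO.to_dict(SeqIO.parse(fasta_in_2, "fasta")))

    pairs = zip(candidate_df["seq1_id"], candidate_df["seq2_id"])
    sp1, sp2 = split_pairs(pairs, lookup)

    # SeqIO.write accepts a file name and returns the number of records written.
    SeqIO.write(sp1, fasta_out_1, "fasta")
    SeqIO.write(sp2, fasta_out_2, "fasta")
    return len(sp1)
```

With the file names from the question, the call would be `write_paired_fastas("dn_ds.out_test", "result1_aa.fasta", "result2_aa.fasta", "candidates_aa_0042.fasta", "candidates_aa_0035.fasta")`.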
Hi, I think it is working. Do you know how I can count how many records I have in a FASTA file with Biopython?
You mean in new_fasta1 and new_fasta2? Or in general?
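For the general case, a minimal sketch: iterate over `SeqIO.parse` and count, which avoids holding every record in memory at once. The function names `count_records` and `count_headers` are hypothetical. The second one is a plain-Python alternative that simply counts header lines, assuming a well-formed FASTA file in which every record starts with a `>` line.

```python
def count_records(path):
    """Count FASTA records with Biopython, one record at a time."""
    from Bio import SeqIO  # requires Biopython; imported lazily on purpose
    return sum(1 for _ in SeqIO.parse(path, "fasta"))


def count_headers(path):
    """Count lines starting with '>' (plain Python, assumes well-formed FASTA)."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

If the records are already loaded with `SeqIO.to_dict`, as in the question, `len(record_dict_sp1_aa)` gives the same number for that file.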
BioPython is powerful but slow and, perhaps, overkill for this task.
Indeed, if time really matters to him, he should switch to a compiled language such as C. But he mentioned Python, so I gave him this solution. He could also have written something in Python from scratch like you did, but I'm not even sure that would be quicker. I know, from testing, that Biopython is memory-consuming. I'm just curious, do you have a paper on hand about this Biopython downside? Thanks!
I do not have a paper on this, I'm sorry. I'm just relating past (anecdotal) experience processing FASTA files with this library.
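For comparison with the Biopython route discussed above, here is a rough sketch of what a from-scratch version could look like. It is not the code posted earlier in this thread, and it makes no claim about being faster. The names `read_fasta` and `reorder` are hypothetical. It assumes non-empty header lines, takes the first word of each header as the ID, and expects `pairs` to be the (seq1_id, seq2_id) rows of the dataframe in order.

```python
def read_fasta(path):
    """Yield (header, sequence) tuples; the header excludes the leading '>'."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def reorder(pairs, fasta_paths, out_1, out_2):
    """Write column-1 sequences to out_1 and column-2 sequences to out_2, in row order."""
    lookup = {}
    for path in fasta_paths:
        for header, seq in read_fasta(path):
            lookup[header.split()[0]] = seq  # ID = first word of the header
    with open(out_1, "w") as f1, open(out_2, "w") as f2:
        for id1, id2 in pairs:
            # Look up both IDs first so a missing one fails before half a pair is written.
            seq1, seq2 = lookup[id1], lookup[id2]
            f1.write(f">{id1}\n{seq1}\n")
            f2.write(f">{id2}\n{seq2}\n")
```

The rows can be passed as `zip(candidate_df["seq1_id"], candidate_df["seq2_id"])`. Note that each sequence is written on a single line, so any line wrapping in the input files is not preserved.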