Hello.
I want to create a dataframe in Python starting from a FASTA format file. Given the toy FASTA file shown below, I wrote a Python program that returns a dataframe with four columns (id, sequence length, sequence, and animal name) and one row per record in the file.
However, I am trying to understand how to modify this code so that the classes Human and Dog end up with the same number of rows in the dataframe. For example, I want to tell Python: "Append the id, sequence length, sequence, and animal for Human to records (the initially empty list), but only as many times as there are records in the smallest class (here, Dog)".
I think a while loop is needed, but I am having some trouble working out how to write it. Any suggestions?
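To make the idea concrete, here is a minimal sketch of the behaviour described above. It does not use a while loop: it counts the records per class first, then keeps only the first n records of each class, where n is the size of the smallest class. The function name `balance_records` and the `label_index` parameter are hypothetical, and the sketch assumes each record is a list with the animal in the last position, like the records built in my code.

```python
from collections import Counter


def balance_records(records, label_index=-1):
    """Keep the first n records of each class, n being the smallest class size."""
    if not records:
        return []
    # First pass: count how many records each class has.
    counts = Counter(record[label_index] for record in records)
    n_min = min(counts.values())
    # Second pass: keep a record only while its class is below the limit.
    kept = Counter()
    balanced = []
    for record in records:
        label = record[label_index]
        if kept[label] < n_min:
            balanced.append(record)
            kept[label] += 1
    return balanced


# Toy input with the same class layout as the FASTA file: 4 Human, 4 Dog, 4 Human.
toy_records = (
    [["Numer", 19, "H S S", "Human"]] * 4
    + [["Numer", 19, "H S S", "Dog"]] * 4
    + [["Numer", 19, "H S S", "Human"]] * 4
)
balanced = balance_records(toy_records)
print(Counter(record[-1] for record in balanced))
```

With this layout the last four Human records are the ones dropped, because the limit for Human is already reached after the first four.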
Below are the Python code I wrote and the FASTA file I used.
    import re

    import pandas as pd
    from Bio.SeqIO.FastaIO import SimpleFastaParser

    def read_fasta(file_path, columns):
        """Parse a FASTA file into a dataframe with one row per record."""
        records = []  # one [id, length, sequence, animal] list per FASTA record
        with open(file_path) as fasta_file:  # use the argument, not a hard-coded name
            # SimpleFastaParser yields (title, sequence) string tuples: the title line
            # without the leading ">" and the sequence with whitespace removed.
            for title, sequence in SimpleFastaParser(fasta_file):
                record = []
                title_splits = re.findall(r"[\w']+", title)  # split the title into fields
                record.append(title_splits[0])  # first column: id
                record.append(len(sequence))  # second column: sequence length
                sequence = " ".join(sequence)  # put a space between residues
                record.append(sequence)  # third column: sequence
                # fourth column: species
                if "Human" in title_splits:
                    record.append("Human")
                else:
                    record.append("Dog")
                records.append(record)
        return pd.DataFrame(records, columns=columns)

    # Call the function with the file name (or the full path if the FASTA file is
    # not in the working directory) and the column names.
    data = read_fasta("Proof.txt", columns=["id", "sequence_length", "sequence", "animal"])
    data
The FASTA format file is this:
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Dog|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Dog|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Dog|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Dog|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
My code prints a dataframe like:
       id  sequence_length                               sequence  animal
0   Numer               19  H S S F I E I V N I E H V I E H I V K   Human
1   Numer               19  H S S F I E I V N I E H V I E H I V K   Human
2   Numer               19  H S S F I E I V N I E H V I E H I V K   Human
3   Numer               19  H S S F I E I V N I E H V I E H I V K   Human
4   Numer               19  H S S F I E I V N I E H V I E H I V K     Dog
5   Numer               19  H S S F I E I V N I E H V I E H I V K     Dog
6   Numer               19  H S S F I E I V N I E H V I E H I V K     Dog
7   Numer               19  H S S F I E I V N I E H V I E H I V K     Dog
8   Numer               19  H S S F I E I V N I E H V I E H I V K   Human
9   Numer               19  H S S F I E I V N I E H V I E H I V K   Human
10  Numer               19  H S S F I E I V N I E H V I E H I V K   Human
11  Numer               19  H S S F I E I V N I E H V I E H I V K   Human
But I would like the number of rows for Human to be the same as for Dog (in other words, I would like the same number of records for each class, Human and Dog).
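As an illustration of the result I am after, here is a minimal sketch that balances a dataframe after it has been built, rather than while reading the file. The function name `balance_classes` and its `label_column` parameter are hypothetical, and the toy dataframe only mimics the class layout of my file (4 Human, 4 Dog, 4 Human). It keeps the first n rows of each class, which may not be the right choice if a random subset is needed.

```python
import pandas as pd


def balance_classes(df, label_column="animal"):
    """Keep the first n rows of each class, n being the smallest class size."""
    if df.empty:
        return df.copy()
    # Size of the smallest class, e.g. 4 for Dog in the toy data.
    n_min = int(df[label_column].value_counts().min())
    # GroupBy.head(n) keeps the first n rows of every group in original row order.
    balanced = df.groupby(label_column, sort=False).head(n_min)
    return balanced.reset_index(drop=True)


# Toy dataframe with the same class layout as the FASTA file.
toy = pd.DataFrame(
    {
        "id": ["Numer"] * 12,
        "animal": ["Human"] * 4 + ["Dog"] * 4 + ["Human"] * 4,
    }
)
balanced_df = balance_classes(toy)
print(balanced_df["animal"].value_counts())
```

If a random subset per class is preferred over the first rows, `groupby(...).sample(n=n_min)` should do that in pandas 1.1 or later.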
I hope this is clear. Thank you very much in advance!