Question

Help creating a dataframe in Python from a FASTA file

3.5 years ago

Student ▴ 30

Hello.

I want to create a dataframe in Python from a FASTA file. Using the toy FASTA file shown below, I wrote this Python program, which returns four columns (id, sequence length, sequence, animal name) and one row for every record in the file.

However, I am trying to understand how to modify this code so that the dataframe contains the same number of rows for the classes Human and Dog. For example, I want to tell Python: "Append id, sequence length, sequence and animal for Human to record (the empty list), but only as many times as there are entries in the class with the fewest entries (here, Dog)".

I think a while loop is needed, but I am having some trouble working out how to write it. Any suggestions?

Below are the Python code I wrote and the FASTA file I used.

import re

import pandas as pd
from Bio.SeqIO.FastaIO import SimpleFastaParser


def read_fasta(file_path, columns):
    with open(file_path) as fasta_file:  # open the path passed in, not a hard-coded file name
        records = []  # one list per FASTA record

        # SimpleFastaParser yields one (title, sequence) tuple of strings per record:
        # the title line without the leading '>' and the sequence with whitespace removed.
        for title, sequence in SimpleFastaParser(fasta_file):
            record = []
            title_splits = re.findall(r"[\w']+", title)  # split the title into its word-like fields

            record.append(title_splits[0])  # first column: ID
            record.append(len(sequence))  # second column: sequence length
            sequence = " ".join(sequence)  # put a space between residues
            record.append(sequence)  # third column: the space-separated sequence

            # Fourth column: the species
            if "Human" in title_splits:
                record.append("Human")
            else:
                record.append("Dog")

            records.append(record)
    return pd.DataFrame(records, columns = columns) #We have created a function that returns a dataframe

#Now let's use this function by inserting in the first argument the file name (or file path if your working directory is different from where the fasta file is)        
#And in the second one the names of columns
data = read_fasta("Proof.txt", columns=["id","sequence_length", "sequence", "animal"])
data

The FASTA format file is this:

>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Dog|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Dog|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Dog|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Dog|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK

My code prints a dataframe like:

       id  sequence_length                               sequence animal
0   Numer               19  H S S F I E I V N I E H V I E H I V K  Human
1   Numer               19  H S S F I E I V N I E H V I E H I V K  Human
2   Numer               19  H S S F I E I V N I E H V I E H I V K  Human
3   Numer               19  H S S F I E I V N I E H V I E H I V K  Human
4   Numer               19  H S S F I E I V N I E H V I E H I V K    Dog
5   Numer               19  H S S F I E I V N I E H V I E H I V K    Dog
6   Numer               19  H S S F I E I V N I E H V I E H I V K    Dog
7   Numer               19  H S S F I E I V N I E H V I E H I V K    Dog
8   Numer               19  H S S F I E I V N I E H V I E H I V K  Human
9   Numer               19  H S S F I E I V N I E H V I E H I V K  Human
10  Numer               19  H S S F I E I V N I E H V I E H I V K  Human
11  Numer               19  H S S F I E I V N I E H V I E H I V K  Human

But I would like the number of rows for Human to be the same as for Dog. In other words, I want the same number of entries for each class (Human and Dog).

I hope this is clear. Thank you in advance.

FASTA python dataframe programming data • 3.0k views


score 2 · Accepted Answer · 2022-01-15

A couple of options:

1) If you don't need to shuffle first and just want equal numbers of the first entries occurring in your dataframe, with the number of entries per class limited to the size of the smallest class in 'animal':

# Based on `grouped.size()` from https://stackoverflow.com/a/17945528/8508004 &
# merge-like step from https://stackoverflow.com/a/68566256/8508004
grouped = data.groupby('animal')
subset_data = (grouped.head(grouped.size().min())).reset_index(drop=True)
subset_data
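
The same "first N per class" selection can also be done on the list of records before the dataframe is built, which is closer to what the question describes. Below is a minimal sketch of that idea, not part of the original answer; `balance_records` and its `class_index` parameter are hypothetical names, and it assumes each record is a list like `[id, length, sequence, animal]`. No while loop is needed: one pass counts the classes, a second pass keeps rows until each class reaches the smallest count.

```python
from collections import Counter


def balance_records(records, class_index=3):
    """Keep the first N rows of each class, where N is the smallest class size.

    `records` is a list of rows such as [id, length, sequence, animal];
    `class_index` is the position of the class label within each row.
    """
    class_sizes = Counter(record[class_index] for record in records)
    if not class_sizes:
        return []
    smallest = min(class_sizes.values())

    kept = Counter()  # how many rows of each class have been kept so far
    balanced = []
    # Walk the rows in their original order so the kept rows stay in order
    for record in records:
        label = record[class_index]
        if kept[label] < smallest:
            balanced.append(record)
            kept[label] += 1
    return balanced


# Toy rows mirroring the example file: 4 Human, 4 Dog, 4 Human
toy_records = (
    [["Numer", 19, "H S S", "Human"] for _ in range(4)]
    + [["Numer", 19, "H S S", "Dog"] for _ in range(4)]
    + [["Numer", 19, "H S S", "Human"] for _ in range(4)]
)
balanced = balance_records(toy_records)
```

Inside `read_fasta`, this would presumably be used as `pd.DataFrame(balance_records(records), columns=columns)` in place of the current return value.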

2) Shuffle randomly first and then collect equal numbers of entries, with the number of entries per class limited to the size of the smallest class in 'animal':

# Based on what is in option #1 with addition of
# shuffle from https://stackoverflow.com/a/34879805/8508004.
shuffled_df = data.sample(frac=1).reset_index(drop=True)
grouped = shuffled_df.groupby('animal')
subset_data = (grouped.head(grouped.size().min())).reset_index(drop=True)
subset_data = subset_data.sort_values("animal").reset_index(drop=True)  # optional:
# sort on the grouping column; otherwise the classes come out interleaved (still in equal numbers).
subset_data
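
As a possible variant of option 2, the shuffle and the per-class selection can be combined with `DataFrameGroupBy.sample`, which needs pandas 1.1 or later. This is only a sketch under assumptions: the toy dataframe below stands in for `data` and carries just the `id` and `animal` columns, and `random_state=0` is an arbitrary seed chosen so the draw is reproducible.

```python
import pandas as pd

# Toy stand-in for `data` with the same class imbalance: 8 Human, 4 Dog
data = pd.DataFrame(
    {
        "id": ["Numer"] * 12,
        "animal": ["Human"] * 4 + ["Dog"] * 4 + ["Human"] * 4,
    }
)

# Size of the smallest class in 'animal'
smallest = data["animal"].value_counts().min()

# Draw `smallest` random rows from every class; the seed makes it repeatable
subset_data = (
    data.groupby("animal")
    .sample(n=smallest, random_state=0)
    .reset_index(drop=True)
)
```

Dropping `random_state` gives a different random subset on each run, which matches the behaviour of the unseeded `data.sample(frac=1)` shuffle above.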