I have a FASTA file containing 100 random permutations of a genome, which has then been broken up into ORFs. Here is a sample:
>Shuffled1_1 [5 - 79] Based on NC_000908.2
MNGLISMLIYFDEYENEIFVEKKDL
>Shuffled1_2 [36 - 101] Based on NC_000908.2
MMSMKMKFSLKKKICNYIPKLQ
>Shuffled1_3 [19 - 111] Based on NC_000908.2
MNVNLLWWVWKWNFRWKKRFVTTYQSSNKLK
>Shuffled1_4 [105 - 134] Based on NC_000908.2
MKVERQISTL
>Shuffled1_5 [148 - 177] Based on NC_000908.2
MINNSSSPAL
>Shuffled1_6 [174 - 242] Based on NC_000908.2
MINKFPELLLTQLSYVRYYLFAI
There are over 3 million records. I need to estimate the 95th percentile of ORF length by computing it separately for each shuffled permutation and then averaging the 100 estimates. I am having trouble figuring out how to get the 95th percentile of Shuffled1 before moving on to Shuffled2, and so on. Any help would be appreciated. I was able to pull out just the first shuffled permutation using awk and get its percentile with this code:
from Bio import SeqIO
import numpy as np

# Record the length of every ORF in the first permutation.
seq_lens = []
for record in SeqIO.parse("firstperm.fasta", "fasta"):
    seq_lens.append(len(record))  # sequence length in residues

a = np.array(seq_lens)
p = np.percentile(a, 95)
print(p)
I am just trying to figure out how to repeat this over the entire FASTA file for all 100 shuffled permutations. Basically, I am looking for a list of 100 numbers, one 95th percentile per permutation.
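For reference, here is a minimal sketch of one way to do it in a single pass, assuming every record ID starts with the permutation name followed by an underscore (e.g. Shuffled1_1, as in the sample above), and using all_perms.fasta as a placeholder name for the combined file:

from collections import defaultdict

from Bio import SeqIO
import numpy as np

# Group ORF lengths by permutation in one pass over the file.
# "all_perms.fasta" is a placeholder for the combined file of all
# 100 permutations; the key is the ID up to the last underscore.
lengths = defaultdict(list)
for record in SeqIO.parse("all_perms.fasta", "fasta"):
    perm = record.id.rsplit("_", 1)[0]  # "Shuffled1_1" -> "Shuffled1"
    lengths[perm].append(len(record))

# One 95th-percentile estimate per permutation, then their average.
percentiles = [np.percentile(lens, 95) for lens in lengths.values()]
print(percentiles)           # the 100 per-permutation estimates
print(np.mean(percentiles))  # the averaged 95th-percentile estimate

Keeping all the lengths in memory should be fine here: 3 million integers is small, so there is no need to split the file by permutation first.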