Hey guys, I'm trying to make a script that will take a FASTA file with many sequences from one patient at different time points and then randomly sample one sequence from each tissue at each time point. For example, here are some sequence title names:
01P03Pr01
01P03Pr02
01P03Pr03
01P03Pr04
01P03Kr01
01P03Kr02
01P03Kr03
09P03Pr01
09P03Pr02
09P03Pr03
09P03Kr01
09P03Kr05
Then these are the randomly sampled sequences that were taken out of the larger FASTA file and put in a new one:
01P03Pr02
01P03Kr01
09P03Pr03
09P03Kr05
Hopefully that makes sense. I'm a beginner with coding in Python and want to improve, so I would appreciate a nudge or some help. I'm using random.sample in my script and I'm not really sure where to start in terms of whether to make a dictionary or an index, or neither. Any help would be appreciated!
Have a try with seqtk sample.
Are all the sequences in a single FASTA file? If so, could you explain the structure of the sequence names? What are the time point, patient ID and body part? 01, P03Rr, 04?? If not, just sample every file.
So for the sample you wrote, it's:
01 = month sampled, P03 = patient ID, P = kind of tissue, r = RNA, 04 = sequence number
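To make that naming scheme concrete, here is a minimal sketch of splitting a name into its parts with a regular expression. The field widths (two-digit month, "P" plus two digits for the patient, one letter each for tissue and molecule, two-digit sequence number) are assumptions read off the example names in this thread, and NAME_RE and parse_name are made-up names for illustration.

```python
import re

# Assumed layout, taken from the example names in this thread:
# 2-digit month, patient ID (P + 2 digits), tissue letter,
# molecule letter (r = RNA), 2-digit sequence number.
NAME_RE = re.compile(r"^(\d{2})(P\d{2})([A-Za-z])([A-Za-z])(\d{2})$")


def parse_name(name):
    """Split a sequence name such as '01P03Pr04' into its fields."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"name does not fit the assumed pattern: {name!r}")
    month, patient, tissue, molecule, number = match.groups()
    return {
        "month": month,
        "patient": patient,
        "tissue": tissue,
        "molecule": molecule,
        "number": number,
    }
```

If some names in the real file are longer or shorter than these examples, the pattern would need adjusting.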
I need a script that prompts me for how many sequences I would like from each tissue. The example I put in the OP is asking for one randomly sampled sequence per tissue per time point.
Read the sequences using Biopython, parse each name to get the month and tissue information, save the sequences in a dict with a structure like seqs[month][tissue] = list of sequences, and sample per tissue per time point.
Thanks for the tip! I will try to write a script for this, and if it doesn't work I'll post my script.
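For anyone landing here later, the approach suggested above (parse the names, group into a nested dict, then sample each group) might look roughly like this. It is only a sketch under a few assumptions: names follow the 9-character pattern from this thread (month in characters 1-2, tissue letter in character 6), and a small plain-Python FASTA reader is used so the example runs without extra installs. With Biopython, Bio.SeqIO.parse(path, "fasta") would do the reading instead. All function names here are made up for illustration.

```python
import random
from collections import defaultdict


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def group_by_month_and_tissue(records):
    """Build groups[month][tissue] = list of (name, sequence)."""
    groups = defaultdict(lambda: defaultdict(list))
    for name, seq in records:
        # Assumed name layout, e.g. 01P03Pr04: month = 01, tissue = P.
        month, tissue = name[:2], name[5]
        groups[month][tissue].append((name, seq))
    return groups


def sample_per_group(groups, n, rng=random):
    """Pick up to n random records from every month/tissue group."""
    picked = []
    for month in sorted(groups):
        for tissue in sorted(groups[month]):
            members = groups[month][tissue]
            # random.sample raises if n exceeds the group size, so cap it.
            picked.extend(rng.sample(members, min(n, len(members))))
    return picked


def main(in_path, out_path):
    """Prompt for n, then write the sampled records to a new FASTA file."""
    n = int(input("Sequences per tissue per time point: "))
    with open(in_path) as handle:
        groups = group_by_month_and_tissue(read_fasta(handle))
    with open(out_path, "w") as out:
        for name, seq in sample_per_group(groups, n):
            out.write(f">{name}\n{seq}\n")
```

Calling main("patient.fasta", "sampled.fasta") (both file names are placeholders) would prompt for the number of sequences and write the sample. Groups with fewer than n sequences are taken whole rather than raising an error, which may or may not be what you want.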