Question

Help with randomly sampling from a fasta file

0

Entering edit mode

7.7 years ago

tpaisie ▴ 80

Hey guys, I'm trying to make a script that will take a fasta file with many sequences from one patient at different time points and then randomly sample one sequence from each time point. For example here are some sequence title names:

01P03Pr01
01P03Pr02
01P03Pr03
01P03Pr04
01P03Kr01
01P03Kr02
01P03Kr03
09P03Pr01
09P03Pr02
09P03Pr03
09P03Kr01
09P03Kr05

Then these are the random sequences that were taken out of the larger fasta file and put in a new one:

01P03Pr02
01P03Kr01
09P03Pr03
09P03Kr05

Hopefully that makes sense. I'm a beginner with coding in python and want to improve so I would appreciate a nudge or some help. I'm using random.sample in my script and i'm not really sure where to start in terms of whether or not to make a dictionary or index, or none of those. Any help would be appreciated!!!

sequencing fasta python • 6.7k views

ADD COMMENT • link updated 7.7 years ago by GenoMax 147k • written 7.7 years ago by tpaisie ▴ 80

2

Entering edit mode

Have a try seqtk sample

ADD REPLY • link 7.7 years ago by pengchy ▴ 450

0

Entering edit mode

Are all the sequences in a single FASTA file? If so, you may explain the structure of sequence names

01P03Pr04

What are the time points and patient ID and body parts? 01, P03Rr, 04??

If not, just sampling every file.

ADD REPLY • link 7.7 years ago by shenwei356 8.7k

0

Entering edit mode

so for the sample you wrote its:

01 = month of sampled P03 = Patient ID P = kind of tissue r = RNA 04 = sequence number

I need a script that prompts me for how many sequences I would like from each tissue. The example I put in the OP is asking for one randomly sampled sequence per tissue per time point.

ADD REPLY • link 7.7 years ago by tpaisie ▴ 80

0

Entering edit mode

read sequences using Biopython, parse the name to get month and tissue information, save sequences in dict with structure like seqs[month][tissue]=list of sequences, and sample for per tissue per time point.

ADD REPLY • link 7.7 years ago by shenwei356 8.7k

0

Entering edit mode

Thanks for the tip! I will try to write a script for this and if it doesn't work i'll post my script.

ADD REPLY • link 7.7 years ago by tpaisie ▴ 80

score 3 · Answer 1 · 2017-03-14

3

Entering edit mode

7.7 years ago

GenoMax 147k

Use reformat.sh from BBMap suite.

To sample 10% of reads

reformat.sh in=reads.fa out=sampled.fa samplerate=0.1

and for exact sampling (100 reads):

reformat.sh in=reads.fa out=sampled.fa samplereadstarget=100

ADD COMMENT • link 7.7 years ago by GenoMax 147k

0

Entering edit mode

So I don't want to just sample 10% of reads nor are these fastq files, its a MSA fasta file. I'll have 100 sequences in my fasta file from one patient, already aligned, and 25 sequences will be from the same patient from the same body part at different time points. So that means there would be 4 different sets of 25 sequences from the same patient each from a different body part taken at different time points. I want to randomly sample 1 sequence from each set of 25 and end up with 4 sequences in my output fasta file. Does that make sense?

ADD REPLY • link 7.7 years ago by tpaisie ▴ 80

1

Entering edit mode

If you noticed in the example I provided above I am using fasta (.fa) files. Since aligned fasta files are not a standard format (and depending on how they are broken e.g. line wrapping) reformat.sh may (or may not) work. Won't be that much effort to try since BBMap does not need any install or dependencies (besides java). If you have 4 sets of 25 sequences then you should split them into 4 files and then sample 1 sequence out of the 4 files independently reformat.sh in=reads_1.fa out=sampled_1.fa samplereadstarget=1 I don't recollect if BBMap will append a file but you can always cat those 4 sampled files together.

ADD REPLY • link 7.7 years ago by GenoMax 147k

0

Entering edit mode

Yeah I know i can do that, but the point is for me to write a python script for doing what I want to do from one fasta file.

ADD REPLY • link 7.7 years ago by tpaisie ▴ 80

2

Entering edit mode

If you are writing a script yourself then it is generally customary to include that as well in your post, if you are running into difficulties. Also make it clear in the original post (by editing it) that you want to do this yourself and don't want ready made/other solutions.

We are generally suspicious when solutions based on a specific technology (e.g. python) are requested when others could be easily used. But learning is always encouraged on Biostars and as long as help is not being requested for homework assignments people will chip-in to help.

ADD REPLY • link 7.7 years ago by GenoMax 147k

0

Entering edit mode

How can we recognize the times points and body parts from the sequence names, you should tell us all.

ADD REPLY • link 7.7 years ago by shenwei356 8.7k