Question

split fasta file to train deep learning model

1

20 months ago

pinheirofabiano ▴ 100

I have a fasta file and I need to split the file in two at random (80% of the sequences in one file, 20% in the other) to train my deep learning model. I would like to do it using R or Bash. Can anybody help me?

thank you very much, Fabiano

fasta split bash R • 1.9k views

ADD COMMENT • link updated 20 months ago by shenwei356 8.7k • written 20 months ago by pinheirofabiano ▴ 100

1

reformat.sh from BBTools can do this kind of random sampling as well.

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.

Edit: I am going to move this to a comment, since it sounds like OP has a single fasta file and wants to split it into two pieces. If that is the case, then this may not be the right tool.

ADD REPLY • link 20 months ago by GenoMax 147k

1

This isn't what you are asking, but I will give you my opinion just in case.

If you have a single genome, doing a random 80:20 split as you want is probably fine. If you have multiple genomes, I don't think random splitting is optimal: some genomes can end up under-represented, or missing altogether, in the validation split. It may be better to do stratified sampling, that is, concatenating the genomes sequentially and taking every 5th sequence as they are ordered in the grouped file.

Separately, doing a single holdout validation as you seem to be planning may not be the best idea. Whether it is done randomly or in stratified fashion, there is no guarantee that the sequence distributions will be sufficiently similar in a single split. Unless your training takes many weeks or longer, I suggest you do cross-validation (say, 5-fold). The average of the results over all folds will give you a better estimate of model performance on unseen data, and the average of the predictions is also likely to be better than the predictions from a single model.

ADD REPLY • link 20 months ago by Mensur Dlakic ★ 28k
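The 5-fold idea above can be sketched with plain awk and sort, without extra tools. This is only an illustrative sketch: the function name `kfold_fasta` and the output naming `PREFIX.foldN.fa` are made up here, and it assumes an uncompressed FASTA file whose header lines contain no tab characters. Records are shuffled with a fixed seed and then dealt out round-robin, so each record lands in exactly one fold.

```shell
# kfold_fasta INPUT PREFIX [K] [SEED]   (hypothetical helper, not part of any tool)
# Writes PREFIX.fold1.fa ... PREFIX.foldK.fa; sequences are written on one line.
kfold_fasta() (
    input=$1; prefix=$2; k=${3:-5}; seed=${4:-1}

    # 1) Linearize: one record per line as "header<TAB>sequence"
    awk '/^>/ { if (hdr != "") print hdr "\t" seq; hdr = $0; seq = ""; next }
         { seq = seq $0 }
         END { if (hdr != "") print hdr "\t" seq }' "$input" |
    # 2) Shuffle: prepend a seeded random integer key, sort on it, drop it
    awk -v seed="$seed" 'BEGIN { srand(seed) }
         { printf "%d\t%s\n", int(rand() * 1000000000), $0 }' |
    sort -n -k1,1 |
    cut -f2- |
    # 3) Deal records round-robin into K fold files
    awk -F '\t' -v k="$k" -v prefix="$prefix" '
         { fold = (NR - 1) % k + 1
           out = prefix ".fold" fold ".fa"
           print $1 "\n" $2 > out }'
)
```

For each round of cross-validation one fold file would serve as the validation set and the remaining four, concatenated with `cat`, as the training set. With several genomes or groups, one could run this once per group and concatenate matching folds, which approximates the stratified split described above.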

0

Thank you very much for your comment, @Mensur. My file is just a list of peptides, not a complete genome.

ADD REPLY • link 20 months ago by pinheirofabiano ▴ 100

score 3 · Answer 1 · 2023-03-19

3

20 months ago

Matthias Zepper 5.0k

seqkit sample should be what you are looking for.

ADD COMMENT • link 20 months ago by Matthias Zepper 5.0k

0

I'm trying to split my file in two using seqkit, but it is not working. With the sample command I can generate a file that contains a random 80% of the original file, but how can I get the second file with the remaining 20%?

ADD REPLY • link 20 months ago by pinheirofabiano ▴ 100

3

Shuffle the sequences with random seed 1:

seqkit shuffle --two-pass --rand-seed 1 hairpin.fa -o hairpin.s1.fa

Testing data:

seqkit sample -p 0.2 hairpin.s1.fa -o test.1.fasta

Training data (just excluding the testing data):

# id of test data
seqkit seq -n -i test.1.fasta -o test.1.fasta.id.txt

# training data
seqkit grep -v -f test.1.fasta.id.txt hairpin.s1.fa -o train.1.fasta

ADD REPLY • link 20 months ago by shenwei356 8.7k
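As far as I know, `seqkit sample -p` keeps each record with the given probability, so the test file holds roughly, not exactly, 20% of the sequences. If an exact count matters, the same shuffle-then-split idea can be sketched with awk and sort alone. The function name `split_fasta` and its argument order are invented for this example, and it assumes an uncompressed FASTA file whose headers contain no tabs.

```shell
# split_fasta INPUT TRAIN_OUT TEST_OUT [TEST_FRACTION] [SEED]   (hypothetical helper)
# Sequences are written unwrapped, on a single line per record.
split_fasta() (
    input=$1; train=$2; test_out=$3; frac=${4:-0.2}; seed=${5:-1}
    tmp=$(mktemp) || exit 1
    : > "$train"; : > "$test_out"    # make sure both outputs exist, even if empty

    # Linearize to "header<TAB>sequence", then shuffle with a seeded random key
    awk '/^>/ { if (hdr != "") print hdr "\t" seq; hdr = $0; seq = ""; next }
         { seq = seq $0 }
         END { if (hdr != "") print hdr "\t" seq }' "$input" |
    awk -v seed="$seed" 'BEGIN { srand(seed) }
         { printf "%d\t%s\n", int(rand() * 1000000000), $0 }' |
    sort -n -k1,1 |
    cut -f2- > "$tmp"

    # The first round(n * frac) shuffled records go to the test file, the rest to train
    n=$(wc -l < "$tmp")
    awk -F '\t' -v n="$n" -v frac="$frac" -v train="$train" -v test_out="$test_out" '
         BEGIN { n_test = int(n * frac + 0.5) }
         { out = (NR <= n_test) ? test_out : train
           print $1 "\n" $2 > out }' "$tmp"
    rm -f "$tmp"
)
```

A call such as `split_fasta peptides.fa train.fa test.fa 0.2 1` (file names are placeholders) would put every record in exactly one of the two outputs, and reusing the same seed on the same machine should reproduce the split.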

0

@shenwei356, thank you very much for your help, perfect!

But now I realized that some fasta sequences contain the letter "B", which is outside the 20 standard amino acid symbols.

How can I exclude the fasta sequences that contain the letter "B", before splitting the file?

thank you very much, Fabiano

ADD REPLY • link 20 months ago by pinheirofabiano ▴ 100

2

To exclude sequences containing any letter outside the 20 standard amino acid letters, use seqkit grep:

seqkit grep -s -v -r -p "[^acdefghiklmnpqrstvwyACDEFGHIKLMNPQRSTVWY]" input.fasta.gz -o filtered.fasta.gz

Options:

  -s, --by-seq                 search subseq on seq, both positive and negative strand are searched, and
                               mismatch allowed using flag -m/--max-mismatch
  -v, --invert-match           invert the sense of matching, to select non-matching records
  -r, --use-regexp             patterns are regular expression
  -p, --pattern strings        search pattern (multiple values supported. Attention: use double
                               quotation marks for patterns containing comma, e.g., -p '"A{2,}"')

ADD REPLY • link 20 months ago by shenwei356 8.7k
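If seqkit is not at hand, the same filter can be approximated with awk. This is a rough sketch under a few assumptions: the function name `filter_standard_aa` is made up, the input is an uncompressed FASTA file, and, as with the pattern above, any character outside the 20 standard letters (including `*`, `X` or `-`) causes the whole record to be dropped.

```shell
# filter_standard_aa INPUT > OUTPUT   (hypothetical helper)
# Prints only records whose sequence uses the 20 standard amino acid letters;
# upper and lower case are both accepted, and sequences are written on one line.
filter_standard_aa() {
    awk '
        function emit() {
            # Keep the record only if no residue falls outside the 20 letters
            if (hdr != "" && toupper(seq) !~ /[^ACDEFGHIKLMNPQRSTVWY]/)
                print hdr "\n" seq
        }
        { sub(/\r$/, "") }                        # tolerate Windows line endings
        /^>/ { emit(); hdr = $0; seq = ""; next } # new record: flush the previous one
        { seq = seq $0 }                          # join wrapped sequence lines
        END { emit() }
    ' "$1"
}
```

Running `filter_standard_aa input.fasta > filtered.fasta` before splitting would remove the sequences containing "B" along with any other non-standard symbol.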