I have a fasta file and I need to split the file in two at random (80% of the sequences in one file, 20% in the other) to train my deep learning model. I would like to do it using R or Bash. Can anybody help me?
thank you very much, Fabiano
seqkit sample should be what you are looking for.
First, shuffle the sequences with random seed 1 so the split is reproducible.
seqkit shuffle --two-pass --rand-seed 1 hairpin.fa -o hairpin.s1.fa
testing data
seqkit sample -p 0.2 hairpin.s1.fa -o test.1.fasta
training data (just excluding the testing data).
# id of test data
seqkit seq -n -i test.1.fasta -o test.1.fasta.id.txt
# training data
seqkit grep -v -f test.1.fasta.id.txt hairpin.s1.fa -o train.1.fasta
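As far as I understand, seqkit sample -p 0.2 keeps each record with probability 0.2, so the test set is about 20% rather than exactly 20%. If an exact split matters, or seqkit is not installed, the same idea can be sketched with plain awk. This is only a minimal sketch: the function name split_fasta, its arguments, and the output file names are made up for illustration, and it holds the whole file in memory, so it assumes the FASTA file is not huge.

```shell
#!/usr/bin/env bash
# Sketch: exact 80/20 random split of a FASTA file using awk only.
# split_fasta and its arguments are hypothetical names, not part of seqkit.
# Usage: split_fasta input.fasta train.fasta test.fasta [seed]

split_fasta() {
    local input=$1 train=$2 test=$3 seed=${4:-1}
    awk -v seed="$seed" -v train="$train" -v test="$test" '
        # Store each record (header plus all of its sequence lines) as one string.
        /^>/ { n++ }
        n > 0 { rec[n] = rec[n] $0 "\n" }
        END {
            srand(seed)
            # Fisher-Yates shuffle of the record indices.
            for (i = 1; i <= n; i++) idx[i] = i
            for (i = n; i > 1; i--) {
                j = int(rand() * i) + 1
                tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp
            }
            # One fifth of the records (rounded down) go to the test file.
            n_test = int(n / 5)
            # Create or empty both output files, even if one receives no records.
            printf "" > train
            printf "" > test
            for (i = 1; i <= n; i++)
                printf "%s", rec[idx[i]] > (i <= n_test ? test : train)
        }
    ' "$input"
}
```

For example, split_fasta hairpin.fa train.1.fasta test.1.fasta 1 would write the two files in one pass. The shuffle depends on the awk implementation, so the same seed may give a different split with gawk and mawk.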
@shenwei356, thank you very much for your help, perfect!
But now I realized that some fasta sequences contain the letter "B", which is outside the 20 standard amino acid symbols.
How can I exclude the fasta sequences that contain the letter "B", before splitting the file?
thank you very much, Fabiano
You can exclude sequences containing any letter other than the 20 standard amino acid letters using seqkit grep:
seqkit grep -s -v -r -p "[^acdefghiklmnpqrstvwyACDEFGHIKLMNPQRSTVWY]" input.fasta.gz -o filtered.fasta.gz
Options:
-s, --by-seq           search subseq on seq, both positive and negative strand are searched, and mismatch allowed using flag -m/--max-mismatch
-v, --invert-match     invert the sense of matching, to select non-matching records
-r, --use-regexp       patterns are regular expression
-p, --pattern strings  search pattern (multiple values supported. Attention: use double quotation marks for patterns containing comma, e.g., -p '"A{2,}"')
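For comparison, the same filter can be sketched without seqkit. This is a rough awk equivalent, not what seqkit does internally: the function name filter_standard_aa is hypothetical, the input is assumed to be an uncompressed FASTA file with Unix line endings, and the output sequences are written on a single line each. Note that a trailing "*" (stop symbol) would also cause a record to be dropped.

```shell
#!/usr/bin/env bash
# Sketch: keep only records whose sequence uses the 20 standard amino acid letters.
# filter_standard_aa is a hypothetical helper name.
# Usage: filter_standard_aa input.fasta > filtered.fasta

filter_standard_aa() {
    awk '
        # Print the buffered record only if no non-standard character was seen.
        function emit() {
            if (header != "" && seq !~ /[^ACDEFGHIKLMNPQRSTVWYacdefghiklmnpqrstvwy]/)
                printf "%s\n%s\n", header, seq
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
        { seq = seq $0 }                                # join wrapped sequence lines
        END { emit() }                                  # do not forget the last record
    ' "$1"
}
```

Running this before the split means both the training and the testing file are drawn from the filtered set.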
reformat.sh from BBTools as well.
Edit: I am going to move this to a comment since it sounds like OP has a single fasta file and wants to split it into two pieces. If that is the case, then this may not be the right tool.
This isn't what you are asking, but I will give you my opinion just in case.
If you have a single genome, doing a random 80:20 split as you want is probably fine. If you have multiple genomes, I don't think random splitting is optimal. You can end up with some genomes not being represented enough, or at all, in the validation split. It may be better to do stratified sampling: concatenate the genomes sequentially and take every 5th sequence, in the order they appear in a group file, for the validation split.
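The "every 5th sequence" idea can be sketched in a few lines of awk. This assumes the FASTA file is already ordered so that sequences from the same genome sit next to each other. The function name every_fifth_split and the output names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: systematic 80/20 split, sending every 5th record to the validation file.
# Assumes records are already grouped by genome in the input file.
# every_fifth_split is a hypothetical helper name.
# Usage: every_fifth_split input.fasta train.fasta valid.fasta

every_fifth_split() {
    local input=$1 train=$2 valid=$3
    awk -v train="$train" -v valid="$valid" '
        # Create or empty both output files up front.
        BEGIN { printf "" > train; printf "" > valid }
        # On each header, count the record and pick its destination.
        /^>/ { n++; out = (n % 5 == 0) ? valid : train }
        # Header and sequence lines of the current record go to the same file.
        n > 0 { print > out }
    ' "$input"
}
```

Because the split follows the input order, each genome contributes roughly one fifth of its sequences to validation, as long as it has at least five of them.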
Separately, doing a single holdout validation as you seem to be planning may not be the best idea. Whether it is done randomly or in stratified fashion, there is no guarantee that sequence distributions will be sufficiently similar in a single split. Unless your training takes many weeks or longer, I suggest you do a cross-validation (say, 5-fold). The average of results for all folds will give you a better estimate of the model performance on unseen data, and the average of predictions is also likely to be better than predictions from a single model.
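To make the 5-fold suggestion concrete, here is a minimal sketch that shuffles the records and deals them into k fold files. The function names make_folds and fold_train_set, the fold_N.fa naming, and the default seed are all assumptions for illustration. The whole file is held in memory.

```shell
#!/usr/bin/env bash
# Sketch: split a FASTA file into k shuffled folds for cross-validation.
# make_folds, fold_train_set and the fold_N.fa names are hypothetical.
# Usage: make_folds input.fasta folds_dir [k] [seed]

make_folds() {
    local input=$1 outdir=$2 k=${3:-5} seed=${4:-1}
    mkdir -p "$outdir"
    awk -v k="$k" -v seed="$seed" -v dir="$outdir" '
        # Store each record (header plus sequence lines) as one string.
        /^>/ { n++ }
        n > 0 { rec[n] = rec[n] $0 "\n" }
        END {
            srand(seed)
            # Fisher-Yates shuffle of the record indices.
            for (i = 1; i <= n; i++) idx[i] = i
            for (i = n; i > 1; i--) {
                j = int(rand() * i) + 1
                tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp
            }
            # Deal the shuffled records into the folds round-robin.
            for (i = 1; i <= n; i++) {
                file = dir "/fold_" ((i - 1) % k + 1) ".fa"
                printf "%s", rec[idx[i]] > file
            }
        }
    ' "$input"
}

# Print the training set for one round: every fold except the held-out one.
# Usage: fold_train_set folds_dir held_out_fold_number > train.fa
fold_train_set() {
    local outdir=$1 held_out=$2 f
    for f in "$outdir"/fold_*.fa; do
        [ "$f" = "$outdir/fold_$held_out.fa" ] || cat "$f"
    done
}
```

For round i, fold_i.fa serves as the validation set and fold_train_set folds_dir i gives the matching training set, so each sequence is used for validation exactly once.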
Thank you very much for your comment, @Mensur. My file is just a list of peptides, not a complete genome.