I have a folder with 4000 files in it. I would like to make a new folder and copy into it all of the files that contain more than 50 FASTA sequences. How do I do this?
I know that I need to create a simple loop, then grep '>' | wc -l and select only those with >50. But I am new to programming and am unsure how to write this properly.
There are a number of solutions posted below. Please mark as many as you like as "accepted" (use the check mark next to each answer). All of them should work for your question.
It is a good practice to accept "solutions" as it shows your appreciation for the effort people put in to bring solutions for your questions.
# Create the target folder if it does not exist yet
mkdir -p smallfiles_folder
# Find all files in the current folder. Add -iname "*fasta" to restrict to FASTA files only
find . -type f -print0 |
# Count the FASTA headers in each file. -m 50 stops grep after 50 matches, saving some computation time
xargs -0 grep -m 50 -H -c '^>' |
# Use gawk to select files with fewer than 50 matches
# (for files with more than 50 sequences, use -m 51 above and '$2 > 50' here)
gawk -F":" '$2 < 50 {print $1}' |
# Exclude files already in the target folder (avoids potential recursive loops)
grep -v 'smallfiles_folder' |
# Copy the files to the target folder (use mv instead of cp to move them)
xargs -I {} cp {} smallfiles_folder/
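The pipeline above keeps files with fewer than 50 headers, while the question asks for files with more than 50. A minimal sketch of that variant follows, assuming GNU grep, xargs and cp. The function name `copy_large_fasta` and its arguments are made up for illustration, and `-maxdepth 1` is an added assumption that the FASTA files sit directly in the source folder. Note that `-m` has to be one above the threshold, because with `-m 50` the count can never exceed 50.

```shell
#!/usr/bin/env bash
# Sketch: copy files with more than MIN FASTA headers from SRC into DEST.
# copy_large_fasta and its arguments are illustrative names, not part of
# the original answer. Requires GNU grep, xargs (-r, -d) and cp (-t).
copy_large_fasta() {
    local src=$1 dest=$2 min=${3:-50}
    mkdir -p "$dest"
    # -maxdepth 1 keeps find out of subfolders (including DEST, if nested).
    # -m stops reading after MIN+1 header lines, so the count can exceed MIN.
    find "$src" -maxdepth 1 -type f -print0 |
        xargs -0 -r grep -c -H -m "$((min + 1))" '^>' |
        # Lines look like "path:count"; keep the path when count > MIN.
        awk -F: -v min="$min" '$NF > min { sub(/:[0-9]+$/, ""); print }' |
        xargs -r -d '\n' cp -t "$dest"
}

# Example call (hypothetical folder names):
# copy_large_fasta . large_files_folder
```

Comparing on the last field (`$NF`) and stripping the trailing count keeps file names that themselves contain a colon intact.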
Small nitpick: even though the explanation has it right, the -m option for grep in the actual command is set to 5 instead of 50. Interesting use of -m; it makes sense.
If I understand correctly, the point is computing time. If one intends to parallelize, the whole job should be done with xargs rather than a while or for loop, which processes one sample at a time. That way not only the copying into the target folder but also the counting of the FASTA headers runs in parallel. This is my understanding; correct me if I am wrong.
Yes, parallelization is the answer. In this case I/O may still be the limiting factor, but as a general rule it is better to get used to avoiding for loops.
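One caveat on the parallelization point: plain xargs batches its arguments but still runs one command at a time; it is the `-P` option that starts several processes at once. A small sketch of the counting step with `-P`, assuming GNU xargs. The helper name `count_headers_parallel` and the batch size of 200 are arbitrary choices for illustration.

```shell
#!/usr/bin/env bash
# Sketch: count FASTA header lines per file using several grep processes.
# count_headers_parallel is a hypothetical helper name. Requires GNU xargs.
count_headers_parallel() {
    local dir=$1 jobs=${2:-4}
    # -P runs up to $jobs grep processes concurrently; -n 200 gives each
    # process at most 200 files, so 4000 files are split into 20 batches.
    # Output order depends on which process finishes first, hence the sort.
    find "$dir" -type f -print0 |
        xargs -0 -r -P "$jobs" -n 200 grep -c -H '^>' |
        sort
}

# Example call: count headers under the current folder with 8 processes.
# count_headers_parallel . 8
```

Each output line has the form `path:count`, so it can be piped into the same awk filter used in the answer. Whether this is actually faster depends on the disk, as noted above.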
2nd line: fgrep counts the sequences in each file, awk selects the files with more than 50 sequences, and cp -t copies those files into the target directory.
To move the files into the folder instead of copying them, replace cp -t with mv -t.
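The command this explanation refers to is not shown here, so the following is only a guess at what the described steps could look like, assuming GNU grep, awk, xargs and cp. The function name `copy_fasta_over_50` and its arguments are hypothetical, and `grep -F` stands in for `fgrep`, which current GNU grep treats as obsolescent.

```shell
#!/usr/bin/env bash
# Sketch of the grep / awk / cp -t steps described above. The original
# command was not preserved, so names and layout here are assumptions.
# Requires GNU xargs (-r, -d) and GNU cp (-t).
copy_fasta_over_50() {
    local src=$1 dest=$2
    mkdir -p "$dest"
    # grep -F -c prints "file:count" for lines containing '>';
    # -H forces the file name, -s hides errors for directories in the glob.
    grep -F -c -H -s '>' "$src"/* |
        # Keep the file name when the count (last field) is above 50.
        awk -F: '$NF > 50 { sub(/:[0-9]+$/, ""); print }' |
        # Swap cp for mv here to move the files instead of copying them.
        xargs -r -d '\n' cp -t "$dest"
}

# Example call (hypothetical folder names):
# copy_fasta_over_50 . target_folder
```

Because `-F '>'` matches the character anywhere on a line, this relies on `>` appearing only in header lines, which holds for well-formed FASTA files.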