copy files with >50 sequences to a folder
6
1
Entering edit mode
8.5 years ago
natasha ▴ 110

Hi

I have a folder with 4000 files in it. I would like make a new folder, and copy into it all of the files which contain >50 fasta sequences. How do I do this?

I know that I need to create a simple loop, then grep '>' | wc -l and select only those with >50. But I am new to programming and am unsure how to write this properly.

fasta loop Bash • 3.8k views
ADD COMMENT
0
Entering edit mode

There are a number of solutions posted below. Please choose as many you like as "acceptable" (use the check mark against the answers below). All of them should work for your question.

It is a good practice to accept "solutions" as it shows your appreciation for the effort people put in to bring solutions for your questions.

ADD REPLY
7
Entering edit mode
8.5 years ago
venu 7.1k

Here is how

grep -c '^>' *.fasta | sed 's/:/ /' | awk '$2>50 {print $1}' | xargs mv -t new_folder/

Update

grep -c '^>' *.fasta | sed 's/:/ /' | awk '$2>50 {print $1}' | xargs cp new_folder/

P.S: Updated as @genomax2 pointed out.

ADD COMMENT
0
Entering edit mode

Since @natasha asked for "copy" xargs cp new_folder may be the better.

ADD REPLY
0
Entering edit mode

Oops! Thanks for pointing out.

ADD REPLY
0
Entering edit mode

@venu it would be better if you just edit the command as @genomax pointed out.

ADD REPLY
5
Entering edit mode
8.5 years ago

This is a very interesting question, but you should implement it without a FOR loop, which are a bad thing.

For example you can do it with find:

mkdir -p smallfiles_folder/
find .  -type f  -print0 | xargs -0 grep -m 50 -o -H -c '>'  | gawk -F":" '$2 > 2 {print $1} ' | grep -v 'smallfiles_folder' | xargs -i cp {} smallfiles_folder/

Explanation:

# find all files in current folder. Add -iname "*fasta" to restrict to fasta files only
find .  -type f  -print0 |     

  # count number of fasta headers in file. The -m 50 option is used to stop grepping after 50 matches, saving some calculation time) \
 xargs -0 grep -m 50 -o -H -c '^>'  | 

 # Use files to select files with less than 50 matches
 gawk -F":" '$2 < 50 {print $1} ' | 

 # Remove target folder (avoid potential recursive loops)
 grep -v 'smallfiles_folder' | 

 # copy / move files to target folder
 xargs -i cp {} smallfiles_folder/
ADD COMMENT
1
Entering edit mode

Small nitpick. Even though the explanation part has it right the -m option for grep in actual command is set to 5 instead of 50. Interesting use of -m, makes sense.

Why would a for loop be bad?

ADD REPLY
0
Entering edit mode

If I understand correctly it should be the computing time, if one is intending parallelization then the entire creation should be done using xargs rather than while or for loop that will process one sample at a time. So it not only the target folder creation but also the counting of the fasta headers are also in parallel. This is what I understand. Correct me if I am wrong.

ADD REPLY
0
Entering edit mode

Yes, parallelization is the answer. In this case IO may still be a limiting factor, but as general rule it is better to get used to avoiding for loops in general.

ADD REPLY
0
Entering edit mode

Would be interesting to see if there's any speed up with how IO heavy something like this is.

ADD REPLY
0
Entering edit mode

Thanks for spotting the typo, I've fixed it now.

ADD REPLY
4
Entering edit mode
8.5 years ago

loop over the files in DIR1, use awk to print the files having more than 50 fasta sequences, read and copy those files

$ ls DIR1/*.fa | while read F; do  awk '/^>/ {N++;} END {if(N>50) printf("%s\n",FILENAME);}' "${F}" | while read F2; do cp "$F2" DIR2 ; done ; done
ADD COMMENT
2
Entering edit mode
8.5 years ago
pld 5.1k
from=$1
to=$2
minlen=$3
for f in $from
do
    if [ (grep ">" | wc -l) -gt $minlen ]
    then
        cp $f $to
    fi
done

Suppose you save this as move.sh, you'd run it as:

move.sh {directory with files} {directory you want them moved to} 50
ADD COMMENT
1
Entering edit mode

grep has an option -c to count the occurrence, so grep ">" | wc -l can be written as grep -ce ">"

ADD REPLY
0
Entering edit mode

I don't know why I always forget grep can count.

ADD REPLY
0
Entering edit mode

Because grep being able to count is the UNIX equivalent of feature creep :P

ADD REPLY
0
Entering edit mode

@joe: "mv" should be replaced with a "cp" to match the original request.

ADD REPLY
0
Entering edit mode

Yep, fixed it. Really need to not post first thing in the morning.

ADD REPLY
1
Entering edit mode
8.5 years ago
5heikki 11k
mkdir newDir; for f in *.fna; do seqCount=$(grep -c "^>" $f); if [ "$seqCount" -gt 50 ]; then cp $f newDir; fi; done
ADD COMMENT
1
Entering edit mode
8.5 years ago
Prakki Rama ★ 2.7k

Copying files with more than 50 sequences into another directory

 mkdir target_folder \| 
 cp -t target_folder $(LC_ALL=C fgrep -c '>' *.fasta | sed 's/:/\t/g' | awk '$2 > 50 {print $0}' | cut -f1)

1st line: mkdir creates a target folder

2nd line: fgrep counts the sequences in each file, AWK will choose those files with more than 50 sequences, cp -t command copies the files into target directory

In the case of moving files into a folder instead of copying, replace cp -t with mv -t

ADD COMMENT
0
Entering edit mode

Why is everyone posting a mv solution when the original request is for copy?

ADD REPLY
0
Entering edit mode

exactly my point. It seems not everyone is reading the post properly.

ADD REPLY
0
Entering edit mode

Original request is for more than 50 ("contain >50 fasta sequences"). A few solutions go with ≥ 50.

ADD REPLY
0
Entering edit mode

@genomax2, @vchris_ngs : Sorry for overlooking! I updated it now. Thanks.

ADD REPLY

Login before adding your answer.

Traffic: 1365 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6