Hi all!
I have ~2k text files, each with ~1k protein names (one name per line), and I need to extract the sequences of these proteins from a large master fasta file that contains ~5.5 million sequences. I wrote the code below for this task; it works, but it is taking a really long time to process even one text file.
    for file in `find text_files | grep .txt`;
    do
        echo $file
        file2=$(echo $file | sed -re 's/^.+\///')
        cut -c 1- ${file} | xargs -n 1 samtools faidx master_fasta.fs > ./fastas/$file2.fa
    done
Any ideas on better ways to go about this, particularly to speed things up?
Thank you for your suggestions.
You should show examples of the protein names, both from the name-list files and from the fasta file.
Also, did you edit the

    find text_files | grep .txt

part? I don't think this command would find anything at all. Anyway, I think

    xargs samtools faidx master_fasta.fs < ${file} > ./fastas/$file2.fa

would be faster than

    cut -c 1- ${file} | xargs -n 1 samtools faidx master_fasta.fs > ./fastas/$file2.fa

edit: you can also do

    samtools faidx master_fasta.fs $(cat ${file}) > ./fastas/$file2.fa

edit 2: I didn't state it clearly, but I believe what is slowing your command down is xargs -n 1, which is not necessary anyway.
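Putting it together, a minimal sketch of the whole loop with -n 1 dropped (assuming the name lists sit directly in text_files/, the names contain no spaces, and ./fastas already exists; the paths and output naming are placeholders to adjust):

    for file in text_files/*.txt; do
        # strip the directory and the .txt suffix for the output name
        file2=$(basename "$file" .txt)
        # hand xargs the whole name list so samtools gets many names per invocation
        xargs samtools faidx master_fasta.fs < "$file" > ./fastas/"$file2".fa
    done

Batching the names this way means you pay samtools' start-up and index-loading cost once per batch instead of once per protein.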
Thank you for your suggestions, and all the suggestions below. I would like to report that I have run all of them, and your solution

    xargs samtools faidx master_fasta.fs < ${file} > ./fastas/$file2.fa

was the quickest. faSomeRecords also sped it up a lot, but this solution was still quicker.
Thanks again!
The faSomeRecords utility from Jim Kent at UCSC should be a fast way to do this. Additional options in: Extract reads from fasta file (specific read_names) and make a new fasta file
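If it helps, a quick usage sketch (the file names here are placeholders): faSomeRecords takes the input fasta, a file listing the record names you want, and an output path:

    faSomeRecords master_fasta.fs protein_names.txt output.fa

It also accepts -exclude to output every record except the ones in the list.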