Fast way to extract specific sequences from large fasta
mnsp088 ▴ 100 • 5.5 years ago

Hi all!

I have ~2k text files, each listing ~1k protein names (one name per line), and I need to extract those proteins' sequences from a large master fasta file containing ~5.5 million sequences. I wrote the code below to do this, and it works, but it takes a really long time to process even one text file.

for file in `find text_files | grep .txt`; do
    echo "$file"
    file2=$(echo "$file" | sed -re 's/^.+\///')    # strip the directory prefix
    # cut -c 1- just passes each whole line through; xargs -n 1 runs one samtools faidx call per protein name
    cut -c 1- "$file" | xargs -n 1 samtools faidx master_fasta.fs > ./fastas/"$file2".fa
done

Any ideas on better ways to go about this, particularly to speed things up?

Thank you for your suggestions.

sequencing fasta

You should show examples of the protein names from the protein names files and from the fasta file.

Also, did you edit the find text_files | grep .txt part? I don't think this command would find anything at all.

Anyway, I think

samtools faidx master_fasta.fs $(cut -c 1- ${file}) > ./fastas/$file2.fa

would be faster than

cut -c 1- ${file} | xargs -n 1 samtools faidx master_fasta.fs > ./fastas/$file2.fa

edit: you can also do

samtools faidx -r $file master_fasta.fs > ./fastas/$file2.fa

edit 2: I didn't state it clearly, but I believe what is slowing your command down is xargs -n 1, which launches a separate samtools process for every protein name and is not necessary anyway. A full-loop version is sketched below.
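
For reference, the whole loop rewritten around the region-file option might look like this. This is a minimal sketch, assuming samtools >= 1.9 (where, I believe, faidx gained -r/--region-file); the .fai index is created automatically on first use, and the output naming matches your original loop:

mkdir -p fastas
for file in text_files/*.txt; do
    file2=$(basename "$file")
    # one samtools invocation per protein list, instead of one per protein name
    samtools faidx -r "$file" master_fasta.fs > fastas/"$file2".fa
done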


Thank you for your suggestions, and for all the suggestions below. I have run all of them, and your solution

samtools faidx -r $file master_fasta.fs > ./fastas/$file2.fa

was the quickest. faSomeRecords also sped things up a lot, but this solution was still quicker.

Thanks again!


The faSomeRecords utility from Jim Kent at UCSC should be a fast way to do this.

Additional options in: Extract reads from fasta file (specific read_names) and make a new fasta file
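
Usage is a single command per list; a minimal sketch, assuming the compiled utility is on your PATH (it takes the input fasta, a file of record names, and an output fasta, in that order; file names here are placeholders matching the question's setup):

for file in text_files/*.txt; do
    # extract every record named in $file from the master fasta in one pass
    faSomeRecords master_fasta.fs "$file" fastas/"$(basename "$file" .txt)".fa
done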

Loïc • 3.0 years ago

I know this post is old, but I would suggest this approach using seqkit and GNU parallel.

The key here is to split the big fasta file into several parts.

For your particular case, I would do something like this:

seqkit split2 -p 100 big.fasta -O tmp   # the split fasta chunks are put into a folder called tmp

parallel -j 20 "seqkit grep -n -f {2} {1}" ::: tmp/*.fasta ::: folder_protein_files/*.txt > new.fa   # split2 writes plain fasta, so no zcat is needed; letting parallel collect stdout avoids 20 jobs appending to new.fa at once. You might need to tweak this because I don't know your exact setup.

With this approach, you first split your fasta file into several parts and then process them in parallel using seqkit and GNU parallel. Use parallel --dryrun first to check the commands it would run. A variant that keeps one output per protein list is sketched below.
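
If you want one output fasta per protein list, as in the original question, a minimal sketch (assuming seqkit scans the unsplit master fasta once per list; {/.} is GNU parallel's basename-without-extension placeholder):

mkdir -p fastas
# one job per protein list; each writes its own output fasta
parallel -j 20 "seqkit grep -n -f {} master_fasta.fs > fastas/{/.}.fa" ::: folder_protein_files/*.txt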

Hope it helps,

Loïc

Joe 21k • 5.5 years ago

Indexing is almost certainly the right way to do this. Even with Biopython, which will typically be slower, it's likely doable using SeqIO's index() method (https://biopython.org/wiki/SeqIO):

from Bio import SeqIO

# build an on-disk index of the fasta keyed by record ID;
# records are fetched lazily rather than loaded into memory
index = SeqIO.index('bigfasta.fa', 'fasta')
print(index['myfastaID'])

I don't have a massive fasta file to hand to actually test the performance, though.

It will likely still be slower than faidx or faSomeRecords, which are implemented in faster, compiled languages.
