Over the past few days, I've tried many methods to extract a subset of sequences from a multi-FASTA file based on header IDs. I've tried samtools, hpcgridrunner, Biopython, and various other FASTA extractor tools, but none of them worked well or efficiently. With samtools, it took me days to retrieve a 2-million-sequence subset from a multi-fasta file of 55 million sequences (8.4 GB on disk).
Today, I finally found a solution to the problem, and I'd like to share it so it may help others with similar problems. I'm not sure whether anyone has posted this solution before; if someone has, please link them to this post as well so it can help people along the way. Using the script below, it takes only about 10 minutes on my server to build a sorted one-record-per-line fasta, and about 2 minutes to extract a 2-million-sequence subset by header ID.
The script was run on a Linux system.
The steps are as follows:
#1. Convert a multi-line multi-fasta file <$fasta> into a one-record-per-line multi-fasta <$converted_fasta> and sort the result. **(replace $fasta and $converted_fasta with real file names)**
awk '/^>/ {printf("\n%s__",$0);next; } { printf("%s__",$0);} END {printf("\n");}' $fasta | sort -n > $converted_fasta
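To see what this conversion produces, here is a toy run (demo.fasta and demo_converted.fasta are illustrative names, not part of the original post):

```shell
# Build a tiny multi-line fasta to demonstrate the conversion
printf '>seq2\nACGT\nTTGG\n>seq1\nCCAA\n' > demo.fasta

# Same command as step 1: join each record onto one line with "__"
awk '/^>/ {printf("\n%s__",$0);next; } { printf("%s__",$0);} END {printf("\n");}' demo.fasta | sort -n > demo_converted.fasta

cat demo_converted.fasta
# One line per record (plus one harmless blank line, from the "\n"
# printed before the first header):
#   >seq1__CCAA__
#   >seq2__ACGT__TTGG__
```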
#2. Sort your ID headers and remove duplicates. (replace $GOOD_ID and $GOOD_ID_sorted with real file names)
sort -u $GOOD_ID > $GOOD_ID_sorted
#3. Use fixed-string matching (grep -F, the modern form of fgrep) with LC_ALL=C to extract every record whose header is in the ID list. The search runs against $converted_fasta, and the awk step splits each matched line back into its original lines. Appending the "__" delimiter to each ID anchors it, so that ">read1" cannot match ">read10" (grep's -w flag does not help here, because it treats the underscore as a word character).

sed 's/$/__/' $GOOD_ID_sorted | LC_ALL=C grep -F -f - $converted_fasta | awk -F'__' '{for(i=1; i<NF; i++) print $i}' > $GOOD_READS.fasta
#4. If you wish to extract the records that did not match the header IDs, add grep's -v flag to invert the match.

sed 's/$/__/' $GOOD_ID_sorted | LC_ALL=C grep -F -v -f - $converted_fasta | awk -F'__' '{for(i=1; i<NF; i++) print $i}' > $UNMATCHED_READS.fasta
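The four steps above can be run end to end on toy data as follows (all file names here are stand-ins for the $fasta, $GOOD_ID, and output placeholders):

```shell
# Toy inputs (stand-ins for $fasta and $GOOD_ID)
printf '>read1\nACGT\nTTGG\n>read2\nCCAA\n>read10\nGGGG\n' > input.fasta
printf '>read1\n>read10\n' > good_ids.txt

# Step 1: one record per line, lines joined by "__"
awk '/^>/ {printf("\n%s__",$0);next} {printf("%s__",$0)} END {printf("\n")}' input.fasta | sort -n > converted.fasta

# Step 2: unique, sorted IDs
sort -u good_ids.txt > ids_sorted.txt

# Step 3: append the "__" delimiter to each ID so ">read1" cannot match
# ">read10", then fixed-string grep and restore the multi-line structure
sed 's/$/__/' ids_sorted.txt | LC_ALL=C grep -F -f - converted.fasta | awk -F'__' '{for(i=1;i<NF;i++) print $i}' > matched.fasta

# Step 4: inverse match for everything else (the blank line from step 1 is
# dropped by the awk reconstruction, since it has no "__" fields)
sed 's/$/__/' ids_sorted.txt | LC_ALL=C grep -F -v -f - converted.fasta | awk -F'__' '{for(i=1;i<NF;i++) print $i}' > unmatched.fasta
```

matched.fasta now holds read1 and read10 with their original line breaks, and unmatched.fasta holds read2.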
The total runtime for the whole procedure is about 10 minutes. I hope this post helps someone in the future. Please comment below or correct the code, as I am a beginner in computer science and bioinformatics.
Update: I updated the script so that it now preserves the original fasta line structure. It converts every newline ("\n") into "__", and step 3 converts "__" back into "\n". Please note that if "__" occurs anywhere in your fasta lines, you should replace it with a delimiter that does not appear in the file.
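The delimiter swap mentioned in the update can be sketched like this; "@@@" and the file names are purely illustrative, and any marker absent from your data works the same way:

```shell
DELIM='@@@'   # any marker that never occurs in the fasta

# A fasta whose header and sequence legitimately contain "__"
printf '>x__1\nAC__GT\n' > odd.fasta

# Sanity check: refuse a delimiter that appears in the data
if grep -q "$DELIM" odd.fasta; then
    echo "delimiter $DELIM occurs in the fasta; pick another" >&2
fi

# Step 1 with the custom delimiter passed in via -v
awk -v d="$DELIM" '/^>/ {printf("\n%s%s",$0,d);next} {printf("%s%s",$0,d)} END {printf("\n")}' odd.fasta | sort -n > odd_converted.fasta

# Restoring uses the same delimiter as the awk field separator
awk -F"$DELIM" '{for(i=1;i<NF;i++) print $i}' odd_converted.fasta > roundtrip.fasta
```

roundtrip.fasta is byte-identical to odd.fasta, so the conversion is lossless as long as the delimiter never appears in the input.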
faSomeRecords from Jim Kent's Utilities (UCSC).
This is a terrific piece of software: so efficient and fast that sometimes you suspect it did not work, although it actually did its job.
From BBMap:
filterbyname.sh in=[file] out=[outfile] names=[string,string,string] include=[true or false]
It works on fasta, fastq, and sam. The names can either be literal names (like "E.coli") or files containing one name per line. It also allows substring matching, such as "NC_000913.3" when all you know is the accession number and the full name looks like "ncbi|511145 NC_000913.3 Escherichia coli str. K-12 substr. MG1655". The include toggle lets you either keep only the names in the list, or keep everything else.
Does this one work for retrieval based on query FASTA IDs? For example, I have a long list of query IDs from an Illumina sequencing file; does BBMap efficiently retrieve a subset of the Illumina reads in FASTA format?
Yes, it does; filterbyname is extremely fast (except in substring mode). Just be sure that the query names are either in a proper sequence file (fasta or fastq) or in a text file with one query per line, without leading ">" or "@" symbols, which are not part of the sequence names.
However, you can use the "ths" flag (truncateheadersymbol) to ignore a leading ">" or "@" if you provide a text file of headers that includes those symbols.
Pyfaidx: Efficient, "Pythonic" Random Access To Fasta Files Using Samtools-Compatible Indexing