Question

Tutorial:Extracting subset of records from FASTA/FASTQ files based on exact/pattern matches of IDs (ONE-LINERS)

4

Entering edit mode

10.6 years ago

umer.zeeshan.ijaz ★ 1.8k

Here is the tutorial.

For a change, I have also given detailed explanation of how I construct my one-liners. Btw, I initially wrote the tutorial on biostars but it wouldn't let me post it because of exceeding word limit, so I just hosted it on my website.

Best Wishes,
Umer

Edited 18/4/2014: Updated the tutorial to discuss --buffer-size and filtering out contigs from velvet assemblies

fastq fasta • 11k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 10.6 years ago by umer.zeeshan.ijaz ★ 1.8k

0

Entering edit mode

There is also bioawk and biopieces

ADD REPLY • link 10.6 years ago by Martin A Hansen 3.0k

1

Entering edit mode

Yeah, there are many tools out there that one can use. Infact I was using fastagrep.pl before. I should have emphasised on "in the absence of any software how you grep for records". Furthermore, my intention is to show how powerful pcregrep is (and hence I have put this post as a tutorial and not a tool). Probably on the same principles, using extended perl regular expressions, people can extract records from other formats, e.g., extracting alignments from the output files generated from water and needle in EMBOSS utilities.

Best Wishes, Umer

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by umer.zeeshan.ijaz ★ 1.8k

2

Entering edit mode

Good documents on perl regex. However, pcregrep is part of PCRE. It is not a standard Unix tool, either. In addition, PCRE should be used with caution. It is very slow on some patterns due to the defect in its algorithm, much slower than grep. I am also not sure how pcregrep works if we let it match thousands of IDs. Does it tries each pattern on each line of input? That would be very slow. Fgrep is fast because it builds a finte-state machine for the patterns instead of attempting each pattern against each line.

As to pure Unix one-liner - to extract from FASTA:

awk 'BEGIN{while((getline<"id.txt")>0)l[">"$1]=1}/^>/{f=l[$1]?1:0}f'

To extract from 4-line FASTQ:

awk 'BEGIN{while((getline<"id.txt")>0)l["@"$1]=1}NR%4==1{f=l[$1]?1:0}f'

In daily works, I mostly use seqtk subseq instead of these one-liners.

(PS: I have problems with the new editor; much prefer the old MarkDown editor)

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by lh3 33k

0

Entering edit mode

I don't believe it is "very" slow. I have generated some timings below. Yes, it is not available in most standard distros. I am not using experimental PCRE that should be used with caution and I also encourage others not to use them until they are stable enough. As for your question on how it will work with thousands of IDs, there is an example given at the end of this comment (extracting contigs > 1000bp). Furthermore, I believe that the version I am using is running smoothly, though I'd appreciate it if you could provide me with a reference for the defect that you have mentioned.

I have a contig file with 93587 sequences:

$ grep -c ">" contigs.fa
93587

Here is the last entry in the contigs file:

$ grep -i ">" contigs.fa | tail -1
>NODE_143740_length_141_cov_2.468085

And we will use that in IDs.txt

$ cat IDs.txt
NODE_143740_length_141_cov_2.468085

Here is the timing for what you wrote:

$ time awk 'BEGIN{while((getline<"IDs.txt")>0)l[">"$1]=1}/^>/{f=l[$1]?1:0}f' contigs.fa
>NODE_143740_length_141_cov_2.468085
CCGAGATCTACACTAGATCGCTCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGGAATGGGATGGAATGGAATGGAATGAAATAGAGATGAGTGGGGTGGAGGGGATTGCAGTGGAATGGAGCAGATTGGAATGGGATGGAGTGGAATGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGTACTAGATCTCGTATGCc

real    0m2.504s
user    0m2.469s
sys    0m0.030s

Here is the timing for FASTAgrep

$ alias FASTAgrep="awk '{gsub(\"_\",\"\\\_\",\$0);\$0=\"(?s)^>\"\$0\".*?(?=\\\n(\\\z|>))\"}1' | pcregrep -oM -f -"

$ time cat IDs.txt | FASTAgrep --buffer-size=100000000 contigs.fa
>NODE_143740_length_141_cov_2.468085
CCGAGATCTACACTAGATCGCTCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGGAATGGGATGGAATGGAATGGAATGAAATAGAGATGAGTGGGGTGGAGGGGATTGCAGTGGAATGGAGCAGATTGGAATGGGATGGAGTGGAATGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGTACTAGATCTCGTATGCc

real    0m0.516s
user    0m0.424s
sys    0m0.085s

And here is the timing for extracting contigs > 1000bp

$ time echo "NODE_\d+_length_(\d){4,}_" | FASTAgrep --buffer-size=100000000 contigs.fa | wc -l
3778

real    0m0.939s
user    0m0.787s
sys    0m0.164s

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by umer.zeeshan.ijaz ★ 1.8k

1

Entering edit mode

If your IDs.txt only contains one line, grep will be faster. Try a IDs.txt with thousands of IDs/patterns. That is what I don't know but you have not evaluated it (you are still using one pattern). PCRE uses a backtrack regex engine. It is an exponential algorithm. See this page for an example. In practice, PCRE is frequently slower than the "right" regex libraries when you use the | operator.

EDIT: I have tried to extract 19968 IDs from a FASTQ containing 1 million 100bp reads with unique integer IDs. seqtk takes 1.1s, BWK awk 5.9s, gawk 3.7s and bioawk takes 3.8s (bioawk is built upon BWK awk, but it has a faster line-reading engine). Your method takes 10 min before I kill it. Probably my speculation is true - like grep, pcregrep is attempting each pattern against each line in the input. It is performing 19968*1m*4 regex matching. In fact, your method is unable to give the right sequences as it only matches prefixes, not the entire identifiers. On the bright side, your method matches with regex. My approach requires exact identifier match.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by lh3 33k

0

Entering edit mode

Thanks for sharing the page. I really appreciate it! The overarching theme in my case is pattern based matching and for exact matches I would certainly use your recommendations. I purposefully built prefix matches into FASTAgrep and taking .*? out should work. Plus, for 100bp reads I would certainly make the --buffer-size smaller to speed them up!

Well done for pointing this out to me and taking time to QA my code!

Best Wishes,

Umer

ADD REPLY • link 10.6 years ago by umer.zeeshan.ijaz ★ 1.8k

0

Entering edit mode

Hi lh3, yep, you are right, unfortunately, it doesn't do well at all for extracting thousands of IDs. I tried extracting 19968 IDs from 205591 reads each being 250bp. So, I guess my one-liner is ONLY useful for fewer exact IDs but quite useful for pattern matching.

Thanks once again for pointing this out!

time cat IDs_test.txt | FASTAgrep --buffer-size=350 test.fasta > test_extracted.fasta

real    237m36.978s
user    237m7.311s
sys     0m1.058s

Best Wishes, Umer

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by umer.zeeshan.ijaz ★ 1.8k

0

Entering edit mode

Dear Umer,

I want to extract a subset of sequences in a number of fasta files (each is a gene with same samples), and I came across with the following code your wrote at your webpage:

cat IDs.txt | awk '{gsub("_","\\_",$0);$0="(?s)^>"$0".*?(?=\\n(\\z|>))"}1' | pcregrep -oM -f - file.fasta

I want to extract those sequences available in the IDs.txt, yet when I run the one-liner code, it only extract one sequence with its id which is the last one in the IDs.txt.

Would you know why it does not print the rest of sequences available in IDs.txt, which are existing in the fasta file.

Thank you
Hossein

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by hosseinv ▴ 20

0

Entering edit mode

Hi, I'm pretty new in Bioinf. and I'd like to know if I can subset fasta.file using a ID.txt that are the sequencies that I'd like to avoid.

Thanks
Ca

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.1 years ago by camilelug • 0