Entering edit mode
10.7 years ago
umer.zeeshan.ijaz
★
1.8k
Here is the tutorial.
For a change, I have also given detailed explanation of how I construct my one-liners. Btw, I initially wrote the tutorial on biostars but it wouldn't let me post it because of exceeding word limit, so I just hosted it on my website.
Best Wishes,
Umer
Edited 18/4/2014: Updated the tutorial to discuss --buffer-size
and filtering out contigs from velvet assemblies
There is also bioawk and biopieces
Yeah, there are many tools out there that one can use. Infact I was using fastagrep.pl before. I should have emphasised on "in the absence of any software how you grep for records". Furthermore, my intention is to show how powerful pcregrep is (and hence I have put this post as a tutorial and not a tool). Probably on the same principles, using extended perl regular expressions, people can extract records from other formats, e.g., extracting alignments from the output files generated from water and needle in EMBOSS utilities.
Best Wishes, Umer
Good documents on perl regex. However, pcregrep is part of PCRE. It is not a standard Unix tool, either. In addition, PCRE should be used with caution. It is very slow on some patterns due to the defect in its algorithm, much slower than grep. I am also not sure how pcregrep works if we let it match thousands of IDs. Does it tries each pattern on each line of input? That would be very slow. Fgrep is fast because it builds a finte-state machine for the patterns instead of attempting each pattern against each line.
As to pure Unix one-liner - to extract from FASTA:
To extract from 4-line FASTQ:
In daily works, I mostly use
seqtk subseq
instead of these one-liners.(PS: I have problems with the new editor; much prefer the old MarkDown editor)
I don't believe it is "very" slow. I have generated some timings below. Yes, it is not available in most standard distros. I am not using experimental PCRE that should be used with caution and I also encourage others not to use them until they are stable enough. As for your question on how it will work with thousands of IDs, there is an example given at the end of this comment (extracting contigs > 1000bp). Furthermore, I believe that the version I am using is running smoothly, though I'd appreciate it if you could provide me with a reference for the defect that you have mentioned.
I have a contig file with 93587 sequences:
Here is the last entry in the contigs file:
And we will use that in IDs.txt
Here is the timing for what you wrote:
Here is the timing for FASTAgrep
And here is the timing for extracting contigs > 1000bp
If your IDs.txt only contains one line, grep will be faster. Try a IDs.txt with thousands of IDs/patterns. That is what I don't know but you have not evaluated it (you are still using one pattern). PCRE uses a backtrack regex engine. It is an exponential algorithm. See this page for an example. In practice, PCRE is frequently slower than the "right" regex libraries when you use the
|
operator.EDIT: I have tried to extract 19968 IDs from a FASTQ containing 1 million 100bp reads with unique integer IDs. seqtk takes 1.1s, BWK awk 5.9s, gawk 3.7s and bioawk takes 3.8s (bioawk is built upon BWK awk, but it has a faster line-reading engine). Your method takes 10 min before I kill it. Probably my speculation is true - like grep, pcregrep is attempting each pattern against each line in the input. It is performing
19968*1m*4
regex matching. In fact, your method is unable to give the right sequences as it only matches prefixes, not the entire identifiers. On the bright side, your method matches with regex. My approach requires exact identifier match.Thanks for sharing the page. I really appreciate it! The overarching theme in my case is pattern based matching and for exact matches I would certainly use your recommendations. I purposefully built prefix matches into FASTAgrep and taking
.*?
out should work. Plus, for 100bp reads I would certainly make the--buffer-size
smaller to speed them up!Well done for pointing this out to me and taking time to QA my code!
Best Wishes,
Umer
Hi lh3, yep, you are right, unfortunately, it doesn't do well at all for extracting thousands of IDs. I tried extracting 19968 IDs from 205591 reads each being 250bp. So, I guess my one-liner is ONLY useful for fewer exact IDs but quite useful for pattern matching.
Thanks once again for pointing this out!
Best Wishes, Umer
Dear Umer,
I want to extract a subset of sequences in a number of fasta files (each is a gene with same samples), and I came across with the following code your wrote at your webpage:
I want to extract those sequences available in the IDs.txt, yet when I run the one-liner code, it only extract one sequence with its id which is the last one in the IDs.txt.
Would you know why it does not print the rest of sequences available in IDs.txt, which are existing in the fasta file.
Thank you
Hossein
Hi, I'm pretty new in Bioinf. and I'd like to know if I can subset fasta.file using a ID.txt that are the sequencies that I'd like to avoid.
Thanks
Ca