8.6 years ago
ayaou
Hello,
I'm looking for an awk command to delete all sequences whose header contains the pattern "unassigned peptidases" from a file. Here is an example of my data:
>MER0206023 - subfamily S1C unassigned peptidases (Abiotrophia defectiva) [S01.UPC]#S01C#{peptidase unit: 129-408}
GSGIIVDNDDKSLFIITNNHVIEGAKHLMVAFEDGTTAKGEVRGTAAYTDLALVEVKLSE
LDKKVINKIKVAKIGDSDGLKVGQMVMAIGNALGYGQSLTVGYVSARDRIVTVNDITMKL
IQTDAAINPGNSGGALLNLNGEVVGINSVKFSSRAIEGMGYAIPMATVKPLINELKSSKH
LTDTERGYLGIFYREIDDSTHEAFNLPYGLYISDVAKNGGAEKAGLLKGDIIIGLNDNET
LKSDAINSIILGKRKGDKVKVTFYRYENGEYVKHEVTVTL
Can you explain the command?
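For reference, a minimal sketch of such a filter, assembled from the two fragments quoted in the explanation that follows (the /^>/ block that sets OK and the block that prints when OK is set). It is not necessarily the exact command posted. The file names example.fasta and filtered.fasta and the two records are made up for illustration.

```shell
# Build a tiny two-record FASTA file to try the filter on (made-up records).
cat > example.fasta <<'EOF'
>seq1 - subfamily S1C unassigned peptidases (made-up record)
GSGIIVDNDDKSLFIITNNHVIEG
AKHLMVAFEDGTTAKGEVRG
>seq2 - some assigned peptidase (made-up record)
IVNGEEAVPGSWPWQVSLQD
EOF

# On each header line, set OK to 1 when the header does NOT contain the phrase.
# Every line (header or sequence) is then printed only while OK is set,
# so a rejected header also suppresses the sequence lines that follow it.
awk '/^>/ { OK = (index($0, "unassigned peptidases") == 0) } OK { print }' \
    example.fasta > filtered.fasta

cat filtered.fasta
```

Only the seq2 record should remain in filtered.fasta. Because index() does a plain substring search, the phrase is matched literally rather than as a regular expression.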
/^>/ {OK=index($0,"unassigned peptidases")==0;}: for a line starting with > (a FASTA header line), if it does not contain "unassigned peptidases", mark OK as true, and print this line ({if(OK) print;}).
For lines not starting with > (sequence lines), if OK is true, print these lines.

Thanks Pierre for showing the power of shell commands. I was just curious whether seqkit can reach the speed of awk, but the result left me surprised and doubtful.
Tests were performed on a 2.7G FASTA file on my laptop (i5, 2 cores/4 threads, SSD).
awk version:
seqkit version:
The results are the same.
awk is 10x slower??? I also tried running su -c 'free && sync && echo 3 > /proc/sys/vm/drop_caches && free' before every test to drop the page cache.

Of course. awk is a scripting language; seqkit is compiled and specialized for the job.