7.7 years ago
ayaou
Hello,
I'm searching for an awk command to delete all sequences whose header contains the pattern "unassigned peptidases" from a file. Here is an example of my data:
>MER0206023 - subfamily S1C unassigned peptidases (Abiotrophia defectiva) [S01.UPC]#S01C#{peptidase unit: 129-408}
GSGIIVDNDDKSLFIITNNHVIEGAKHLMVAFEDGTTAKGEVRGTAAYTDLALVEVKLSE
LDKKVINKIKVAKIGDSDGLKVGQMVMAIGNALGYGQSLTVGYVSARDRIVTVNDITMKL
IQTDAAINPGNSGGALLNLNGEVVGINSVKFSSRAIEGMGYAIPMATVKPLINELKSSKH
LTDTERGYLGIFYREIDDSTHEAFNLPYGLYISDVAKNGGAEKAGLLKGDIIIGLNDNET
LKSDAINSIILGKRKGDKVKVTFYRYENGEYVKHEVTVTL
Can you explain the command?

`/^>/ {OK=index($0,"unassigned peptidases")==0;}`: for a line starting with `>` (FASTA header line), if it does not contain `unassigned peptidases`, mark `OK` as true, and print this line (`{if(OK) print;}`). If the header does contain the pattern, `OK` becomes false and the header is skipped. For a line not starting with `>` (sequence line), print it only if `OK` is true, so sequence lines follow the fate of their header.

Thanks Pierre for showing the power of shell commands. I was curious whether seqkit can reach the speed of `awk`, but the result left me surprised and doubtful. The tests were performed on a 2.7 GB FASTA file on my laptop (i5, 2 cores/4 threads, SSD). I timed an `awk` version and a `seqkit` version. The results are the same. `awk` is 10x slower??? I also tried running `su -c 'free && sync && echo 3 > /proc/sys/vm/drop_caches && free'` before every test to drop the page cache.

Of course. awk is a scripting language; seqkit is compiled and specialized for the job.
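
For anyone who wants to try the filter explained above on a toy file first, here is a small sketch. The file names `demo.fa` and `filtered.fa` and both records are made up for illustration; the awk program is just the two pieces from the explanation joined together.

```shell
# Hypothetical demo input: two made-up records, one of them "unassigned peptidases".
cat > demo.fa <<'EOF'
>seq1 - subfamily S1C unassigned peptidases (made-up record)
GSGIIVDNDDKSLF
IITNNHVIEGAKHL
>seq2 - some other peptidase (made-up record)
MKKLLVAGTS
EOF

# On header lines, OK is set to 1 when the pattern is absent and 0 otherwise.
# Every line (header or sequence) is then printed only while OK is true.
awk '/^>/ {OK=index($0,"unassigned peptidases")==0;} {if(OK) print;}' demo.fa > filtered.fa

cat filtered.fa
```

Only the `seq2` header and its sequence line should remain in `filtered.fa`. Because `OK` is only updated on header lines, multi-line sequences are kept or dropped as a whole.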