Question

Delete fasta sequence with a pattern "unassigned peptidases"

0

Entering edit mode

8.6 years ago

ayaou • 0

Hello,

I'm searching for a "awk command" to delete all sequences with the pattern "unassigned peptidases" from a file. Here is an example of my data :

>MER0206023 - subfamily S1C unassigned peptidases (Abiotrophia defectiva) [S01.UPC]#S01C#{peptidase unit: 129-408}
GSGIIVDNDDKSLFIITNNHVIEGAKHLMVAFEDGTTAKGEVRGTAAYTDLALVEVKLSE
LDKKVINKIKVAKIGDSDGLKVGQMVMAIGNALGYGQSLTVGYVSARDRIVTVNDITMKL
IQTDAAINPGNSGGALLNLNGEVVGINSVKFSSRAIEGMGYAIPMATVKPLINELKSSKH
LTDTERGYLGIFYREIDDSTHEAFNLPYGLYISDVAKNGGAEKAGLLKGDIIIGLNDNET
LKSDAINSIILGKRKGDKVKVTFYRYENGEYVKHEVTVTL

sequence • 2.2k views

ADD COMMENT • link updated 8.6 years ago by Matt Shirley 10k • written 8.6 years ago by ayaou • 0

score 2 · Answer 1 · 2017-03-15

2

Entering edit mode

8.6 years ago

Pierre Lindenbaum 166k

 awk '/^>/ {OK=index($0,"unassigned peptidases")==0;} {if(OK) print;}' input.fa

ADD COMMENT • link 8.6 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

can you explain the command ?

ADD REPLY • link 8.6 years ago by 1769mkc ★ 1.3k

2

Entering edit mode

/^>/ {OK=index($0,"unassigned peptidases")==0;}, for a line starting with > (FASTA header line), if it does not contain unassigned peptidases, mark OK as true , and print this line ({if(OK) print;}).
for next lines not starting with > (sequence line), if OK is true, print these lines.
go to Step 1.

ADD REPLY • link 8.6 years ago by shenwei356 8.7k

0

Entering edit mode

Thanks Pierre for showing the power of shell commands. I'm just curious can seqkit reach the speed of awk, but the result makes me surprised and doubtful:

Tests are performed on a fasta file of 2.7G on my laptop with i5 2-cores/4-threads, SSD.

$ seqkit stat dataset_A.fa -a
file          format  type  num_seqs        sum_len  min_len   avg_len    max_len  sum_gap      N50    L50
dataset_A.fa  FASTA   DNA     67,748  2,807,643,808       56  41,442.5  5,976,145        0  210,659  2,477

awk version:

$ time awk '/^>/ {OK=index($0,"whole genome")==0;} {if(OK) print;}' dataset_A.fa > t.awk      
awk '/^>/ {OK=index($0,"whole genome")==0;} {if(OK) print;}' dataset_A.fa >   20.27s user 0.49s system 99% cpu 20.840 tota

seqkit version.

$ time seqkit grep -n -r -p 'whole genome' -v dataset_A.fa > t.seqkit            
seqkit grep -n -r -p 'whole genome' -v dataset_A.fa > t.seqkit  1.87s user 0.54s system 99% cpu 2.409 total

The results are the same:

$ md5sum t.awk t.seqkit 
c592747347ea4ea2e624a03d3360c01a  t.awk
c592747347ea4ea2e624a03d3360c01a  t.seqkit
$ seqkit stat t.awk 
file   format  type  num_seqs      sum_len  min_len      avg_len    max_len
t.awk  FASTA   DNA         54  131,799,401    3,624  2,440,729.6  5,976,145
$ ls -lh t.awk
-rw-rw-r--. 1 shenwei shenwei 128M Mar 16 14:45 t.awk

awk is 10x slower??? And I also tried run su -c 'free && sync && echo 3 > /proc/sys/vm/drop_caches && free' before run every tests to drop page cache.

ADD REPLY • link 8.6 years ago by shenwei356 8.7k

2

Entering edit mode

awk is 10x slower?

of course. awk is a scripting language . seqkit is compiled and specialized for the job.

ADD REPLY • link 8.6 years ago by Pierre Lindenbaum 166k

score 0 · Answer 2 · 2017-03-15

0

Entering edit mode

8.6 years ago

shenwei356 8.7k

using seqkit

seqkit grep -n -r -p 'unassigned peptidases' -v seqs.fa > new.fa

ADD COMMENT • link 8.6 years ago by shenwei356 8.7k

score 0 · Answer 3 · 2017-03-15

0

Entering edit mode

8.6 years ago

Matt Shirley 10k

$ pip install pyfaidx
$ faidx sequences.fa --regex '.*unassigned peptidases.*' --invert-match > no_peptidases.fa

You can find more usage for faidx here: https://github.com/mdshw5/pyfaidx#faidx

ADD COMMENT • link 8.6 years ago by Matt Shirley 10k

0

Entering edit mode

You forget the option -f:

-f, --full-names      output full names including description. default:
                        False

Even using it, it still does not work, since faidx uses FASTA index file but descriptions are not in it.

PS: I'm using pyfaidx==0.4.8.2

Test:

$ faidx t.fa
>seq
actg
$ cat t.fa.fai 
seq     4       11      4       5
$ faidx t.fa --regex s
>seq
actg
$ faidx t.fa --regex o
$ faidx t.fa --regex o -f
$

# ----

rm t.fa.fai
$ faidx t.fa -f --regex s
>seq hello
actg
$ faidx t.fa -f --regex o
$

ADD REPLY • link 8.6 years ago by shenwei356 8.7k