Split a file in Linux on a pattern match
7.1 years ago
skjobs1234 ▴ 40

I have a file whose contents follow a specific pattern. I would like to split it into multiple files at each pattern match, and name each file after the text that follows the pattern. Example:

P1_1r6r

NRVSTVQQLTKRFSLGMLQGRGPLKLFMALVAFLRFLTIPPTAGILKRWGTIKKSKAINV
LRGFRKEIGRMLNILNRRRRRVSTVQQLTKRFSLGMLQGRGPLKLFMALVAFLRFLTIP

P1_1sfk

MALVAFLRFLTIPPTAGILKRWGTIKKSKAINVLRGFRKEIGRMLNILNRRRRRVSTVQQ LTKRFSLGMLQGRGPLKLFMALVAFLRFLTIPPTAGILKRWGTIKKSKAINVLRGFRKEI

P1_12562

RFSLPLKLFMALVAFLRFLTIPPTAGILKRWGTIKKSKAINVLRGFRKEIGRM LNILNRRRRRVSTVQQLTKRFSLGMLQGRGPLKLFMALVAFLRFLTIPPTAGILKRWGTI

So here the pattern is P1. I want to split the above file into 3 different files, each holding the corresponding contents, with file names 1r6r, 1sfk, and 12562.

Thanks

sequence • 2.7k views

Your input format is not clear. Is it FASTA?


With awk and sed.

Input:

$ cat test.txt 
P1_1r6r
NRVSTVQQLTKRFSLGMLQGRGPLKLFMALVAFLRFLTIPPTAGILKRWGTIKKSKAINV
LRGFRKEIGRMLNILNRRRRRVSTVQQLTKRFSLGMLQGRGPLKLFMALVAFLRFLTIP
P1_1sfk
MALVAFLRFLTIPPTAGILKRWGTIKKSKAINVLRGFRKEIGRMLNILNRRRRRVSTVQQ LTKRFSLGMLQGRGPLKLFMALVAFLRFLTIPPTAGILKRWGTIKKSKAINVLRGFRKEI
P1_12562
RFSLPLKLFMALVAFLRFLTIPPTAGILKRWGTIKKSKAINVLRGFRKEIGRM LNILNRRRRRVSTVQQLTKRFSLGMLQGRGPLKLFMALVAFLRFLTIPPTAGILKRWGTI

command:

 $ sed -e 'N;s/\n/\t/;s/^P.*_//g'  test.txt | awk -F"\t" '{print $2 > $1}'

output:

$ ls
12562  1r6r  1sfk   test.txt

$ cat 12562 
RFSLPLKLFMALVAFLRFLTIPPTAGILKRWGTIKKSKAINVLRGFRKEI
GRMLNILNRRRRRVSTVQQLTKRFSLGMLQGRGPLKLFMALVAFLRFLTI
PPTAGILKRWGTI

Note: this assumes that all amino acids are on a single line after the identifier (that is, every second line is a sequence).
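For what it's worth, the same identifier/sequence pairing can probably be done in awk alone, which avoids depending on sed understanding `\t` in the replacement. This is only a sketch under the same assumption (one identifier line followed by exactly one sequence line); the temporary directory and the shortened sample sequences are made up for illustration.

```shell
# Hypothetical sample data in a scratch directory (shortened sequences).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'P1_1r6r' 'NRVSTVQQ' 'P1_1sfk' 'MALVAFLR' > test.txt

awk '
    NR % 2 == 1 {                  # odd lines: identifiers such as P1_1r6r
        name = $0
        sub(/^P[^_]*_/, "", name)  # strip the "P1_" prefix, leaving e.g. 1r6r
        next
    }
    { print > name; close(name) }  # even lines: the sequence for that identifier
' test.txt
```

After running it, `ls` in that directory should show `1r6r`, `1sfk` and `test.txt`. If a sequence spans several lines, this pairing breaks and a header-driven approach is needed instead.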

7.1 years ago

Maybe this is the desired output?

File: 1r6r

NRVSTVQQLTKRFSLGMLQGRGPLKLFMALVAFLRFLTIPPTAGILKRWGTIKKSKAINV
LRGFRKEIGRMLNILNRRRRRVSTVQQLTKRFSLGMLQGRGPLKLFMALVAFLRFLTIP

File: 1sfk

MALVAFLRFLTIPPTAGILKRWGTIKKSKAINVLRGFRKEIGRMLNILNRRRRRVSTVQQ 
LTKRFSLGMLQGRGPLKLFMALVAFLRFLTIPPTAGILKRWGTIKKSKAINVLRGFRKEI

File: 12562

RFSLPLKLFMALVAFLRFLTIPPTAGILKRWGTIKKSKAINVLRGFRKEIGRM 
LNILNRRRRRVSTVQQLTKRFSLGMLQGRGPLKLFMALVAFLRFLTIPPTAGILKRWGTI

Assuming that the data is in MyProtein.fasta, the following command produces this output (assuming FASTA headers such as '>P1_1r6r', '>P1_1sfk', et cetera):

awk -F"_" '/^>P1/ {file=$2; printf "" > file}; !/^>P1/ {print >> file}' MyProtein.fasta

If the headers are just 'P1_1r6r', 'P1_1sfk', et cetera (without the greater-than symbol):

awk -F"_" '/^P1/ {file=$2; printf "" > file}; !/^P1/ {print >> file}' MyProtein.fasta
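A slightly more defensive variant of the commands above, as a sketch rather than a definitive version: it accepts headers with or without the leading '>', skips the blank lines that appear in the question's example, and closes each output file before opening the next one, which may matter with many records because some awk implementations limit the number of simultaneously open files. The file name sample.txt and the shortened sequences are hypothetical, and identifiers are assumed not to contain a further underscore.

```shell
# Hypothetical sample data in a scratch directory (shortened sequences).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'P1_1r6r' '' 'NRVSTV' 'LRGFRK' '' 'P1_1sfk' '' 'MALVAF' > sample.txt

awk -F"_" '
    /^>?P1_/ {                       # header line, with or without a leading ">"
        if (file != "") close(file)  # release the previous output file
        file = $2                    # text after the first underscore names the file
        printf "" > file             # create (or truncate) the output file
        next
    }
    file != "" && NF { print > file }  # copy non-empty lines once a header was seen
' sample.txt
```

Within one awk run, `>` only truncates when the file is first opened, so the later `print > file` calls append to the file created by the header rule.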

If cpad0112's solution works, too, then let me know so that I can move it to an answer (or you can just upvote it to account for his/her efforts in helping out).

Also, if Pierre's comments were helpful, it would be beneficial to upvote them too.
