Question

Combine fasta files by matching partial header in specific format


5.3 years ago

waqaskhokhar999 ▴ 160

I have three FASTA files containing the protein sequences for each gene in xls format (space-separated). The first column contains the header, while the remaining columns contain the sequence. For example:

File1:

sample  1   2   3   4   5   6
BnaA03g18710D   M   A   A   A   V   S
BnaA03g18710D_S25   M   A   A   A   V   S
BnaA03g18710D_S31   M   A   A   A   V   S

File2:

sample  1   2   3   4   5   6
BnaA03g18710D_a M   A   A   A   V   S
BnaA03g18710D_S25_a M   A   A   A   V   S
BnaA03g18710D_S31_a M   A   A   A   V   S

File3:

sample  1   2   3   4   5   6
BnaA03g18710D_b M   A   A   A   V   S
BnaA03g18710D_S25_b M   A   A   A   V   S
BnaA03g18710D_S31_b M   A   A   A   V   S

I am interested in merging them in the following order:

sample  1   2   3   4   5   6
BnaA03g18710D   M   A   A   A   V   S
BnaA03g18710D_a M   A   A   A   V   S
BnaA03g18710D_b M   A   A   A   V   S
BnaA03g18710D_S25   M   A   A   A   V   S
BnaA03g18710D_S25_a M   A   A   A   V   S
BnaA03g18710D_S25_b M   A   A   A   V   S
BnaA03g18710D_S31   M   A   A   A   V   S
BnaA03g18710D_S31_a M   A   A   A   V   S
BnaA03g18710D_S31_b M   A   A   A   V   S

I have tried cat, sed and other commands but wasn't able to produce the desired format. Any help will be highly appreciated.

RNA-Seq • 962 views


Try to cat them together, sort them by the first column, and then remove the repeated header lines with grep -v 'sample'. To restore the header line, simply cat the first line of the first file together with the output of the steps just described. I am sure you can manage that.
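The same strategy (concatenate, drop the repeated header rows, sort on the first column, put one header row back on top) can be sketched in Python. One caveat: a plain byte-order sort places "_S25" before "_a", because uppercase letters come before lowercase ones in ASCII, so the sketch below sorts on an explicit (base ID, suffix) key instead. It assumes the file-specific suffix is a single lowercase letter after the last underscore, as in the example data; the function names are made up for illustration.

```python
import re

# Assumption: file-specific suffixes are a single lowercase letter after the
# last underscore ("_a", "_b"); everything before that is the base ID.
SUFFIX_RE = re.compile(r"^(?P<base>.+?)(?:_(?P<suffix>[a-z]))?$")


def sort_key(row):
    """Return (base ID, suffix) for one whitespace-separated data row."""
    header = row.split()[0]
    match = SUFFIX_RE.match(header)
    # Rows without a suffix get "" so they sort before "_a" and "_b".
    return (match.group("base"), match.group("suffix") or "")


def merge_tables(tables):
    """Merge tables given as lists of lines; line 0 of each is the column header."""
    column_header = tables[0][0]
    # Skip each table's header row and any blank lines.
    rows = [line for table in tables for line in table[1:] if line.strip()]
    return [column_header] + sorted(rows, key=sort_key)


def merge_files(paths):
    """Read the given files and return the merged lines."""
    tables = []
    for path in paths:
        with open(path) as handle:
            tables.append([line.rstrip("\n") for line in handle])
    return merge_tables(tables)
```

With the three files from the question saved under hypothetical names, `print("\n".join(merge_files(["file1.txt", "file2.txt", "file3.txt"])))` would print the merged table. If the real headers use a different suffix scheme, only `SUFFIX_RE` should need adjusting.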