Question

Print line based on partial match

7.2 years ago

leo1985.arnab ▴ 50

I have two files with several hundred entries in each. File 1 has several 5-base sequences, and file 2 has a higher number of entries with longer sequences. The first 5 bases of sequences in file 2 match those in file 1. I tried some grep and awk methods, but they did not work for a partial match like this. For example:

File 1:

       ATGCC
       TTGCA
       GGAAC

........
........

File 2:

ATTTCGGGAAAATT
ATGCCTTAAGACCT
GGAACTAAGGGGA
............
............

Expected outcome:

ATGCCTTAAGACCT
GGAACTAAGGGGA

Any help is much appreciated! Thanks!

sequence • 3.6k views

ADD COMMENT • link updated 3.2 years ago by Boggarapu • 0 • written 7.2 years ago by leo1985.arnab ▴ 50

score 0 · Answer 1 · 2017-09-05

7.2 years ago

shenwei356 8.7k

grep -f short.seq.file long.seq.file

ADD COMMENT • link 7.2 years ago by shenwei356 8.7k

Shenwei, thanks for the reply. But I already tried that grep option before posting the topic. It didn't work.

ADD REPLY • link 7.2 years ago by leo1985.arnab ▴ 50

It will definitely work, but you have to put ^ in front of the 5-letter sequences in File 1, so that each pattern only matches at the start of a line:

^ATGCC
^TTGCA
^GGAAC

If you don't want to use grep, then any program that separates sequences based on user-defined barcodes (flexbar, etc.) will do this for you.

ADD REPLY • link 7.2 years ago by george.ry ★ 1.2k
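
The anchoring suggested above can also be added on the fly instead of editing File 1 by hand. A minimal sketch, with a hypothetical function name. It assumes that stray whitespace or Windows line endings in File 1 could be one reason the plain grep -f attempt failed; the indentation in the posted example hints at this, but the thread does not confirm it.

```shell
# match_prefixes PREFIX_FILE SEQ_FILE   (hypothetical helper name)
# Strip carriage returns and surrounding whitespace from each prefix, drop
# blank lines, prepend ^ to anchor the match at the start of the line, then
# hand the patterns to grep on standard input (-f -).
match_prefixes() {
  tr -d '\r' < "$1" \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' -e 's/^/^/' \
    | grep -f - "$2"
}
```

It could be called as match_prefixes file1.txt file2.txt. Matching lines are printed in the order they appear in the second file.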

score 0 · Answer 2 · 2017-09-05

7.2 years ago

Sparrow_kop ▴ 260

Hi, since only the first 5 bases of the sequences in file 2 should match file 1, 'grep -f file1 file2' is not very robust: the pattern may also occur somewhere other than the first 5 bases. You can anchor the match with a regular expression in bash:

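#!/bin/bash

# For each prefix in file1.txt, print the lines of file2.txt that start with it.
cat file1.txt | while read -r pattern
do
    # Skip blank lines: an empty pattern would match every line of file2.txt.
    [ -n "$pattern" ] || continue
    grep "^$pattern" file2.txt
done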

ADD COMMENT • link 7.2 years ago by Sparrow_kop ▴ 260

Sparrow_kop, the script is working. Thanks! But only if the sequences are in the same order in both files. I mistakenly wrote earlier that the total number of sequences in the two files is identical; actually it is not. Apologies. File 2, with the longer sequences, has many more sequences. Either way, is there a way to bypass the order in the search? Sorting is probably not a good idea with sequences.

ADD REPLY • link 7.2 years ago by leo1985.arnab ▴ 50

Hi, I don't think I understand what you mean by 'same order'. Do you mean you want to match the reverse complement? Or do you mean the order of the sequences, for example alphabetical order? If it is the latter, the order does not matter: on each iteration of the loop, grep matches the pattern against all the sequences in file2, so you do not need to sort anything. It is also fine that the two files contain different numbers of sequences.

ADD REPLY • link 7.2 years ago by Sparrow_kop ▴ 260

This can also be written as (without cat):

while read pattern ; do grep "^$pattern" file2.txt ; done < file1.txt

ADD REPLY • link 7.2 years ago by Joe 21k
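
The loops above run grep once per prefix, so file2.txt is scanned once for every line of file1.txt. A single pass is an alternative for larger inputs. This is a sketch, not something proposed in the thread. It assumes every prefix is exactly 5 bases long (as in the question), that each file holds one sequence per line, and that the prefix file is not empty; the function name is hypothetical.

```shell
# filter_by_prefix PREFIX_FILE SEQ_FILE   (hypothetical helper name)
# While reading the first file (NR == FNR), store each non-empty prefix as
# an array key. For the second file, print a line when its first 5
# characters are one of the stored prefixes.
filter_by_prefix() {
  awk 'NR == FNR { if ($1 != "") want[$1]; next }
       (substr($1, 1, 5) in want)' "$1" "$2"
}
```

It could be called as filter_by_prefix file1.txt file2.txt. Unlike the loop, which groups its output by prefix, this prints matches in the order they appear in the second file, and the order of the prefixes in the first file has no effect.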

Hey, one thing I want to ask. I'm supposed to treat every line of one file as a pattern (n patterns in total) and match those n patterns against every line of file2. Can you tell me how to do this?

ADD REPLY • link 3.2 years ago by Boggarapu • 0
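
One possible reading of this last question: each line of the pattern file is a literal string, and every line of file2 that contains at least one of those strings should be printed. A sketch under that assumption, with a hypothetical function name. grep's -f option reads one pattern per line, and -F treats the patterns as fixed strings instead of regular expressions.

```shell
# match_any PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Print every line of TARGET_FILE that contains at least one of the literal
# strings listed, one per line, in PATTERN_FILE.
match_any() {
  grep -F -f "$1" "$2"
}
```

It could be called as match_any patterns.txt file2, where patterns.txt is a placeholder name. A blank line in the pattern file acts as an empty pattern and matches every line, so blank lines are best removed first.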