if I understood correctly, you are trying to avoid the base "N" in any of the first 5 bases of the sequences in the 1st and 2nd columns. if that's the case, here are a few ideas written as one-liners that will do the job.
this one prints all lines where the first 5 bases of the 1st or 2nd column are not all non-N (so it actually looks for an N there):
perl -lane 'print if $F[0] !~ /^[^N]{5}/ or $F[1] !~ /^[^N]{5}/' test.txt
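this one looks for the position of the N itself, and prints the line if the first N of the 1st or 2nd column falls within the first 5 bases (index returns -1 when there is no N, hence the >= 0 test):
perl -lane '($i,$j) = map { index($_,"N") } @F[0,1]; print if ($i >= 0 and $i < 5) or ($j >= 0 and $j < 5)' test.txt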
this one looks for an N in a string made up of the first 5 bases of the 1st and the 2nd columns:
perl -lane '$s = substr($F[0],0,5).substr($F[1],0,5); print if $s =~ /N/' test.txt
this one (which should be the fastest) looks for an N preceded by at most 4 bases, where \b represents a word boundary and \S any non-blank character (which could be replaced by [ACGT] to strictly look for known bases); note that it checks every column on the line, not only the 1st and 2nd:
perl -ne 'print if /\b\S{0,4}N/' test.txt
finally, this is probably the simplified awk alternative to the last perl idea that you were looking for, where \y represents the word boundary (\y and \S are GNU awk extensions, so this needs gawk):
awk '/\y\S{0,4}N/' test.txt
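if GNU awk is not available, a portable variant may be worth a try. this is only a sketch, assuming whitespace-separated columns; the file sample.txt, its contents and the function name early_n_awk are made up for illustration. instead of word boundaries it tests the first 5 characters of the 1st and 2nd columns with substr, which any POSIX awk should provide:

```shell
# Hypothetical three-line input with two sequence columns per line.
printf '%s\n' \
  'ACNGTACGTA ACGTACGTAC' \
  'ACGTACGTAC ACGTNACGTA' \
  'ACGTACGTAN ACGTACNGTA' > sample.txt

# Print lines where column 1 or column 2 has an N in its first 5 bases.
# awk's substr is 1-indexed, so substr($1, 1, 5) is the first 5 characters.
early_n_awk() {
  awk 'substr($1, 1, 5) ~ /N/ || substr($2, 1, 5) ~ /N/' "$1"
}

# Prints the first two lines only.
early_n_awk sample.txt
```

unlike the regex version, this one only looks at the first two columns, so extra columns cannot trigger a match.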
I just wanted to point out that there are always multiple ways to reach your goal, and that you don't necessarily need to stop thinking about how to do a particular thing even if you have already found an answer. you always have to consider how easy it is to come up with other solutions (to invest time and not waste it), how well they will perform, how robust they are, and so on.
Something like this may work (I did not test it).