Question

how to remove N loci in a global alignment for all strains

0

5.4 years ago

yreynaud • 0

Hi everyone, I have a long sequence alignment with lots of Ns corresponding to recombination events. I want to remove every position that has an N in any strain, and remove it from all strains. Like this:

>seq1
ATTNNC
>seq2
NAAGGC

I want to get this:

>seq1
TTC
>seq2
AAC

Any idea how to do this automatically? Thanks in advance ;o)

alignment • 1.2k views

ADD COMMENT • link updated 5.4 years ago by Sishuo Wang ▴ 230 • written 5.4 years ago by yreynaud • 0

1

Do all the aligned sequences have the same length? If so, you can try this:

grep -v ">" test.fa  | grep -aob 'N' | awk 'BEGIN{FS=OFS=":"}{print ($1 + 1) % 7}' | sort -k 1,1n | uniq | paste -s -d ',' | xargs -I {} awk '{if(/>.*/) {print}  else {system("echo "$0" | cut --complement -c {}")}}' test.fa

test.fa

>seq1
ATTNNC
>seq2
NAAGGC

output:

>seq1
TTC
>seq2
AAC

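Note that the number after "%" (7 here) should be the number of nucleotides per sequence line + 1, because grep counts the newline at the end of each line as one byte. This also means each sequence has to sit on a single, unwrapped line.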
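If you would rather not count the alignment length by hand, the value can be read from the file. This is only a sketch: `aln_modulus` is a made-up helper name, not part of the one-liner above, and it assumes the first sequence line is representative of all the others.

```shell
# Hypothetical helper (not from the original answer): print the value to use
# after "%", i.e. the length of the first sequence line plus 1 for the newline.
aln_modulus() {
  awk '!/^>/ { print length($0) + 1; exit }' "$1"
}
```

Running `aln_modulus test.fa` on the example file gives the number to substitute for 7 in the pipeline.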

The idea is:

grep -v ">": remove lines with ">", so the remaining lines are all sequences

grep -aob 'N': get the byte offset of every 'N' (0-based, counted from the start of the input)

awk 'BEGIN{FS=OFS=":"}{print ($1 + 1) % 7}': extract the offset, add 1 (to make it 1-based), then take it modulo 7 to get the position of each N within its own line

sort -k 1,1n | uniq: drop duplicated positions; the result is every position where N occurs at least once in any sequence

paste -s -d ',': join the positions into one comma-separated list, to be used as the parameter in the next step

xargs -I {} awk '{if(/>.*/) {print} else {system("echo "$0" | cut --complement -c {}")}}' test.fa: final step; if a line contains ">", print it as is; otherwise keep only the characters at positions where N never occurs (--complement is a GNU cut option)
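The same idea can also be sketched as a two-pass awk script, which avoids the byte-offset arithmetic and the hard-coded line length. This is an alternative under the same assumptions as the one-liner, not the method described above: `remove_n_columns` is a hypothetical function name, each sequence must sit on a single unwrapped line, and only uppercase N is matched.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original answer): drop every alignment
# column in which at least one sequence has an N.
# Assumes each sequence sits on a single, unwrapped line.
remove_n_columns() {
  local fasta="$1"
  # The file is passed twice so that awk can make two passes over it.
  awk '
    # Pass 1 (NR == FNR): record each column that contains an N.
    NR == FNR {
      if ($0 !~ /^>/) {
        n = length($0)
        for (i = 1; i <= n; i++)
          if (substr($0, i, 1) == "N") drop[i] = 1
      }
      next
    }
    # Pass 2: print header lines unchanged.
    /^>/ { print; next }
    # Pass 2: rebuild each sequence from the columns that were never flagged.
    {
      out = ""
      n = length($0)
      for (i = 1; i <= n; i++)
        if (!(i in drop)) out = out substr($0, i, 1)
      print out
    }
  ' "$fasta" "$fasta"
}
```

Usage would be something like `remove_n_columns test.fa > test.noN.fa` (the output file name is just an example). Because it builds the output inside awk, it does not start one `echo | cut` process per sequence, which may matter for long alignments with many strains.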

ADD REPLY • link 5.4 years ago by Jianyu ▴ 580

score 0 · Answer 1 · 2019-12-09

0

5.4 years ago

Sishuo Wang ▴ 230

Maybe trimseq?

ADD COMMENT • link 5.4 years ago by Sishuo Wang ▴ 230