how to remove N loci in a global alignment for all strains
1
0
Entering edit mode
5.0 years ago
yreynaud • 0

Hi everyone, I have long sequence alignment with lots o N corresponding to recombination events that I want to remove for all strains. like this:

>seq1
ATTNNC
>seq2
NAAGGC

I want to get this:

>seq1
TTC
>seq2
AAC

Any idea to do it automatically???? Thanks in advance ;o)

alignment • 1.1k views
ADD COMMENT
1
Entering edit mode

Do all alignments have the same length? if it is, you can try this

grep -v ">" test.fa  | grep -aob 'N' | awk 'BEGEIN{FS=OFS=":"}{print ($1 + 1) % 7}' | sort -k 1,1n | uniq | paste -s -d ',' | xargs -I {} awk '{if(/>.*/) {print}  else {system("echo "$0" | cut --complement -c {}")}}' test.fa

test.fa

>seq1
ATTNNC
>seq2
NAAGGC

output:

>seq1
TTC
>seq2
AAC

Note the number after "%" (which is 7 here) should be the number of nucleotides + 1

The idea is:

  1. grep -v ">": remove lines with ">", so the remaining lines are all sequences
  2. grep -aob 'N': get the bytes offset of all 'N' (0-based)
  3. awk 'BEGEIN{FS=OFS=":"}{print ($1 + 1) % 7}': extract the offset number, then +1 (to 1-based) and get the position of each N per line (%7)
  4. sort -k 1,1n | uniq: filter the duplicated position, the result should be all the positions that N has at least one occurrence
  5. paste -s -d ',': concatenate the number as the parameters in the next step
  6. xargs -I {} awk '{if(/>.*/) {print} else {system("echo "$0" | cut --complement -c {}")}}' test.fa: final step, if a line starts with ">", print; else extract the characters located in the positions N has nevert occurred
ADD REPLY
0
Entering edit mode
5.0 years ago
Sishuo Wang ▴ 230

Maybe trimseq?

ADD COMMENT

Login before adding your answer.

Traffic: 2542 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6