Question

Remove Gaps from multiple Alignment

0

Entering edit mode

7.2 years ago

misbahabas ▴ 70

Asslam u Alikum

I want to remove gaps from multiple sequence alignment , i tried many tools and scripts but it works in different ways.

H= -----ATGCGTACG-TGCA--C
M= ATGCCCGATCGCH----ATC

I want to remove gaps according to reference like if gaps is present in first line (reference) remove that column like this

H= ATGCGTACGTGC
M= CCCGATCGC----C

but different tools remove all gaps columns from whole file. please tell me any idea about it

alignment command sequence • 5.4k views

ADD COMMENT • link updated 7.2 years ago by Joe 21k • written 7.2 years ago by misbahabas ▴ 70

0

Entering edit mode

please check your alignment, your expected output doesn't make any sense. 'M=' should end with 'T'

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

sorry it just a example both sequence have same lengths sorry again

>1
------------mfelaeySGLL---TLFL-IASFPIFT-SPIG---

>2
------------mfelsgyAVLLFFMVIFL-VASFPLLS-SPIG---

>3
--------MKLF-------TFLLFFL-LLC-LSMIPLLS-SPVS---

i want to remove gaps column according to >1 sequence

ADD REPLY • link updated 7.2 years ago by Pierre Lindenbaum 164k • written 7.2 years ago by misbahabas ▴ 70

0

Entering edit mode

and what is 'H' here ??? what is the expected output ?

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

input

HUMAN= --ATGC-C

MOUSE= ATGC--T-

I just a example input, Now i want output like this

HUMAN= ATGCC

MOUSE= GC---

I want to remove gap columns according to reference not between all the file, here human is reference

ADD REPLY • link updated 7.2 years ago by Pierre Lindenbaum 164k • written 7.2 years ago by misbahabas ▴ 70

score 0 · Answer 1 · 2017-09-28

Easiest way to proceed, to my mind is to do the following (fair warning though, this is dirty):

Split the multifasta up first:

#!/bin/bash
i=1;
while read line ; do
  if [ ${line:0:1} == ">" ] ; then
    header="$line"
    echo "$header" >> seq"${i}".fasta
  else
    seq="$line"
    echo "$seq" >> seq"${i}".fasta
    ((i++))
  fi
done < myalignment.fasta

Assuming your human reference sequence is always the first one, this script will generate a file for each, e.g. seq1.fasta, seq2.fasta, seq3.fasta etc.

This means your reference should always be the first seq1.fasta.

Then to remove the gaps, you can simply do sed -i 's/-//g' seq1.fasta.

Then concatenate them all back together if you need a multifasta again:

cat seq*.fasta > edited_multifasta.fasta