Remove site with only missing data

0

Entering edit mode

3 days ago

Leane • 0

Hi! I have a multi-individual fasta file, and from this file I'm trying to remove all sites (i.e. nucleotides at the same position in the different fasta files) that contain only Ns. Do you know how I can do this? Thanks

Fasta • 317 views

ADD COMMENT • link updated 2 days ago by Pierre Lindenbaum 165k • written 3 days ago by Leane • 0

0

Entering edit mode

that's not clear to me. Give us an example.

ADD REPLY • link 3 days ago by Pierre Lindenbaum 165k

0

Entering edit mode

For example, if I have 4 individuals in my files:

>seq1
AAATTTGNNNNNNCN
>seq2
AAAATTCNNNNNGCA
>seq3
AAATTAGNNNANGCA
>seq4
AAATTACNNNNNNCA

I want to remove positions with 'N' from all my individuals' sequences, and obtain something like this:

>seq1
AAATTTGNNCN
>seq2
AAAATTCNGCA
>seq3
AAATTAGAGCA
>seq4
AAATTACNNCA

And keep my alignment intact.

ADD REPLY • link updated 2 days ago by Pierre Lindenbaum 165k • written 2 days ago by Leane • 0

1

Entering edit mode

2 days ago

Pierre Lindenbaum 165k

assuming two lines per fasta record.

linearize the fasta, print one base per sequence, get the places where there is a N, count the places where there is 4 'N'. create a pattern for cut

CUT=`cat input.fa | paste - - | cut -f2 | while read S; do echo "${S}" | fold -w1 | awk '($1=="N") {print NR}' ; done  | sort | uniq -c | awk '($1==4) {print $2;}' | paste -sd','`

cat input.fa | while read N; do echo $N; read S ; echo "${S}" | cut --complement -c "${CUT}" ; done

>seq1
AAATTTGNNCN
>seq2
AAAATTCNGCA
>seq3
AAATTAGAGCA
>seq4
AAATTACNNCA

ADD COMMENT • link 2 days ago by Pierre Lindenbaum 165k

0

Entering edit mode

This script fails to detect the CUT object (which is therefore empty) and as a result, it cannot find the positions where all individuals have missing data (N). Sorry, I'm a beginner, but is it because I have long sequences that span more than one line? I've tried concatenating my sequences, and it still doesn't work. I also tried extracting the headers and sequences of the file using awk (with this command: awk 'BEGIN {RS=">"; ORS=""; FS="\n"} {if (NF>1) {print $2}}'), but with no result. Do you know how to fix my issue? Thanks in advance!

ADD REPLY • link 2 days ago by Leane • 0

0

Entering edit mode

e I have long sequences that span more than one line?

yes, see :

Linearize a fasta sequence

awk -f linearizefasta.awk < input.fa

or

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa

Format back to fasta

tr "\t" "\n" < linearized.tsv

if you know your fasta header have a length < 60

tr "\t" "\n" < linearized.tsv | fold -w 60

view raw README.md hosted with ❤ by GitHub

	/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;}
	{printf("%s",$0);}
	END {printf("\n");}

view raw linearizefasta.awk hosted with ❤ by GitHub

to convert a fasta to two-lines records

ADD REPLY • link 2 days ago by Pierre Lindenbaum 165k

Login before adding your answer.