Remove site with only missing data
1
0
Entering edit mode
3 days ago
Leane • 0

Hi! I have a multi-individual fasta file, and from this file I'm trying to remove all sites (i.e. nucleotides at the same position in the different fasta files) that contain only Ns. Do you know how I can do this? Thanks

Fasta • 317 views
ADD COMMENT
0
Entering edit mode

that's not clear to me. Give us an example.

ADD REPLY
0
Entering edit mode

For example, if I have 4 individuals in my files:

>seq1
AAATTTGNNNNNNCN
>seq2
AAAATTCNNNNNGCA
>seq3
AAATTAGNNNANGCA
>seq4
AAATTACNNNNNNCA

I want to remove positions with 'N' from all my individuals' sequences, and obtain something like this:

>seq1
AAATTTGNNCN
>seq2
AAAATTCNGCA
>seq3
AAATTAGAGCA
>seq4
AAATTACNNCA

And keep my alignment intact.

ADD REPLY
1
Entering edit mode
2 days ago

assuming two lines per fasta record.

linearize the fasta, print one base per sequence, get the places where there is a N, count the places where there is 4 'N'. create a pattern for cut

CUT=`cat input.fa | paste - - | cut -f2 | while read S; do echo "${S}" | fold -w1 | awk '($1=="N") {print NR}' ; done  | sort | uniq -c | awk '($1==4) {print $2;}' | paste -sd','`

cat input.fa | while read N; do echo $N; read S ; echo "${S}" | cut --complement -c "${CUT}" ; done

>seq1
AAATTTGNNCN
>seq2
AAAATTCNGCA
>seq3
AAATTAGAGCA
>seq4
AAATTACNNCA
ADD COMMENT
0
Entering edit mode

This script fails to detect the CUT object (which is therefore empty) and as a result, it cannot find the positions where all individuals have missing data (N). Sorry, I'm a beginner, but is it because I have long sequences that span more than one line? I've tried concatenating my sequences, and it still doesn't work. I also tried extracting the headers and sequences of the file using awk (with this command: awk 'BEGIN {RS=">"; ORS=""; FS="\n"} {if (NF>1) {print $2}}'), but with no result. Do you know how to fix my issue? Thanks in advance!

ADD REPLY
0
Entering edit mode

e I have long sequences that span more than one line?

yes, see :

Linearize a fasta sequence

awk -f linearizefasta.awk < input.fa

or

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa

Format back to fasta

tr "\t" "\n" < linearized.tsv

if you know your fasta header have a length < 60

tr "\t" "\n" < linearized.tsv | fold -w 60
view raw README.md hosted with ❤ by GitHub
/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;}
{printf("%s",$0);}
END {printf("\n");}
to convert a fasta to two-lines records

ADD REPLY

Login before adding your answer.

Traffic: 1429 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6