Sequence length of fasta sequences with special characters removed
2.7 years ago

Hi everyone,

When I want to calculate the sequence length of fasta nucleotide files, I use

awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta

However, my new FASTA file contains a special character in the nucleotide sequences: besides A, C, G, and T, it has 'X'. I would like to adapt my command so that it counts only A, C, G, and T, or equivalently excludes 'X' from the count. Can someone help me out?

Thanks!

2.7 years ago

In your awk command, insert a gsub call before the length is added: {gsub(/X/,""); l+=length($0)}

or

sed '/^[^>]/s/X//g' in.fasta | awk...
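
Spelled out in full, the first suggestion could look like the sketch below. It is the question's awk command with the gsub call added, run on a small made-up FASTA file (the file and its records are hypothetical, for illustration only).

```shell
# Hypothetical example file; the records are made up for illustration.
fasta=$(mktemp)
printf '>seq1\nACGTXX\nXACG\n>seq2\nXXTT\n' > "$fasta"

# Per-record length: header lines are printed as-is, and on sequence lines
# every X is deleted before the line length is added to the running total.
per_record=$(awk '/^>/ {if (l != "") print l; print; l = 0; next}
     {gsub(/X/, ""); l += length($0)}
     END {print l}' "$fasta")

echo "$per_record"
rm -f "$fasta"
```

Each header is followed by the length of its sequence with the X characters excluded. Because the gsub call sits in the block that only runs for non-header lines, an X in a sequence name is left untouched.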

Thanks a lot! Both work. If I have a multi-FASTA nucleotide file and want the total sequence length for the whole file, how can I avoid printing the length of each individual sequence?

$ sed '/^[^>]/s/X//g' in.fasta | seqkit stats 
$ sed '/^[^>]/s/X//g' in.fasta | grep -v "^>" | tr -d '\n' | wc -c
$ bioawk -c fastx '{gsub("X","",$seq); sum+=length($seq)}END{print sum}' in.fasta
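
If neither seqkit nor bioawk is installed, a plain awk variant may also do the job. The sketch below is an assumption-laden alternative, not one of the commands above: it keeps only A, C, G, and T (upper or lower case) on sequence lines, so any other character, including X or a stray carriage return, is left out of the count. The example file is made up for illustration.

```shell
# Hypothetical example file; the records are made up for illustration.
fasta=$(mktemp)
printf '>seq1\nACGTXX\nXACG\n>seq2\nXXTT\n' > "$fasta"

# Total length over the whole file: skip header lines, delete everything
# that is not A, C, G, or T, and sum what remains.
# "sum + 0" makes awk print 0 instead of an empty line for an empty file.
total=$(awk '!/^>/ {gsub(/[^ACGTacgt]/, ""); sum += length($0)}
     END {print sum + 0}' "$fasta")

echo "$total"
rm -f "$fasta"
```

Listing the allowed characters instead of the excluded one means the command does not need to change if another placeholder character turns up later. If ambiguity codes such as N should be counted, they would have to be added to the bracket expression.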