Question

Sequence length of fasta sequences with special characters removed

3.1 years ago

genomes_and_MGEs ▴ 10

Hi everyone,

When I want to calculate the sequence length of fasta nucleotide files, I use

awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta

However, my new fasta file has special characters in the nucleotide sequence. Besides the nucleotides A, C, G, and T, my file has the special character 'X'. So, I would like to adapt my code to only count the nucleotides A, C, G, and T, or to exclude the special character 'X' from the count. Can someone help me out?

Thanks!

score 2 · Answer 1 · 2022-03-30

3.1 years ago

Pierre Lindenbaum 166k

In your awk command, delete the X characters before the line length is added, i.e. insert a gsub call: {gsub(/[X]/,"");l+=length($0)....

or

sed '/^[^>]/s/X//g' in.fasta | awk...
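
Spelled out in full, the first suggestion might look like the sketch below. It assumes the one-liner from the question as the starting point; the file demo.fasta and the function name fasta_lengths_no_x are made up for illustration and do not come from the thread. Only uppercase X is removed.

```shell
# Hypothetical example file (not from the thread): two records containing X.
printf '>seq1\nACGTX\nXXAC\n>seq2\nGGXTT\n' > demo.fasta

# Print each header followed by the length of its sequence, ignoring every X.
# gsub() deletes X from the current line before length() is added to the total.
fasta_lengths_no_x() {
    awk '/^>/ { if (l != "") print l; print; l = 0; next }
         { gsub(/X/, ""); l += length($0) }
         END { print l }' "$1"
}

fasta_lengths_no_x demo.fasta
# >seq1
# 6
# >seq2
# 4
```

The only change from the original command is the gsub(/X/, "") call in the second rule, so header lines are still printed untouched.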

Thanks a lot! Both work. If I have a multi-FASTA nucleotide file and want the total sequence length for the whole file, how can I avoid printing the length of each individual sequence?

3.1 years ago by genomes_and_MGEs ▴ 10

$ sed '/^[^>]/s/X//g' in.fasta | seqkit stats 
$ sed '/^[^>]/s/X//g' in.fasta | grep -v "^>" | tr -d '\n' | wc -c
$ bioawk -c fastx '{gsub("X","",$seq); sum+=length($seq)}END{print sum}' in.fasta

3.1 years ago by cpad0112 21k
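
If neither seqkit nor bioawk is installed, the whole-file total can probably be had from plain awk as well. The sketch below is one way to do it, not something posted in the thread: the file demo_total.fasta and both function names are hypothetical. The second function covers the other option raised in the question, counting only A, C, G and T, by relying on gsub() returning the number of substitutions it made.

```shell
# Hypothetical example file (not from the thread); the N is there only to
# show where the two approaches differ.
printf '>seq1\nACGTX\nXXAC\n>seq2\nGGXTTN\n' > demo_total.fasta

# Total length of all sequence lines with every X removed.
total_len_no_x() {
    awk '!/^>/ { gsub(/X/, ""); sum += length($0) }
         END { print sum + 0 }' "$1"
}

# Total count of A, C, G and T only: gsub() returns how many matches it replaced.
total_len_acgt() {
    awk '!/^>/ { sum += gsub(/[ACGT]/, "") }
         END { print sum + 0 }' "$1"
}

total_len_no_x demo_total.fasta   # prints 11 (the N is still counted)
total_len_acgt demo_total.fasta   # prints 10
```

On a file that really contains only A, C, G, T and X, both functions give the same number. The "sum + 0" makes awk print 0 rather than an empty line when the file has no sequence lines.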