Counting softmasked bases in a FASTA
2.3 years ago
Timotheus ▴ 40

Hello,

I've got a softmasked assembly (FASTA) that I will filter in various ways (e.g., remove/truncate contigs). How could I count the number of softmasked bases at the end? Thought toolkits like bioawk or seqkit could be helpful, but didn't find a solution.

Thanks in advance!

fasta • 1.1k views
2.3 years ago
grep -v '^>' in.fa | tr -d --complement 'atgc' | wc -c

Ah, sorry, I didn't see the "...at the end".

2.3 years ago

It would have been helpful if you had provided a sample, but I presume you are thinking of something like this:

>chr1
AAAAAaatttaccCCCtagatgaCCCCCCCGCTACTGGGGGGGGGGGGGGgggtaacatcaaat

And now you would like to count only the softmasked bases at the 3' end?

Your idea using seqkit was actually pretty good...

seqkit locate -p "[acgtn]+$" -r -P example.fasta

will return

seqID   patternName pattern strand  start   end matched
chr1    [acgtn]+$   [acgtn]+$   +   51  64  gggtaacatcaaat

which you can pipe to awk to count the length:

seqkit locate -p "[acgtn]+$" -r -P example.fasta | awk 'NR>1{print $1,$6-$5+1}'
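If seqkit is not at hand, the same trailing-run length can be sketched with awk alone. This is only a rough equivalent, assuming softmasked bases are limited to lowercase a, c, g, t and n; the file name example.fasta and the variable tail_counts are placeholders. Wrapped sequence lines are joined per record before matching.

```shell
# Placeholder input: the single-record example from this answer.
cat > example.fasta <<'EOF'
>chr1
AAAAAaatttaccCCCtagatgaCCCCCCCGCTACTGGGGGGGGGGGGGGgggtaacatcaaat
EOF

# Print "<seqID> <length of trailing lowercase run>" for every record.
tail_counts=$(LC_ALL=C awk '
  function report() {
    if (name != "") {
      # match() sets RLENGTH to the length of the match, or -1 if there is none
      match(seq, /[acgtn]+$/)
      print name, (RLENGTH > 0 ? RLENGTH : 0)
    }
  }
  /^>/ { report(); name = substr($1, 2); seq = ""; next }  # new record: flush the previous one
       { seq = seq $0 }                                    # join wrapped sequence lines
  END  { report() }                                        # flush the last record
' example.fasta)
echo "$tail_counts"
```

For the example record this prints chr1 14, which matches the 14-base match (positions 51 to 64) reported by seqkit.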

My apologies: by 'at the end' I meant after the different filtering steps. I added this because I know the proportion of originally masked bases from the RepeatMasker output. Sorry for not being clearer!
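For the total after filtering, a minimal sketch with awk, assuming that every lowercase letter on a sequence line counts as softmasked. The file name example.fasta and the variables masked and total are placeholders, so swap in the filtered assembly.

```shell
# Placeholder input: replace with the filtered assembly.
cat > example.fasta <<'EOF'
>chr1
AAAAAaatttaccCCCtagatgaCCCCCCCGCTACTGGGGGGGGGGGGGGgggtaacatcaaat
EOF

# gsub() returns the number of lowercase letters it replaced on each sequence line.
masked=$(LC_ALL=C awk '!/^>/ { n += gsub(/[a-z]/, "") } END { print n + 0 }' example.fasta)

# Total number of bases, for working out the masked proportion.
total=$(awk '!/^>/ { n += length($0) } END { print n + 0 }' example.fasta)

echo "softmasked: $masked of $total bases"
```

On the example record this counts 29 softmasked bases. LC_ALL=C is set so that the range [a-z] cannot match uppercase letters under locale-dependent collation.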
