I'm coding in c++ and reading in different reference genomes to examine regions across the chromosomes a few hundred base pairs at a time.
To do it in c++ I use the library htslib and the command faidx_fetch_seq()
which is similar to read specific regions using samtools faidx
.
The problem is when I'm reading in the reference genomes, there are of course multiple regions consisting only of NNNNN
(and so on). Since I only need the actual DNA sequence I wish to read in the entire file but removing the regions with NNNN
.
I have tried reading in the regions in c++ and then just creating an if-statement to check the actual bases consist of N. This works but is incredibly slow.
So my question is: Does anyone know how to only read in the actual DNA sequences, removing the NNN
regions? Either with samtools or with the c++ function faidx_fetch_seq()
.