Htslib c++ library faidx fetch region

1

Entering edit mode

4.1 years ago

rah ▴ 30

I'm coding in c++ and reading in different reference genomes to examine regions across the chromosomes a few hundred base pairs at a time.

To do it in c++ I use the library htslib and the command faidx_fetch_seq() which is similar to read specific regions using samtools faidx.

The problem is when I'm reading in the reference genomes, there are of course multiple regions consisting only of NNNNN (and so on). Since I only need the actual DNA sequence I wish to read in the entire file but removing the regions with NNNN.

I have tried reading in the regions in c++ and then just creating an if-statement to check the actual bases consist of N. This works but is incredibly slow.

So my question is: Does anyone know how to only read in the actual DNA sequences, removing the NNN regions? Either with samtools or with the c++ function faidx_fetch_seq().

samtools cpp faidx htslib • 873 views

ADD COMMENT • link updated 18 months ago by Ram 44k • written 4.1 years ago by rah ▴ 30

Login before adding your answer.