Question

Biostrings readDNAStringSet handling of N bases

3.6 years ago

rubic ▴ 270

Hi,

I'm using Biostrings readDNAStringSet to read the human genome FASTA file GRCh38.p13.genome.fa (search.genome.fn in the code below):

search.genome.set <- Biostrings::readDNAStringSet(search.genome.fn)

I then looked at the starts, ends, and widths of the canonical chromosomes (1-22, X, Y, and M):

> as.data.frame(search.genome.set@ranges)[1:25,]
       start       end     width names
1          1 248956422 248956422  chr1
2          1 242193529 242193529  chr2
3          1 198295559 198295559  chr3
4          1 190214555 190214555  chr4
5          1 181538259 181538259  chr5
6          1 170805979 170805979  chr6
7          1 159345973 159345973  chr7
8          1 145138636 145138636  chr8
9          1 138394717 138394717  chr9
10         1 133797422 133797422 chr10
11         1 135086622 135086622 chr11
12 135086623 268361931 133275309 chr12
13         1 114364328 114364328 chr13
14 114364329 221408046 107043718 chr14
15         1 101991189 101991189 chr15
16 101991190 192329534  90338345 chr16
17         1  83257441  83257441 chr17
18  83257442 163630726  80373285 chr18
19 163630727 222248342  58617616 chr19
20         1  64444167  64444167 chr20
21  64444168 111154150  46709983 chr21
22 111154151 161972618  50818468 chr22
23         1 156040895 156040895  chrX
24 156040896 213268310  57227415  chrY
25 213268311 213284879     16569  chrM

As you can see, several chromosomes, such as chr19, have starts that are not 1.

I don't understand how Biostrings determines these start coordinates that are not 1, since it is only reading in the FASTA sequences.
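One pattern in the table may be worth checking: each start that is not 1 appears to be exactly one past the end of the row before it. A minimal base R sketch of that arithmetic, using a few rows copied by hand from the output above (the data frame and the column name continues_previous are made up for illustration):

```r
# Start/end values copied from the ranges table above (chr11, chr12, chr17-chr19).
ranges <- data.frame(
  names = c("chr11", "chr12", "chr17", "chr18", "chr19"),
  start = c(1, 135086623, 1, 83257442, 163630727),
  end   = c(135086622, 268361931, 83257441, 163630726, 222248342)
)

# Width recomputed from start and end; it matches the width column in the table.
ranges$width <- ranges$end - ranges$start + 1

# TRUE when a row's start is exactly one past the previous row's end.
ranges$continues_previous <- c(FALSE, ranges$start[-1] == head(ranges$end, -1) + 1)

print(ranges)
```

If this holds for every row, the starts would look like running offsets across consecutive sequences rather than genomic coordinates. That reading is an assumption drawn only from the printed numbers, not something confirmed here about Biostrings internals.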

I thought this might be because these chromosomes start with N's:

> search.genome.set[19]
  A DNAStringSet instance of length 1
       width seq                                                                                                          names               
[1] 58617616 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN chr19

But chromosomes with start = 1 also start with Ns:

> search.genome.set[15]
  A DNAStringSet instance of length 1
        width seq                                                                                                         names               
[1] 101991189 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN chr15
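Rather than inspecting chromosomes one at a time, the leading-N hypothesis could be checked for all sequences at once. The sketch below uses a small made-up DNAStringSet (toy.set and its sequences are hypothetical stand-ins for the genome) and the public accessors names(), width(), and subseq():

```r
library(Biostrings)

# Toy stand-in for the genome; the sequences and names are made up.
toy.set <- DNAStringSet(c(chrA = "NNNNACGTNN",
                          chrB = "ACGTNNNNNN",
                          chrC = "NNACGTACGT"))

# First base of every sequence, compared against "N".
first.base <- as.character(subseq(toy.set, start = 1, width = 1))
starts.with.n <- unname(first.base == "N")

# Per-sequence summary built from accessors instead of the @ranges slot.
summary.df <- data.frame(names = names(toy.set),
                         width = width(toy.set),
                         starts.with.n = starts.with.n)
print(summary.df)
```

Applied to search.genome.set[1:25], the starts.with.n column could then be compared with the start column from the table: if sequences with start = 1 and sequences with other starts both begin with N, the leading Ns cannot be what determines the start.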

Any idea what this behavior is about?

Biostrings • 680 views
