Biostrings readDNAStringSet handling of N bases
3.2 years ago
rubic ▴ 270

Hi,

I'm using Biostrings::readDNAStringSet to read the human genome FASTA file GRCh38.p13.genome.fa (search.genome.fn in the code below):

search.genome.set <- Biostrings::readDNAStringSet(search.genome.fn)

Then looking at the starts, ends, and widths of the canonical chromosomes (1-22, X, Y, and M):

> as.data.frame(search.genome.set@ranges)[1:25,]
       start       end     width names
1          1 248956422 248956422  chr1
2          1 242193529 242193529  chr2
3          1 198295559 198295559  chr3
4          1 190214555 190214555  chr4
5          1 181538259 181538259  chr5
6          1 170805979 170805979  chr6
7          1 159345973 159345973  chr7
8          1 145138636 145138636  chr8
9          1 138394717 138394717  chr9
10         1 133797422 133797422 chr10
11         1 135086622 135086622 chr11
12 135086623 268361931 133275309 chr12
13         1 114364328 114364328 chr13
14 114364329 221408046 107043718 chr14
15         1 101991189 101991189 chr15
16 101991190 192329534  90338345 chr16
17         1  83257441  83257441 chr17
18  83257442 163630726  80373285 chr18
19 163630727 222248342  58617616 chr19
20         1  64444167  64444167 chr20
21  64444168 111154150  46709983 chr21
22 111154151 161972618  50818468 chr22
23         1 156040895 156040895  chrX
24 156040896 213268310  57227415  chrY
25 213268311 213284879     16569  chrM

As you can see, several chromosomes (chr19, for example) have starts that are not 1.

I don't understand how Biostrings determines these non-1 start coordinates, since it is only reading in the FASTA sequences.
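One thing I checked in the table above: every start that is not 1 equals the previous row's end plus 1, and end - start + 1 always matches the reported width. Here is a minimal base-R sketch of that check, using rows 11 to 19 copied by hand from the output (the column names continues.previous and width.check are my own). This pattern looks like offsets into a shared block rather than genomic coordinates, but that interpretation is my assumption, not something the output confirms.

```r
# Rows 11-19 copied by hand from the as.data.frame(search.genome.set@ranges) output.
ranges.df <- data.frame(
  names = paste0("chr", 11:19),
  start = c(1, 135086623, 1, 114364329, 1, 101991190, 1, 83257442, 163630727),
  end   = c(135086622, 268361931, 114364328, 221408046, 101991189,
            192329534, 83257441, 163630726, 222248342)
)

# End of the previous row (NA for the first row, which has no predecessor).
prev.end <- c(NA, head(ranges.df$end, -1))

# TRUE when a row's start directly follows the previous row's end.
ranges.df$continues.previous <- !is.na(prev.end) & ranges.df$start == prev.end + 1

# Width recomputed from start and end, to compare with the reported width column.
ranges.df$width.check <- ranges.df$end - ranges.df$start + 1

print(ranges.df)
```

In this subset, every row with a start other than 1 has continues.previous set to TRUE.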

I thought this might be because these chromosomes start with N's:

> search.genome.set[19]
  A DNAStringSet instance of length 1
       width seq                                                                                                          names               
[1] 58617616 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN chr19

But chromosomes with start = 1 also start with Ns:

> search.genome.set[15]
  A DNAStringSet instance of length 1
        width seq                                                                                                         names               
[1] 101991189 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN chr15
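To test the leading-N idea separately from the real genome, here is a small sketch on toy sequences. The names toy.set, chrA and chrB are hypothetical and not part of my data. It reads lengths with the exported width() accessor rather than the @ranges slot, and counts leading Ns with a base-R regular expression. Converting to character is only reasonable for toy data; I have not tried it on whole chromosomes.

```r
library(Biostrings)

# Toy sequences (hypothetical): one starts with Ns, the other does not.
toy.set <- DNAStringSet(c(chrA = "NNNNACGTAC", chrB = "ACGTNNACGT"))

# Sequence lengths via the exported accessor instead of the @ranges slot.
toy.widths <- width(toy.set)

# Number of leading Ns per sequence: length of the match of "^N*" (0 if none).
toy.leading.n <- attr(regexpr("^N*", as.character(toy.set)), "match.length")

toy.summary <- data.frame(
  names     = names(toy.set),
  width     = toy.widths,
  leading.n = toy.leading.n
)
print(toy.summary)
```

If width() is the same for both toy sequences regardless of their leading Ns, that would suggest the Ns are not what drives the start values, which matches what I see for chr15 and chr19.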

Any idea what this behavior is about?
