3.2 years ago
rubic
Hi,
I'm using Biostrings readDNAStringSet to read the human genome FASTA file GRCh38.p13.genome.fa (search.genome.fn in the code below).
search.genome.set <- Biostrings::readDNAStringSet(search.genome.fn)
Then I looked at the starts, ends, and widths of the canonical chromosomes (1-22, X, Y, and M):
> as.data.frame(search.genome.set@ranges)[1:25,]
start end width names
1 1 248956422 248956422 chr1
2 1 242193529 242193529 chr2
3 1 198295559 198295559 chr3
4 1 190214555 190214555 chr4
5 1 181538259 181538259 chr5
6 1 170805979 170805979 chr6
7 1 159345973 159345973 chr7
8 1 145138636 145138636 chr8
9 1 138394717 138394717 chr9
10 1 133797422 133797422 chr10
11 1 135086622 135086622 chr11
12 135086623 268361931 133275309 chr12
13 1 114364328 114364328 chr13
14 114364329 221408046 107043718 chr14
15 1 101991189 101991189 chr15
16 101991190 192329534 90338345 chr16
17 1 83257441 83257441 chr17
18 83257442 163630726 80373285 chr18
19 163630727 222248342 58617616 chr19
20 1 64444167 64444167 chr20
21 64444168 111154150 46709983 chr21
22 111154151 161972618 50818468 chr22
23 1 156040895 156040895 chrX
24 156040896 213268310 57227415 chrY
25 213268311 213284879 16569 chrM
As you can see several chromosomes have starts that are not 1, such as chr19.
I don't understand how Biostrings determines these start coordinates that are not 1, since it is only reading in the FASTA sequences.
I thought this might be because these chromosomes start with N's:
> search.genome.set[19]
A DNAStringSet instance of length 1
width seq names
[1] 58617616 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN chr19
But chromosomes with start = 1 also start with Ns:
> search.genome.set[15]
A DNAStringSet instance of length 1
width seq names
[1] 101991189 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN chr15
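For comparison, here is a minimal sketch on a toy DNAStringSet that puts the public accessors width() and names() next to the internal @ranges slot inspected above. The sequence names chrA, chrB, and chrC and their contents are made up for illustration, and I have not checked whether the starts reported for such a small set differ from 1.

```r
library(Biostrings)

# Toy stand-in for the genome; names and sequences are hypothetical.
toy.set <- DNAStringSet(c(chrA = "NNACGT", chrB = "NNGG", chrC = "ACGTACGT"))

# Public accessors: per-sequence names and lengths.
toy.lengths <- data.frame(names = names(toy.set), width = width(toy.set))
print(toy.lengths)

# Internal slot, accessed the same way as search.genome.set@ranges above.
print(as.data.frame(toy.set@ranges))
```

The width column from width() should correspond to the width column of the @ranges table, so the sequence lengths themselves can be read without going through the slot.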
Any idea what this behavior is about?