I'm following the pointers in this post, regarding the use of Rsamtools and the indexFa()
function to index a large fasta file of around 7GB, about 2000 nt sequences. However only the first ~87 are indexed, I see no reason why it would stop at this point unless it reaches a limitation.
What's more, the file the function is operating on, indexFa("seqs.fa")
, becomes truncated to these first 87 entries also (I don't know how the function works, but that was quite unexpected). Any solution to this or other methods of indexing a large fasta file to call in sequences by name to avoid trying to read in the whole fasta file with read.fasta()
in one go (which gets Killed prematurely too).
Thanks again,
library(Rsamtools)
indexFa("seqs.fa") # I think this is stumbling point
fa = FaFile("seqs.fa")
gr = as(seqinfo(fa), "GRanges")
getSeq(fa, gr[2:1])
idx = pmatch("296783888", names(gr))
seq = getSeq(fa, gr[idx])
Thanks Antonio, on 2. would "N"'s qualify as non-nucleic acid? What's the standard approach to deal with N: remove or mutate?
"N" (ambiguous base call) is perfectly acceptable in FASTA DNA sequences.
Okay that's not the reason then. Nor is variable length.