Entering edit mode
8.0 years ago
francois
▴
80
In R Studio: I want to extract all sequences from a chromosome within -2000 and +500 of the TSS.
I have already computed the start/end positions of all the sequences I want (let's only consider the plus strand for now):
promoters_plus <- data.frame (chr = 'chr21', prom_start = plus$start - 2000, prom_end = plus$start + 500, strand = '+', id = plus$id)
It am able to extract specific sequences manually like this (for example with the 5th row):
start = promoters_plus$prom_start [5]
stop = promoters_plus$prom_end [5]
subseq (chr21, start, stop)
But I do not know how to correctly loop to get all of them (ideally in a dataframe)
I have tried naively doing this:
for (i in 1:nrow (promoters_plus)) {
start = promoters_plus$prom_start[i]
stop = promoters_plus$prom_end[i]
seq[i] = subseq (chr21, start, stop)
}
But it returns this error:
Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
subscript contains NAs or out-of-bounds indices
Can you help?
What is chr21 ? Is it a GRanges object ?
No, a DNAString object (the sequence of chromsome 21)
This would be so easy using
bedtools getfasta
, so if you don't absolutely want to use R for it...Hi,
I'm facing a similar issue. Were you able to resolve the error?
You could use
samtools faidx
-indexed FASTA files with a BED-to-FASTA script. For an example, see: A: Searching promoter consensus in a gene