I have the refGene regions of the Rattus norvegicus chromosomes, given as start and end indexes.
I also have the chromosomes of interest as FASTA files.
Example:
> chr11
aaactaatcgtcttggcaccaaaacaaagagaatgaaagcacacaaacat
aacctcacatccaaatatgaatataaagggaaacaataatcactattcct
caatcctaaatatctatgccccaaatacaagggcacctacatacgtaaaa
What I want to do is extract the refGene regions from the chromosome files.
The way I do it now is simply to load the chromosome into a string in Python like this:
chrom_string = ''
for line in input_file:
    chrom_string += line.rstrip()
Then I pick out the regions of interest by substring indexing:
chrom_string[current_start:current_end]
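For reference, here is a minimal sketch of the same idea that skips the `>` header line, so the header text cannot shift the coordinates. It assumes the start and end values are 0-based and half-open, which is how UCSC tables such as refGene store txStart/txEnd. If your coordinates are 1-based, subtract 1 from the start. The names `read_fasta` and `extract_region` are made up for illustration and do not come from any library.

```python
import io

def read_fasta(handle):
    """Return {sequence name: sequence} from an open FASTA file handle."""
    sequences = {}
    name = None
    parts = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            # Store the previous record before starting a new one.
            if name is not None:
                sequences[name] = ''.join(parts)
            # Use the first word after '>' as the name ("> chr11" -> "chr11").
            fields = line[1:].split()
            name = fields[0] if fields else ''
            parts = []
        else:
            parts.append(line)
    if name is not None:
        sequences[name] = ''.join(parts)
    return sequences

def extract_region(sequences, chrom, start, end):
    """Slice one region; start/end are assumed 0-based, half-open."""
    return sequences[chrom][start:end]

# Tiny in-memory example instead of a real chromosome file.
example = io.StringIO("> chr11\naaactaatcg\ntcttggcacc\n")
genome = read_fasta(example)
print(extract_region(genome, 'chr11', 8, 12))  # prints "cgtc"
```

Collecting the lines in a list and joining once also avoids repeated string concatenation, which can get slow for a whole chromosome.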
The problem is that, doing it this way, I get plenty of FASTA records like
>chr1+758657
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNgagagagagagagagagagagagagag`
I guess it is unlikely that a known refGene region would contain N's, so I must be doing something wrong.
Is there a library that does what I want to do?
I wouldn't be too surprised if you were using a masked genome. If so, you could try the rn5.2bit version available from UCSC.
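One rough way to check this is to count the bases in a few of the extracted regions. This is only a sketch, and the helper name `masking_summary` is made up. It relies on the usual convention that hard-masked sequence appears as N and soft-masked sequence as lower case. Keep in mind that runs of N can also be assembly gaps rather than masking, so a high N fraction alone does not prove the genome is hard-masked.

```python
def masking_summary(seq):
    """Report how much of a sequence is N and how much is lower case."""
    total = len(seq)
    if total == 0:
        return {'length': 0, 'n_fraction': 0.0, 'lower_fraction': 0.0}
    # N (either case) marks hard-masked or unknown bases.
    n_count = sum(1 for base in seq if base in 'Nn')
    # Lower-case a/c/g/t is the usual soft-masking convention.
    lower_count = sum(1 for base in seq if base.islower() and base != 'n')
    return {
        'length': total,
        'n_fraction': n_count / total,
        'lower_fraction': lower_count / total,
    }

summary = masking_summary('NNNNgaga')
print(summary)
```

If most regions come back with a large `n_fraction`, the input file is a likely suspect. If only a few do, those regions probably overlap gaps in the assembly.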