So I have a gff that has information about MEF2 transcription factor binding sites. So given a start and end site, 19815641 - 19815654 on the + strand on ChrX, where exactly do I get the sequence from?
I have 1800 lines in the gff file, so I cant do it manually. I am looking for a R solution, so basically if something like the following function exists
getSequence(start, end, strand, chr)
The goal is to create a PWM so my next question is that once I've retrieved my sequences, how do I go about aligning them? what is the best software to align 1800 short sequences?
Edit: It seems like the
bedtools getfasta -fi reference.fasta -bed gff.file -fo output.fasta
is what I need, but whats the easiest way to download hg18 reference genome?
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/
You can use
samtools faidx
to retrieve sequences from reference genome using coordinates.