How to retrieve sequences given a start and end site
10.0 years ago
Affan ▴ 310

So I have a gff that has information about MEF2 transcription factor binding sites. So given a start and end site, 19815641 - 19815654 on the + strand on ChrX, where exactly do I get the sequence from?

I have 1800 lines in the gff file, so I can't do it manually. I am looking for an R solution, so basically something like the following function, if it exists:

getSequence(start, end, strand, chr)

The goal is to create a PWM, so my next question is: once I've retrieved my sequences, how do I go about aligning them? What is the best software to align 1800 short sequences?

Edit: It seems like the

bedtools getfasta -fi reference.fasta -bed gff.file -fo output.fasta

is what I need, but what's the easiest way to download the hg18 reference genome?

alignment sequence • 2.2k views

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/

You can use samtools faidx to retrieve sequences from a reference genome using coordinates. From the samtools manual:

samtools faidx <ref.fasta> [region1 [...]]

Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format. The input file can be compressed in the RAZF format.
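
To make the coordinate handling concrete, here is a minimal Python sketch (not R, and not part of samtools or bedtools). It assumes a standard tab-separated GFF line with start and end in columns 4 and 5 and strand in column 7, and that the chromosome is named chrX as in the UCSC files. The helper names gff_to_region and get_sequence, the example GFF line, and the toy genome are all hypothetical and only for illustration.

```python
# Hypothetical helpers; the names are illustrative, not a samtools/bedtools API.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def gff_to_region(line):
    """Turn one tab-separated GFF line into ('chr:start-end', strand)."""
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]
    start, end = int(fields[3]), int(fields[4])  # GFF is 1-based, inclusive
    strand = fields[6]
    return f"{chrom}:{start}-{end}", strand


def get_sequence(genome, chrom, start, end, strand="+"):
    """Slice a 1-based inclusive interval from a dict of chromosome strings."""
    seq = genome[chrom][start - 1:end]  # convert to Python's 0-based slice
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    return seq


# Made-up GFF line using the coordinates from the question.
gff_line = "chrX\tsource\tMEF2\t19815641\t19815654\t.\t+\t.\tid=site1"
region, strand = gff_to_region(gff_line)
print(region, strand)

# Toy genome so the slicing can be checked by eye.
toy = {"chrT": "AACCGGTTAC"}
print(get_sequence(toy, "chrT", 2, 5))
print(get_sequence(toy, "chrT", 2, 5, "-"))
```

The region string has the chr:start-end form that samtools faidx accepts, e.g. samtools faidx ref.fasta chrX:19815641-19815654. By default faidx prints the forward-strand sequence, so sites on the - strand would still need to be reverse-complemented, as the second helper does.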
