As a general example, you could get a GFF file of annotations via GENCODE, and convert GFF to BED via BEDOPS gff2bed. If you're working with a recent build of the human genome, for instance:
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
| gunzip --stdout - \
| gff2bed - \
> annotations.bed
BEDOPS bedops and bedmap facilitate making associations between BED-formatted reads and annotations. You could do bedops -e 1, for example, to get reads that overlap annotations. You could do further filtering with awk or grep to get subsets of these annotations, such as exons or genes.
To do sequence lookups, you could use samtools faidx against a directory of indexed FASTA files for your genome of interest (for instance, UCSC makes FASTA files available to download for human and other builds).
An example Perl script could parse a BED file containing reads-of-interest into sequence data, via indexed FASTA files:
#!/usr/bin/env perl
use strict;
use warnings;
# contains per-chromosome UCSC FASTA (.fa) files and samtools-indexed index (.fai) files
my $fastaDir = "/foo/bar/baz";
while (<STDIN>) {
chomp;
my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
my $queryKey = "$chr:$start-$stop";
my $queryResult = `samtools faidx $fastaDir/$chr.fa $queryKey`; chomp $queryResult;
my @lines = split("\n", $queryResult);
my @seqs = @lines[1..(scalar @lines - 1)];
my $seq = join("", @seqs);
if ($strand eq "-") { $seq = revdnacomp($seq); }
my $header = ">".join(":",($chr, $start, $stop, $id, $score, $strand));
print STDOUT $header."\n".uc($seq)."\n";
}
sub revdnacomp {
my $dna = shift @_;
my $revcomp = reverse($dna);
$revcomp =~ tr/ACGTacgt/TGCAtgca/;
return $revcomp;
}
Hopefully, these are useful starting points.
Unfortunately, it seems that a bed file produced by gff2bed is not how it should look like. I've compared bed fields from such files with those provided by the mentioned simulator's examples, and they differ quite a lot so I can't use the simulator on gff2bed files.
This is how simulator expects them to be: http://www.deviantpics.com/STx
This is how gtf2bed produces them: http://www.deviantpics.com/STr
Hi,
Can you write a simple 'awk' script to convert the resultant files into the format you want?
Probably, If I knew what fields correspond to which in these two file formats. I am not very familiar with variations between bed files.
file you want is bed12 format,
gtf2bed has produced bed file, without prefix "chr"
you can add it
you can check UCSC bed12 format, I am sure you can convert it
What I've noticed is that the simulator returns an error trying to convert field number 10 to exon length. I don't see any such corresponding field in this bed3 file.
The documentation for gff2bed shows a table of how GFF fields are mapped to BED columns/indices.