I will have to make a fasta file, my assignment is to plot the GC content of chr20 found from UCSC genome browser. I have made a GRange object, but I do not know how to proceed to get the fasta file. My GRange object looks like -
GRanges object with 524 ranges and 0 metadata columns:
seqnames ranges strand
<Rle><IRanges><Rle>33 chr20 [61641628,61642150]*107 chr20 [44600801,44601199]*183 chr20 [48292184,48292583]*387 chr20 [19925740,19926166]*432 chr20 [17806836,17807156]*............23289 chr20 [17940061,17940581]*23314 chr20 [24819765,24820285]*23335 chr20 [31559864,31560384]*23404 chr20 [5584130,5584303]*23448 chr20 [49574803,49575323]*-------
seqinfo:23 sequences from an unspecified genome; no seqlengths.
Can anyone help me please? It would be a big help.
Assuming you are working with human genome build hg19, you could build a set of indexed FASTA files with samtools:
$ cd /foo/bar/baz
$ wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/*.fa.gz
$ for fn in`ls *.fa.gz`;do gunzip $fn;done
$ for fn in`ls *.fa`;do samtools faidx $fn;done
You could use a helper script like the following Perl script to query the indexed FASTA files you made in step 2, against the BED file you exported in step 1:
#!/usr/bin/env perl
use strict;
use warnings;# fastaDir contains per-chromosome UCSC FASTA (.fa) files and samtools-indexed index (.fai) files
my $fastaDir="/foo/bar/baz";while(<STDIN>){
chomp;
my ($chr, $start, $stop, $id, $score, $strand)= split("\t", $_);
my $queryKey="$chr:$start-$stop";
my $queryResult=`samtools faidx $fastaDir/$chr.fa $queryKey`; chomp $queryResult;
my @lines = split("\n", $queryResult);
my @seqs = @lines[1..(scalar @lines - 1)];
my $seq= join("", @seqs);if($strand eq "-"){$seq= revdnacomp($seq);}
my $header=">".join(":",($chr, $start, $stop, $id, $score, $strand));
print STDOUT $header."\n".uc($seq)."\n";}
sub revdnacomp {
my $dna=shift @_;
my $revcomp= reverse($dna);$revcomp=~ tr/ACGTacgt/TGCAtgca/;return$revcomp;}
This converts six-column BED to FASTA, but it should work with the three-column BED data coming out of your GRange object.
You might run this something like so:
$ bed2fasta.pl < granges.bed > granges.fa
The file granges.fa is the FASTA file for genomic intervals defined in your GRanges object. You can then use Perl or Python (or any text parser) to analyze GC content.