Question

Creating a fasta file

0

Entering edit mode

9.3 years ago

azadtanzinache • 0

I will have to make a fasta file, my assignment is to plot the GC content of chr20 found from UCSC genome browser. I have made a GRange object, but I do not know how to proceed to get the fasta file. My GRange object looks like -

GRanges object with 524 ranges and 0 metadata columns:
        seqnames               ranges strand
           <Rle>            <IRanges>  <Rle>
     33    chr20 [61641628, 61642150]      *
    107    chr20 [44600801, 44601199]      *
    183    chr20 [48292184, 48292583]      *
    387    chr20 [19925740, 19926166]      *
    432    chr20 [17806836, 17807156]      *
    ...      ...                  ...    ...
  23289    chr20 [17940061, 17940581]      *
  23314    chr20 [24819765, 24820285]      *
  23335    chr20 [31559864, 31560384]      *
  23404    chr20 [ 5584130,  5584303]      *
  23448    chr20 [49574803, 49575323]      *
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths.

Can anyone help me please? It would be a big help.

fasta UCSC-Genome-browser • 2.8k views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 9.3 years ago by azadtanzinache • 0

score 0 · Answer 1 · 2016-03-28

You could export your GRanges object to a BED file. See this Biostars question and related answers: How To Write Data In A Granges Object To A Bed File.

Assuming you are working with human genome build hg19, you could build a set of indexed FASTA files with samtools:

$ cd /foo/bar/baz
$ wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/*.fa.gz
$ for fn in `ls *.fa.gz`; do gunzip $fn; done
$ for fn in `ls *.fa`; do samtools faidx $fn; done

You could use a helper script like the following Perl script to query the indexed FASTA files you made in step 2, against the BED file you exported in step 1:

#!/usr/bin/env perl

use strict;
use warnings;

# fastaDir contains per-chromosome UCSC FASTA (.fa) files and samtools-indexed index (.fai) files
my $fastaDir = "/foo/bar/baz"; 

while (<STDIN>) {
    chomp;
    my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
    my $queryKey = "$chr:$start-$stop";
    my $queryResult = `samtools faidx $fastaDir/$chr.fa $queryKey`; chomp $queryResult;
    my @lines = split("\n", $queryResult);
    my @seqs = @lines[1..(scalar @lines - 1)];
    my $seq = join("", @seqs);
    if ($strand eq "-") { $seq = revdnacomp($seq); }
    my $header = ">".join(":",($chr, $start, $stop, $id, $score, $strand));
    print STDOUT $header."\n".uc($seq)."\n";
}

sub revdnacomp {
    my $dna = shift @_;
    my $revcomp = reverse($dna);
    $revcomp =~ tr/ACGTacgt/TGCAtgca/;
    return $revcomp;
}

This converts six-column BED to FASTA, but it should work with the three-column BED data coming out of your GRange object.

You might run this something like so:

$ bed2fasta.pl < granges.bed > granges.fa

The file granges.fa is the FASTA file for genomic intervals defined in your GRanges object. You can then use Perl or Python (or any text parser) to analyze GC content.