Creating a fasta file
1
0
Entering edit mode
8.8 years ago

I will have to make a fasta file, my assignment is to plot the GC content of chr20 found from UCSC genome browser. I have made a GRange object, but I do not know how to proceed to get the fasta file. My GRange object looks like -

GRanges object with 524 ranges and 0 metadata columns:
        seqnames               ranges strand
           <Rle>            <IRanges>  <Rle>
     33    chr20 [61641628, 61642150]      *
    107    chr20 [44600801, 44601199]      *
    183    chr20 [48292184, 48292583]      *
    387    chr20 [19925740, 19926166]      *
    432    chr20 [17806836, 17807156]      *
    ...      ...                  ...    ...
  23289    chr20 [17940061, 17940581]      *
  23314    chr20 [24819765, 24820285]      *
  23335    chr20 [31559864, 31560384]      *
  23404    chr20 [ 5584130,  5584303]      *
  23448    chr20 [49574803, 49575323]      *
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths.

Can anyone help me please? It would be a big help.

fasta UCSC-Genome-browser • 2.6k views
ADD COMMENT
0
Entering edit mode
8.8 years ago
  1. You could export your GRanges object to a BED file. See this Biostars question and related answers: How To Write Data In A Granges Object To A Bed File.

  2. Assuming you are working with human genome build hg19, you could build a set of indexed FASTA files with samtools:

    $ cd /foo/bar/baz
    $ wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/*.fa.gz
    $ for fn in `ls *.fa.gz`; do gunzip $fn; done
    $ for fn in `ls *.fa`; do samtools faidx $fn; done
    
  3. You could use a helper script like the following Perl script to query the indexed FASTA files you made in step 2, against the BED file you exported in step 1:

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    
    # fastaDir contains per-chromosome UCSC FASTA (.fa) files and samtools-indexed index (.fai) files
    my $fastaDir = "/foo/bar/baz"; 
    
    while (<STDIN>) {
        chomp;
        my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
        my $queryKey = "$chr:$start-$stop";
        my $queryResult = `samtools faidx $fastaDir/$chr.fa $queryKey`; chomp $queryResult;
        my @lines = split("\n", $queryResult);
        my @seqs = @lines[1..(scalar @lines - 1)];
        my $seq = join("", @seqs);
        if ($strand eq "-") { $seq = revdnacomp($seq); }
        my $header = ">".join(":",($chr, $start, $stop, $id, $score, $strand));
        print STDOUT $header."\n".uc($seq)."\n";
    }
    
    sub revdnacomp {
        my $dna = shift @_;
        my $revcomp = reverse($dna);
        $revcomp =~ tr/ACGTacgt/TGCAtgca/;
        return $revcomp;
    }
    

    This converts six-column BED to FASTA, but it should work with the three-column BED data coming out of your GRange object.

    You might run this something like so:

    $ bed2fasta.pl < granges.bed > granges.fa
    

The file granges.fa is the FASTA file for genomic intervals defined in your GRanges object. You can then use Perl or Python (or any text parser) to analyze GC content.

ADD COMMENT

Login before adding your answer.

Traffic: 3885 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6