Question

How Would You Scan The Human Genome ?

2

Entering edit mode

13.2 years ago

toni ★ 2.2k

Hi all,

my goal today is to build some custom statistics along the human reference genome. Then I will be able to make comparisons with the content of a given BAM file, for instance.

Let's take a simple & concrete example :

Compute every 50bp the GC content in a 100bp window (overlapping sliding window)

So here, I would need to get a 100bp sequence each 50bp along the reference, and then compute my GC content, and store it somewhere. But how do I efficiently scan the human fasta reference ?

What libraries/tools do exist to accomplish such a task ? (samtools, Picard APIs ?)

Thanks, Tony

reference fasta api human • 5.7k views

ADD COMMENT • link updated 13.2 years ago by Pierre Lindenbaum 166k • written 13.2 years ago by toni ★ 2.2k

score 9 · Answer 1 · 2012-03-14

9

Entering edit mode

13.2 years ago

Aaronquinlan 12k

It's my own software, but I believe bedtools can do what you want.

Step 1: make your sliding windows:

bedtools makewindows -g hg19.chromsizes -w 100 -s 50 > hg19.100w50s.windows.bed
head -5 hg19.100w50s.windows.bed
chr1    0    100
chr1    50    150
chr1    100    200
chr1    150    250
chr1    200    300

Step 2: use the nuc tool to compute the nucleotide content of each window:

bedtools nuc -fi testingData/chr22.fa -bed hg19.100w50s.windows.bed -seq

Here's an example of the output (the 5th column is %GC, and the other columns are explained in the help (-h)),:

chr22    17238800    17238900    0.610000    0.390000    29    26    13    32    0    0    100    tacaagccaatgctcatttgccgttccataaatcctagggcacctcttaagaattatgctaaatctactctacccgtgctctataaatggattcactatt
chr22    17238850    17238950    0.620000    0.380000    35    21    17    27    0    0    100    gaattatgctaaatctactctacccgtgctctataaatggattcactattgtagacgccactaagaacattcatgattcaccagaggaggtaaaaatagc
chr22    17238900    17239000    0.610000    0.390000    37    17    22    24    0    0    100    gtagacgccactaagaacattcatgattcaccagaggaggtaaaaatagcagcattaatgggagtttggaagaagttgattctaaccctcataaatgact
chr22    17238950    17239050    0.600000    0.400000    34    13    27    26    0    0    100    agcattaatgggagtttggaagaagttgattctaaccctcataaatgactttgaggggttcaagactttagcagaggaagtaaccacagatatggtgaag

ADD COMMENT • link 13.2 years ago by Aaronquinlan 12k

0

Entering edit mode

Thanks Aaron. Yes, I was digging into BEDTools as well. Just 3 quick clarifications please :) 1 - No particular issues if I use GRC37 reference instead of UCSC hg19 ? 2 - Is there an BEDTools API or it's just command line tools ? 3 - What does BEDTools use inside to scan the fasta ref :) ?

ADD REPLY • link 13.2 years ago by toni ★ 2.2k

0

Entering edit mode

There is an C++ API, but it is admittedly not yet well-documented, though the code themselves illustrate the usage. The pybedtools Python library provides the best API for programming around the functionality present in BEDTools.

ADD REPLY • link 13.2 years ago by Aaronquinlan 12k

0

Entering edit mode

"No particular issues if I use GRC37 reference instead of UCSC hg19". Not that I know of. The only issues that I can think of are the way the chromosomes are named must be consistent (e.g., 1 != chr1) and the reported chrom sizes much match what is in your FASTA file.

ADD REPLY • link 13.2 years ago by Aaronquinlan 12k

0

Entering edit mode

OK, Thank you very much.

ADD REPLY • link 13.2 years ago by toni ★ 2.2k

score 5 · Answer 2 · 2012-03-14

The following C program should do the job:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

#define MAX_SEQ_LENGTH 100
#define WIN_SIZE 100
#define WIN_SHIFT 50

int main(int argc,char** argv)
    {
    int win_size=0;
    int position=0;
    char seqName[MAX_SEQ_LENGTH];
    char window[WIN_SIZE];
    int c;
    seqName[0]=0;
    while((c=fgetc(stdin))!=EOF)
        {
        if(c=='>')
            {
            int len=0;      
            while((c=fgetc(stdin))!=EOF && c!='\n')
                {
                seqName[len++]=c;
                }
            seqName[len]=0;
            position=0;
            win_size=0;
            continue;
            }
        if(isspace(c)) continue;
        window[win_size++]=c;
        if(win_size==WIN_SIZE)
            {
            int j;
            int gc=0;
            fputs(seqName,stdout);
            fputc('\t',stdout);
            fwrite(window,WIN_SIZE,sizeof(char),stdout);
            fputc('\t',stdout);
            for(j=0;j< WIN_SIZE;++j) if(strchr("GCSgcs",window[j])!=NULL) ++gc;
            printf("%d\t%f\n",position,gc/(double)WIN_SIZE);
            memmove(window,&window[WIN_SHIFT],WIN_SIZE-WIN_SHIFT);
            position+=WIN_SHIFT;
            win_size=WIN_SIZE-WIN_SHIFT;
            }
        }

    return EXIT_SUCCESS;
    }

compile and run:

$ gcc -Wall -O3 prog.c
$ curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrM.fa.gz" | gunzip -c  | ./a.out  | head
chrM    GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG    0   0.530000
chrM    TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT    50  0.540000
chrM    GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATTCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA    100 0.440000
chrM    CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT    150 0.290000
chrM    AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCA    200 0.330000
chrM    GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCAAACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC    250 0.530000
chrM    AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT    300 0.510000
chrM    CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTAGGCGGTATGCACTTTTAACAGTCACCCCCCAACTAACACA    350 0.410000
chrM    TTTATCTTTAGGCGGTATGCACTTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATACAACCCCC    400 0.430000
chrM    TTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAAC    450 0.530000

Ram · Answer 3 · 2012-03-14

2

Entering edit mode

13.2 years ago

Mary 11k

You might want to look at how it was done for this track for some ideas. May not be what you want, but could offer some ideas.

http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=247068459&c=chr20&g=cpgIslandExt

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 13.2 years ago by Mary 11k

score 0 · Answer 4 · 2012-03-14

0

Entering edit mode

13.2 years ago

Zev.Kronenberg 12k

Bamtools has a C++ API which I have heard is good. I am going to give it a try as soon as my C++ improves.

ADD COMMENT • link 13.2 years ago by Zev.Kronenberg 12k

score 0 · Answer 5 · 2012-03-14

I think that the GATK framework should be up to the task. If I understand what your want to do correctly, I think that it can be achieved by either making some small modifications to the code of GCContentByIntervalWalker. Or by giving it giving it a file of intervals which match what you want (100 base-pair overlapping sliding window).