How Would You Scan The Human Genome ?
5
2
Entering edit mode
12.7 years ago
toni ★ 2.2k

Hi all,

my goal today is to build some custom statistics along the human reference genome. Then I will be able to make comparisons with the content of a given BAM file, for instance.

Let's take a simple & concrete example :

Compute every 50bp the GC content in a 100bp window (overlapping sliding window)

So here, I would need to get a 100bp sequence each 50bp along the reference, and then compute my GC content, and store it somewhere. But how do I efficiently scan the human fasta reference ?

What libraries/tools do exist to accomplish such a task ? (samtools, Picard APIs ?)

Thanks, Tony

reference fasta api human • 5.3k views
ADD COMMENT
9
Entering edit mode
12.7 years ago

It's my own software, but I believe bedtools can do what you want.

Step 1: make your sliding windows:

bedtools makewindows -g hg19.chromsizes -w 100 -s 50 > hg19.100w50s.windows.bed
head -5 hg19.100w50s.windows.bed
chr1    0    100
chr1    50    150
chr1    100    200
chr1    150    250
chr1    200    300

Step 2: use the nuc tool to compute the nucleotide content of each window:

bedtools nuc -fi testingData/chr22.fa -bed hg19.100w50s.windows.bed -seq

Here's an example of the output (the 5th column is %GC, and the other columns are explained in the help (-h)),:

chr22    17238800    17238900    0.610000    0.390000    29    26    13    32    0    0    100    tacaagccaatgctcatttgccgttccataaatcctagggcacctcttaagaattatgctaaatctactctacccgtgctctataaatggattcactatt
chr22    17238850    17238950    0.620000    0.380000    35    21    17    27    0    0    100    gaattatgctaaatctactctacccgtgctctataaatggattcactattgtagacgccactaagaacattcatgattcaccagaggaggtaaaaatagc
chr22    17238900    17239000    0.610000    0.390000    37    17    22    24    0    0    100    gtagacgccactaagaacattcatgattcaccagaggaggtaaaaatagcagcattaatgggagtttggaagaagttgattctaaccctcataaatgact
chr22    17238950    17239050    0.600000    0.400000    34    13    27    26    0    0    100    agcattaatgggagtttggaagaagttgattctaaccctcataaatgactttgaggggttcaagactttagcagaggaagtaaccacagatatggtgaag
ADD COMMENT
0
Entering edit mode

Thanks Aaron. Yes, I was digging into BEDTools as well. Just 3 quick clarifications please :) 1 - No particular issues if I use GRC37 reference instead of UCSC hg19 ? 2 - Is there an BEDTools API or it's just command line tools ? 3 - What does BEDTools use inside to scan the fasta ref :) ?

ADD REPLY
0
Entering edit mode

There is an C++ API, but it is admittedly not yet well-documented, though the code themselves illustrate the usage. The pybedtools Python library provides the best API for programming around the functionality present in BEDTools.

ADD REPLY
0
Entering edit mode

"No particular issues if I use GRC37 reference instead of UCSC hg19". Not that I know of. The only issues that I can think of are the way the chromosomes are named must be consistent (e.g., 1 != chr1) and the reported chrom sizes much match what is in your FASTA file.

ADD REPLY
0
Entering edit mode

OK, Thank you very much.

ADD REPLY
5
Entering edit mode
12.7 years ago

The following C program should do the job:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

#define MAX_SEQ_LENGTH 100
#define WIN_SIZE 100
#define WIN_SHIFT 50

int main(int argc,char** argv)
    {
    int win_size=0;
    int position=0;
    char seqName[MAX_SEQ_LENGTH];
    char window[WIN_SIZE];
    int c;
    seqName[0]=0;
    while((c=fgetc(stdin))!=EOF)
        {
        if(c=='>')
            {
            int len=0;      
            while((c=fgetc(stdin))!=EOF && c!='\n')
                {
                seqName[len++]=c;
                }
            seqName[len]=0;
            position=0;
            win_size=0;
            continue;
            }
        if(isspace(c)) continue;
        window[win_size++]=c;
        if(win_size==WIN_SIZE)
            {
            int j;
            int gc=0;
            fputs(seqName,stdout);
            fputc('\t',stdout);
            fwrite(window,WIN_SIZE,sizeof(char),stdout);
            fputc('\t',stdout);
            for(j=0;j< WIN_SIZE;++j) if(strchr("GCSgcs",window[j])!=NULL) ++gc;
            printf("%d\t%f\n",position,gc/(double)WIN_SIZE);
            memmove(window,&window[WIN_SHIFT],WIN_SIZE-WIN_SHIFT);
            position+=WIN_SHIFT;
            win_size=WIN_SIZE-WIN_SHIFT;
            }
        }

    return EXIT_SUCCESS;
    }

compile and run:

$ gcc -Wall -O3 prog.c
$ curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrM.fa.gz" | gunzip -c  | ./a.out  | head
chrM    GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG    0   0.530000
chrM    TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT    50  0.540000
chrM    GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATTCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA    100 0.440000
chrM    CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT    150 0.290000
chrM    AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCA    200 0.330000
chrM    GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCAAACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC    250 0.530000
chrM    AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT    300 0.510000
chrM    CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTAGGCGGTATGCACTTTTAACAGTCACCCCCCAACTAACACA    350 0.410000
chrM    TTTATCTTTAGGCGGTATGCACTTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATACAACCCCC    400 0.430000
chrM    TTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAAC    450 0.530000
ADD COMMENT
1
Entering edit mode

time ./grc37GC100w50s < /references/human/b37/humang1kv37.fasta > grc37.gc.100w50s.txt

real 2m34.441s user 2m27.681s sys 0m6.224s

Nice ! Nothing beats a C parser. Thanks Pierre!

ADD REPLY
2
Entering edit mode
12.7 years ago
Mary 11k

You might want to look at how it was done for this track for some ideas. May not be what you want, but could offer some ideas.

http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=247068459&c=chr20&g=cpgIslandExt

ADD COMMENT
0
Entering edit mode
12.7 years ago

Bamtools has a C++ API which I have heard is good. I am going to give it a try as soon as my C++ improves.

ADD COMMENT
0
Entering edit mode
12.7 years ago
Johan ▴ 890

I think that the GATK framework should be up to the task. If I understand what your want to do correctly, I think that it can be achieved by either making some small modifications to the code of GCContentByIntervalWalker. Or by giving it giving it a file of intervals which match what you want (100 base-pair overlapping sliding window).

ADD COMMENT

Login before adding your answer.

Traffic: 1665 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6