Code golf: detecting homopolymers of length N in the (human) genome
7
3
Entering edit mode
5.6 years ago

How would you feel about a code golf? Give us your pretiest, shortest, quickest pieces of code! Especially languages that nobody else has posted are welcome...

Objective: create a bed file of all homopolymers (repeats of the same nucleotides) of minimally length N based on the (human or other) reference genome.

The input is a fasta file, for example the human genome

Expected output example:

chr1    11540   11546
chr1    14908   14913
chr1    15468   15473
chr1    16318   16323
chr1    16505   16511
chr1    19735   19741
chr1    20316   20321

A useful benchmark set would be the human chromosome 22. My python code (see below), searching for 5-mers or longer, finds 244503 hits. The first 10 lines are:

chr22   16050521    16050526
chr22   16050548    16050553
chr22   16050570    16050575
chr22   16050578    16050583
chr22   16050679    16050684
chr22   16050835    16050840
chr22   16050932    16050937
chr22   16051192    16051198
chr22   16051303    16051310
chr22   16051311    16051317
code golf fasta homopolymer repeat • 5.6k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

We're waiting for your html solution Pierre

ADD REPLY
5
Entering edit mode
5.6 years ago

AWK solution

NB - this is a memory hog. It reads the entire FASTA file into memory. It could be designed to be more memory efficient, but works pretty quick on a cluster environment, and is also fine for smaller FASTA files on a laptop/desktop computer.

1,set desired minimum homopolymer length as shell variable

HOMOPOLYMER_LENGTH=10

2, determine homopolymers

This first 'flattens' the FASTA file and converts everything to upper case to avoid issues with soft- / repeat-masked sequence

awk '/^>/{if (NR==1) {print $0} else if (NR>1) {print "\n"$0}}; !/^>/ {printf toupper($0)}' GCA_000001405.15_GRCh38_no_alt_analysis_set.fna |\
  awk -v len="${HOMOPOLYMER_LENGTH}" -F '' '/^>/{gsub("^>", "", $0); chr=$0}; !/^>/ {base=$1; start=1; end=1; for (i=2; i<=NF+1; i++) {if (base==$(i)) {end=i} else if (base!=$(i)) {if (end-start>=len-1) {print chr"\t"start"\t"end"\t"base"\t"(end-start)+1}; start=i; end=i; base=$(i)}}}'

chr1    1   10000   N   10000
chr1    28589   28603   T   15
chr1    31720   31733   A   14
chr1    33450   33464   A   15
chr1    33531   33541   T   11
chr1    36352   36364   A   13
chr1    37241   37250   T   10
chr1    40640   40658   T   19
chr1    43797   43812   A   16
chr1    44174   44183   A   10
chr1    46403   46416   T   14
chr1    47319   47328   A   10
chr1    51865   51877   A   13
chr1    61351   61361   A   11
chr1    71176   71186   A   11
chr1    71847   71861   T   15
chr1    73843   73856   T   14
chr1    77175   77195   A   21
chr1    82134   82154   A   21
chr1    91201   91213   A   13

The extra 4th and 5th columns are the homopolymer base and its length, respectively.

Kevin

ADD COMMENT
3
Entering edit mode
5.6 years ago

This won't win any awards for brevity, but I'd already written it for other purposes. If it helps, the only bit of code it uses that I didn't write myself is argparse.

ADD COMMENT
3
Entering edit mode
5.6 years ago

I have a solution part in Go (because it does it really efficiently), part in Bash (for the golfing):

package main
import "fmt"
import "index/suffixarray"
import "io/ioutil"
import "regexp"
import "os"

func main() {
     M := os.Args[1]
     r := "[aA]{"+M+",}|[cC]{"+M+",}|[gG]{"+M+",}|[tT]{"+M+",}"
     R, _ := regexp.Compile(r)
     b, _ :=  ioutil.ReadAll(os.Stdin)
     index := suffixarray.New(b)
     fmt.Println(index.FindAllIndex( R, -1))        
}

The Go standard library includes a suffix array, which does all the heavy lifting for us.

Then you can do this for each chromosome (maybe in parallel for the whole genome, why not.) The chromosome is input by the standard input, and 5 in this case is the minimum size (called M in the Go code.)

Using chr22:

$ time tail -n+2 chr22.fa | ./homopolymer 5 | sed -e 's|\] \[|\n|g' | tr -d "[" | tr -d "]"  | sed 's|^|chr22\t|g' > homopolymer.bed


$ head homopolymer.bed 
chr22   16371531 16371536
chr22   16371581 16371586
chr22   16371589 16371594
chr22   16371692 16371697
chr22   16371851 16371856
chr22   16371950 16371955
chr22   16372215 16372221
chr22   16372329 16372336
chr22   16372337 16372343
chr22   16372418 16372425

Timing:

real    1m11.996s
user    1m12.240s
sys     0m0.408s
ADD COMMENT
3
Entering edit mode
5.6 years ago

C solution:

$ time ./biostar379454 chr22.fa 5
(...)
real    0m2.439s
user    0m0.475s
sys 0m0.538s

ADD COMMENT
0
Entering edit mode

I suspect this will be faster by not mmaping the file but instead reading in block sized chunks.

ADD REPLY
2
Entering edit mode
5.6 years ago

Takes in fasta file and a second parameter for homopolymer length.

It streams through each line to find homopolymers. Outputs chromosome, start, end, homopolymer base, length of homopolymer.

I tested it out on this fasta file:

>A
AGTCAAAA
GGGGTTTTCCCC
>B
AGTCCCCCTTTTAAAA
GGGGTTTTCCCC
AAAATTTT

And it outputs:

A       4       8       A       4
A       8       12      G       4
A       12      16      T       4
A       16      20      C       4
B       3       8       C       5
B       8       12      T       4
B       12      16      A       4
B       16      20      G       4
B       20      24      T       4
B       24      28      C       4
B       28      32      A       4
B       32      36      T       4

Here's the script

import sys

inFile = open(sys.argv[1],'r')
l = int(sys.argv[2])
chrome = base = start = end = -1

for line in inFile:
    if line[0] == '>':
        if end - start > l:
            print('\t'.join(map(str,[chrome,start,end,base,end - start])))
        chrome = line.strip().split()[0][1:]
        base = -1
        start = end = 0
    else:
        for b in line.strip():
            if b != base:
                if end - start > l:
                    print('\t'.join(map(str,[chrome,start,end,base,end - start])))
                start = end
                base = b
            end += 1

if end - start > l:
    print('\t'.join(map(str,[chrome,start,end,base,end - start])))
ADD COMMENT
1
Entering edit mode
5.6 years ago

I'll start with a Python snippet I had lying around, but can it be improved?

As input it takes just a fasta file (sys.argv[1]). This example searches for homopolymers of length 5 and up.

import sys
import re
from Bio import SeqIO

for nucl in ['A','C','T','G']:
    pattern = re.compile(nucl + "{5,}")
    for record in SeqIO.parse(sys.argv[1], "fasta"):
        for match in pattern.finditer(str(record.seq).upper()):
            print("\t".join([record.id, str(match.start()), str(match.end())]))
ADD COMMENT
1
Entering edit mode
5.6 years ago
zx8754 12k

Using R, there must be a Biostrings-way of doing this, I am just using rle:

library(Biostrings)

N = 10

# http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz
x <- readBStringSet("../chr22.fa")
xRLE <- rle(unlist(strsplit(toupper(toString(x)), "")))

pos <- cumsum(c(1, xRLE$lengths))
ix <- which(xRLE$lengths >= N & xRLE$values %in% c("A", "T", "C", "G"))

res <- data.frame(chrom = "chr22",
                  start = pos[ ix ],
                  end = pos[ ix ] + xRLE$lengths[ ix ] - 1)

head(res)
#   chrom    start      end
# 1 chr22 16069784 16069795
# 2 chr22 16072433 16072447
# 3 chr22 16082599 16082608
# 4 chr22 16085351 16085376
# 5 chr22 16085557 16085568
# 6 chr22 16096059 16096068

Let's test:

subseq(x, 16069784, 16069795)
# A BStringSet instance of length 1
# width seq                          names               
# [1]    12 aaaaaaaaaaaa                 chr22
subseq(x, 16072433, 16072447)
# A BStringSet instance of length 1
# width seq                          names               
# [1]    15 AAAAAAAAAAAAAAA              chr22
ADD COMMENT
0
Entering edit mode

Yep, my Biostrings solution see above. Edited the original script I posted (wrote like > 2 years ago as a beginner), now made it like 10 times shorter and parallelized.

ADD REPLY

Login before adding your answer.

Traffic: 1667 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6