Code golf: detecting homopolymers of length N in the (human) genome

6.5 years ago

WouterDeCoster 48k

How would you feel about a code golf? Give us your prettiest, shortest, quickest pieces of code! Languages that nobody else has posted yet are especially welcome...

Objective: create a BED file of all homopolymers (repeats of the same nucleotide) of at least length N, based on the (human or other) reference genome.

The input is a FASTA file, for example the human genome.

Expected output example:

chr1    11540   11546
chr1    14908   14913
chr1    15468   15473
chr1    16318   16323
chr1    16505   16511
chr1    19735   19741
chr1    20316   20321

A useful benchmark set would be human chromosome 22. My Python code (see below), searching for homopolymers of length 5 or longer, finds 244503 hits. The first 10 lines are:

chr22   16050521    16050526
chr22   16050548    16050553
chr22   16050570    16050575
chr22   16050578    16050583
chr22   16050679    16050684
chr22   16050835    16050840
chr22   16050932    16050937
chr22   16051192    16051198
chr22   16051303    16051310
chr22   16051311    16051317
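When comparing entries against this benchmark, it can help to check that each reported interval really is a homopolymer of at least N bases. The following is a minimal sketch, assuming 0-based, half-open coordinates (as the example output suggests) and a chromosome sequence already loaded as a string. The helper name `is_valid_hit` and the toy sequence are hypothetical, chosen for illustration only.

```python
def is_valid_hit(seq, start, end, n):
    """Return True if seq[start:end] is a maximal single-base run of length >= n.

    Assumes 0-based, half-open coordinates (BED convention).
    """
    run = seq[start:end].upper()
    if len(run) < n or len(set(run)) != 1:
        return False
    base = run[0]
    # The run must not extend further on either side (maximality).
    left_ok = start == 0 or seq[start - 1].upper() != base
    right_ok = end == len(seq) or seq[end].upper() != base
    return left_ok and right_ok


toy = "ACGTTTTTTAGGGGGC"
print(is_valid_hit(toy, 3, 9, 5))    # the six T's
print(is_valid_hit(toy, 10, 15, 5))  # the five G's
print(is_valid_hit(toy, 3, 8, 5))    # truncated run, so not maximal
```

Running this over every line of a candidate BED file gives a quick sanity check before comparing hit counts.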

code golf fasta homopolymer repeat • 7.0k views

ADD COMMENT • link updated 19 months ago by Ram 45k • written 6.5 years ago by WouterDeCoster 48k

DUPLICATE ! DUPLICATE ! DUPLICATE ! How to extract all the simple repeats from the hg19 reference genome

:-D

ADD REPLY • link 6.5 years ago by Pierre Lindenbaum 166k

We're waiting for your html solution Pierre

ADD REPLY • link 6.5 years ago by WouterDeCoster 48k

6.5 years ago

Kevin Blighe 90k

AWK solution

NB: this is a memory hog. It reads the entire FASTA file into memory. It could be made more memory efficient, but it runs quickly in a cluster environment and is also fine for smaller FASTA files on a laptop or desktop computer.

1. Set the desired minimum homopolymer length as a shell variable

HOMOPOLYMER_LENGTH=10

2. Determine homopolymers

This first 'flattens' the FASTA file and converts everything to upper case to avoid issues with soft- / repeat-masked sequence. The second command relies on an empty field separator (`-F ''`) splitting the line into single characters, which gawk supports but POSIX does not guarantee.

awk '/^>/{if (NR==1) {print $0} else if (NR>1) {print "\n"$0}}; !/^>/ {printf "%s", toupper($0)}' GCA_000001405.15_GRCh38_no_alt_analysis_set.fna |\
  awk -v len="${HOMOPOLYMER_LENGTH}" -F '' '/^>/{gsub("^>", "", $0); chr=$0}; !/^>/ {base=$1; start=1; end=1; for (i=2; i<=NF+1; i++) {if (base==$(i)) {end=i} else if (base!=$(i)) {if (end-start>=len-1) {print chr"\t"start"\t"end"\t"base"\t"(end-start)+1}; start=i; end=i; base=$(i)}}}'

chr1    1   10000   N   10000
chr1    28589   28603   T   15
chr1    31720   31733   A   14
chr1    33450   33464   A   15
chr1    33531   33541   T   11
chr1    36352   36364   A   13
chr1    37241   37250   T   10
chr1    40640   40658   T   19
chr1    43797   43812   A   16
chr1    44174   44183   A   10
chr1    46403   46416   T   14
chr1    47319   47328   A   10
chr1    51865   51877   A   13
chr1    61351   61361   A   11
chr1    71176   71186   A   11
chr1    71847   71861   T   15
chr1    73843   73856   T   14
chr1    77175   77195   A   21
chr1    82134   82154   A   21
chr1    91201   91213   A   13

The extra 4th and 5th columns are the homopolymer base and its length, respectively.
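A strict BED file uses a 0-based start and an exclusive end, whereas the output above appears to be 1-based and inclusive (the first row, `chr1 1 10000`, spans 10000 bases). A minimal sketch of the conversion, under that assumption; the function name `to_bed` is hypothetical:

```python
def to_bed(line):
    """Convert 'chrom start end base length' with 1-based, inclusive
    coordinates into a BED-style line (0-based start, exclusive end)."""
    chrom, start, end, base, length = line.split()
    # Only the start shifts: a 1-based inclusive end equals a 0-based exclusive end.
    return "\t".join([chrom, str(int(start) - 1), end, base, length])


print(to_bed("chr1\t28589\t28603\tT\t15"))
```

After the conversion, end minus start equals the length in the fifth column.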

Kevin

ADD COMMENT • link 4.8 years ago by Kevin Blighe 90k

6.5 years ago

Devon Ryan 105k

This won't win any awards for brevity, but I'd already written it for other purposes. If it helps, the only bit of code it uses that I didn't write myself is argparse.

	#!/usr/bin/env python
	# This can take up to ~20 minutes and use up to ~2GB RAM for mammals
	import argparse
	import sys
	import py2bit
	from deeptoolsintervals import GTF, tree
	from deeptoolsintervals.parse import openPossiblyCompressed, parseExonBounds, findRandomLabel

	parser = argparse.ArgumentParser(description="Generate a blacklist file of polyX stretches of a given minimum length not within a specified distance of a TES")
	parser.add_argument("--output", "-o", help="Output file", required=True)
	parser.add_argument("--tb", help="2bit file", required=True)
	parser.add_argument("--bed", help="BED file containing transcripts", required=True)
	parser.add_argument("--minLength", help="Minimum length of a polyX stretch (default: 6)", type=int, default=6)
	parser.add_argument("--base", help="The base to check for (A or T)", required=True, choices=['A', 'T'])
	parser.add_argument("--minDistance", help="Minimum distance from a TES to not exclude a site (default: 100)", type=int, default=100)
	parser.add_argument("--extend", help="Number of bases to extend regions (default: 5)", type=int, default=5)
	args = parser.parse_args()

	tb = py2bit.open(args.tb)
	o = open(args.output, "w")

	class TES(GTF):
	    def parseBEDcore(self, line, ncols):
	        strand = 3
	        cols = line.split("\t")
	        name = "{0}:{1}-{2}".format(cols[0], cols[1], cols[2])

	        if int(cols[1]) < 0:
	            cols[1] = 0

	        if int(cols[1]) >= int(cols[2]):
	            sys.stderr.write("Warning: {0}:{1}-{2} is an invalid BED interval! Ignoring it.\n".format(cols[0], cols[1], cols[2]))
	            return

	        # BED6/BED12: set name and strand
	        score = '.'
	        if ncols > 3:
	            name = cols[3]
	            if cols[5] == '+':
	                strand = 0
	            elif cols[5] == '-':
	                strand = 1
	            score = cols[4]

	        # filter by strand
	        if strand != 3:
	            if self.strand == "+" and strand == 1:
	                return
	            elif self.strand == "-" and strand == 0:
	                return

	        # Ensure that the name is unique
	        name = findRandomLabel(self.exons[self.labelIdx], name)

	        assert(len(cols) == 12)
	        exons = parseExonBounds(int(cols[1]), int(cols[2]), int(cols[9]), cols[10], cols[11])

	        # Extend by strand around the TES
	        exonsFinal = []
	        lenLeft = 2 * self.minDistance + 1
	        if strand == 3 or (self.strand == "+" and strand == 0):
	            exons[-1] = (exons[-1][0], exons[-1][1] + self.minDistance)
	            for exon in exons[::-1]:
	                exonLen = exon[1] - exon[0]
	                if exonLen <= lenLeft:
	                    lenLeft -= exonLen
	                    exonsFinal.insert(0, exon)
	                else:
	                    exonsFinal.insert(0, (exon[1] - lenLeft, exon[1]))
	                    break
	                if lenLeft <= 0:
	                    break
	        elif self.strand == "-" and strand == 1:
	            _ = exons[0][0] - self.minDistance
	            _ = max(0, _)
	            exons[0] = (_, exons[0][1])
	            for exon in exons:
	                exonLen = exon[1] - exon[0]
	                if exonLen <= lenLeft:
	                    lenLeft -= exonLen
	                    exonsFinal.append(exon)
	                else:
	                    exonsFinal.append((exon[0], exon[0] + lenLeft))
	                    break
	                if lenLeft <= 0:
	                    break
	        if len(exonsFinal) == 0:
	            return

	        self.tree.addEntry(self.mungeChromosome(cols[0]), exonsFinal[0][0], exonsFinal[-1][1], name, strand, self.labelIdx, score)
	        self.exons[self.labelIdx][name] = exonsFinal


	    def __init__(self, fname, minDistance=100, strand="+"):
	        self.fname = [fname]
	        self.filename = fname
	        self.filename = ""
	        self.chroms = []
	        self.exons = []
	        self.labels = []
	        self.transcriptIDduplicated = []
	        self.tree = tree.initTree()
	        self.labelIdx = 0
	        self.keepExons = True
	        self.defaultGroup = None
	        self.verbose = False
	        self.minDistance = minDistance
	        self.strand = strand

	        fp = openPossiblyCompressed(fname)
	        line, labelColumn = self.firstNonComment(fp)
	        assert(line) # This will only fail on empty files
	        line = line.strip()

	        ftype = self.inferType(fp, line, labelColumn)
	        self.parseBED(fp, line, 12, labelColumn)
	        fp.close()

	        # Sanity check
	        if self.tree.countEntries() == 0:
	            raise RuntimeError("None of the input BED/GTF files had valid regions")

	        # vine -> tree
	        self.tree.finish()


	def processLast(last, chrom, idx, idx2, o, bed):
	    if args.base == "A":
	        s = max(0, idx - args.extend)
	        e = idx2
	    else:
	        s = idx
	        e = idx2 + args.extend
	    for overlaps in bed.findOverlaps(chrom, s, e):
	        for exon in overlaps[4]:
	            if exon[0] < s and exon[1] > s:
	                return
	            if exon[0] >= s and exon[0] < e:
	                return
	    if not last[0]:
	        last[0] = chrom
	        last[1] = s
	        last[2] = e
	    else:
	        if last[0] == chrom and s <= last[2]:
	            last[2] = e
	        else:
	            o.write("{}\t{}\t{}\n".format(*last))
	            last[0] = chrom
	            last[1] = s
	            last[2] = e


	if args.base == "A":
	    bed = TES(args.bed, minDistance=args.minDistance)
	else:
	    bed = TES(args.bed, minDistance=args.minDistance, strand="-")

	last = [None, None, None]

	for chrom, chromLength in tb.chroms().items():
	    s = tb.sequence(chrom)
	    idx = 0
	    idx2 = 0
	    while idx < chromLength - 1:
	        if s[idx] == args.base:
	            idx2 = idx + 1
	            while s[idx2] == args.base:
	                if idx2 + 1 >= chromLength:
	                    break
	                idx2 += 1
	            if idx2 - idx >= args.minLength - 1:
	                processLast(last, chrom, idx, idx2 + 1, o, bed)
	            idx = idx2 + 1
	        else:
	            idx += 1

	if last[0] is not None:
	    o.write("{}\t{}\t{}\n".format(*last))
	o.close()
	tb.close()

ADD COMMENT • link 6.5 years ago by Devon Ryan 105k

6.5 years ago

mbelmadani ★ 1.4k

I have a solution that is partly in Go (because it does this really efficiently) and partly in Bash (for the golfing):

package main
import "fmt"
import "index/suffixarray"
import "io/ioutil"
import "regexp"
import "os"

func main() {
     M := os.Args[1]
     r := "[aA]{"+M+",}|[cC]{"+M+",}|[gG]{"+M+",}|[tT]{"+M+",}"
     R, _ := regexp.Compile(r)
     b, _ :=  ioutil.ReadAll(os.Stdin)
     index := suffixarray.New(b)
     fmt.Println(index.FindAllIndex( R, -1))        
}

The Go standard library includes a suffix array, which does all the heavy lifting for us.

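You can then run this for each chromosome (perhaps in parallel for the whole genome, why not). The chromosome is read from standard input, and 5 in this case is the minimum size (called M in the Go code). Note that the reported offsets count every byte on standard input, including line breaks, and a run split across two FASTA lines is not matched; piping the sequence through `tr -d '\n'` first avoids both issues. The output below was produced without that step, which would explain why its coordinates differ from the benchmark in the question.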

Using chr22:

$ time tail -n+2 chr22.fa | ./homopolymer 5 | sed -e 's|\] \[|\n|g' | tr -d "[" | tr -d "]"  | sed 's|^|chr22\t|g' > homopolymer.bed


$ head homopolymer.bed 
chr22   16371531 16371536
chr22   16371581 16371586
chr22   16371589 16371594
chr22   16371692 16371697
chr22   16371851 16371856
chr22   16371950 16371955
chr22   16372215 16372221
chr22   16372329 16372336
chr22   16372337 16372343
chr22   16372418 16372425

Timing:

real    1m11.996s
user    1m12.240s
sys     0m0.408s
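The `sed`/`tr` chain above only reshapes what Go prints for a slice of index pairs, which looks like `[[a b] [c d] ...]`. As an alternative, here is a minimal Python sketch of the same post-processing step. It assumes the Go program's output was captured as text and that the offsets are already genomic coordinates; the function name `go_indices_to_bed` and the sample string are hypothetical.

```python
import re


def go_indices_to_bed(text, chrom):
    """Turn '[[a b] [c d] ...]' (Go's default printing of [][]int) into BED lines."""
    pairs = re.findall(r"\[(\d+) (\d+)\]", text)
    return ["%s\t%s\t%s" % (chrom, start, end) for start, end in pairs]


sample = "[[3 9] [10 15]]"
print("\n".join(go_indices_to_bed(sample, "chr22")))
```

An empty result (Go prints `[]`) simply yields no lines.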

ADD COMMENT • link 6.5 years ago by mbelmadani ★ 1.4k

6.5 years ago

Pierre Lindenbaum 166k

C solution:

$ time ./biostar379454 chr22.fa 5
(...)
real    0m2.439s
user    0m0.475s
sys 0m0.538s

	/**
	https://www.biostars.org/p/379454/#379505

	Code golf: detecting homopolymers of length N in the (human) genome

	Author: Pierre Lindenbaum

	compilation:

	gcc -O3 -Wall -o biostar379454 biostar379454.c

	*/
	#include <stdio.h>
	#include <fcntl.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <string.h>
	#include <stdlib.h>
	#include <errno.h>
	#include <ctype.h>

	/* Print the run that just ended as a 0-based, half-open interval, then reset the counter. */
	#define DUMP if(len_repeat>=len) {fputs(seq_name,stdout);printf("\t%d\t%d\t%c[%d]\n",pos-len_repeat,pos,prev_c,len_repeat); }len_repeat=0;
	#define BUF_STDOUT 1000000
	int main(int argc, char const *argv[]) {
	    char *seq;
	    char* buff=NULL;
	    size_t size,x;
	    int len=0,prev_c=-1,len_repeat=0,pos=0;
	    char* seq_name=NULL;
	    struct stat s;
	    int fd;
	    if(argc!=3) {
	        fprintf(stderr,"Usage: %s fasta size.\n",argv[0]);
	        return EXIT_FAILURE;
	    }

	    fd = open (argv[1], O_RDONLY);
	    if(fd < 0) {
	        fprintf(stderr,"Cannot open: %s %s.\n",argv[1],strerror(errno));
	        return EXIT_FAILURE;
	    }
	    len = atoi(argv[2]);

	    buff=(char*)malloc(BUF_STDOUT);
	    if(buff==NULL) {
	        fprintf(stderr,"Out of memory\n");
	        return EXIT_FAILURE;
	    }
	    setvbuf(stdout, buff, _IOFBF, BUF_STDOUT);

	    if(len < 2) {
	        fprintf(stderr,"bad length %s.\n",argv[2]);
	        return EXIT_FAILURE;
	    }
	    /* Get the size of the file. */
	    if(fstat (fd, & s)!=0) {
	        fprintf(stderr,"Cannot stat: %s %s.\n",argv[1],strerror(errno));
	        return EXIT_FAILURE;
	    }
	    size = s.st_size;

	    seq = (char *) mmap (0, size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
	    if(seq==MAP_FAILED) {
	        fprintf(stderr,"Cannot mmap: %s %s.\n",argv[1],strerror(errno));
	        return EXIT_FAILURE;
	    }
	    x=0;
	    while(x<size) {
	        if(seq[x]=='>') {
	            size_t x0=x;
	            DUMP;
	            free(seq_name);
	            /* check the bound before reading seq[x] */
	            while(x < size && seq[x]!='\n') x++;
	            seq_name = strndup(&seq[x0+1],x-x0-1);
	            len_repeat=0;
	            pos=0;
	            prev_c=-1;
	        }
	        else {
	            int c=toupper((unsigned char)seq[x++]);
	            if(isspace(c)) continue;
	            if(prev_c==c) {
	                ++len_repeat;
	            }
	            else
	            {
	                /* pos is still the exclusive end of the previous run */
	                DUMP;
	                prev_c=c;
	                len_repeat=1; /* the new run already contains this base */
	            }
	            ++pos;
	        }
	    }
	    /* flush a run that reaches the end of the file */
	    DUMP;
	    fflush(stdout);
	    free(buff);
	    munmap(seq, size);
	    free(seq_name);
	    return 0;
	}

ADD COMMENT • link 6.5 years ago by Pierre Lindenbaum 166k

I suspect this will be faster by not mmaping the file but instead reading in block sized chunks.
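To illustrate what reading in block-sized chunks involves, here is a minimal Python sketch (not the C code, and not a speed comparison, since a byte loop in Python is slow). The key point is that the run state must be carried across chunk boundaries. The function name `scan_chunks`, the 1 MiB default chunk size, and the restriction to A/C/G/T runs are assumptions made for illustration.

```python
import os
import tempfile

NEWLINE, CR, GT = 10, 13, 62  # byte values of '\n', '\r' and '>'
ACGT = b"ACGT"


def scan_chunks(path, n, chunk_size=1 << 20):
    """Yield (chrom, start, end) for A/C/G/T runs of length >= n.

    Coordinates are 0-based, half-open. The file is read in fixed-size
    chunks; header and run state survive from one chunk to the next.
    """
    chrom, header = None, None   # header is a bytearray while inside a '>' line
    prev, run, pos = None, 0, 0  # current base, its run length, bases seen so far
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for byte in chunk:
                if header is not None:
                    if byte == NEWLINE:  # end of the header line
                        chrom = (header.decode().split() or [""])[0]
                        header = None
                    else:
                        header.append(byte)
                    continue
                if byte in (NEWLINE, CR):
                    continue
                if 97 <= byte <= 122:  # lower-case (soft-masked) to upper-case
                    byte -= 32
                if byte == prev:
                    run += 1
                    pos += 1
                    continue
                # The previous run ended here; report it if it is long enough.
                if run >= n and prev in ACGT:
                    yield chrom, pos - run, pos
                if byte == GT:  # a new record starts
                    header = bytearray()
                    prev, run, pos = None, 0, 0
                else:
                    prev, run = byte, 1
                    pos += 1
    if run >= n and prev in ACGT:  # run reaching the end of the file
        yield chrom, pos - run, pos


# Toy demonstration with a tiny chunk size to force runs across chunk boundaries.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">A desc\nAGTCAAAA\nGGGGTTTTCCCC\n>B\nacgtttttt\n")
demo_hits = list(scan_chunks(tmp.name, 4, chunk_size=3))
os.remove(tmp.name)
for chrom, start, end in demo_hits:
    print(chrom, start, end, sep="\t")
```

Whether this approach beats `mmap` in C would need to be measured; the sketch only shows the bookkeeping.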

ADD REPLY • link 6.5 years ago by Devon Ryan 105k

6.5 years ago

Damian Kao 16k

Takes a FASTA file and a second parameter for the minimum homopolymer length.

It streams through each line to find homopolymers and outputs chromosome, start, end, homopolymer base, and homopolymer length.

I tested it out on this fasta file:

>A
AGTCAAAA
GGGGTTTTCCCC
>B
AGTCCCCCTTTTAAAA
GGGGTTTTCCCC
AAAATTTT

And it outputs:

A       4       8       A       4
A       8       12      G       4
A       12      16      T       4
A       16      20      C       4
B       3       8       C       5
B       8       12      T       4
B       12      16      A       4
B       16      20      G       4
B       20      24      T       4
B       24      28      C       4
B       28      32      A       4
B       32      36      T       4

Here's the script:

import sys

l = int(sys.argv[2])  # minimum homopolymer length
chrome = base = start = end = -1

with open(sys.argv[1], 'r') as inFile:
    for line in inFile:
        if line[0] == '>':
            # New record: report a run that reached the end of the previous one
            if end - start >= l:
                print('\t'.join(map(str,[chrome,start,end,base,end - start])))
            chrome = line.strip().split()[0][1:]
            base = -1
            start = end = 0
        else:
            for b in line.strip():
                if b != base:
                    # The base changed: report the run that just ended
                    if end - start >= l:
                        print('\t'.join(map(str,[chrome,start,end,base,end - start])))
                    start = end
                    base = b
                end += 1

# Report a run that reaches the end of the file
if end - start >= l:
    print('\t'.join(map(str,[chrome,start,end,base,end - start])))

ADD COMMENT • link updated 19 months ago by Ram 45k • written 6.5 years ago by Damian Kao 16k

6.5 years ago

WouterDeCoster 48k

I'll start with a Python snippet I had lying around, but can it be improved?

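As input it takes just a FASTA file (sys.argv[1]). This example searches for homopolymers of length 5 and up. Because the outer loop runs over the four nucleotides, the file is parsed four times and the hits are grouped by base rather than sorted by position.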

import sys
import re
from Bio import SeqIO

for nucl in ['A','C','T','G']:
    pattern = re.compile(nucl + "{5,}")
    for record in SeqIO.parse(sys.argv[1], "fasta"):
        for match in pattern.finditer(str(record.seq).upper()):
            print("\t".join([record.id, str(match.start()), str(match.end())]))
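One possible improvement is to match all four bases with a single pattern, so that each record is scanned once and hits come out in positional order. The sketch below uses a backreference (`([ACGT])\1{n-1,}`) and a plain-Python FASTA reader instead of Biopython, purely to stay dependency-free; the function names `homopolymers` and `_hits` are hypothetical, and this is an illustration rather than a golfed entry.

```python
import re


def _hits(name, chunks, pattern):
    """Yield (name, start, end) for every match in the joined, upper-cased record."""
    seq = "".join(chunks).upper()
    for match in pattern.finditer(seq):
        yield name, match.start(), match.end()


def homopolymers(fasta_lines, n=5):
    """Yield 0-based, half-open intervals of A/C/G/T runs of length >= n."""
    # One captured base followed by at least n-1 copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (n - 1))
    name, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield from _hits(name, chunks, pattern)
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield from _hits(name, chunks, pattern)


toy_fasta = [">A", "AGTCAAAA", "GGGGTTTTCCCC", ">B", "acgtttttt"]
for hit in homopolymers(toy_fasta, n=4):
    print("\t".join(map(str, hit)))
```

With a real file, `homopolymers(open(sys.argv[1]), n=5)` would stream the records one at a time, holding only the current sequence in memory.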

ADD COMMENT • link updated 19 months ago by Ram 45k • written 6.5 years ago by WouterDeCoster 48k

6.5 years ago

zx8754 12k

Using R. There must be a Biostrings way of doing this, but I am just using rle:

library(Biostrings)

N = 10

# http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz
x <- readBStringSet("../chr22.fa")
xRLE <- rle(unlist(strsplit(toupper(toString(x)), "")))

pos <- cumsum(c(1, xRLE$lengths))
ix <- which(xRLE$lengths >= N & xRLE$values %in% c("A", "T", "C", "G"))

res <- data.frame(chrom = "chr22",
                  start = pos[ ix ],
                  end = pos[ ix ] + xRLE$lengths[ ix ] - 1)

head(res)
#   chrom    start      end
# 1 chr22 16069784 16069795
# 2 chr22 16072433 16072447
# 3 chr22 16082599 16082608
# 4 chr22 16085351 16085376
# 5 chr22 16085557 16085568
# 6 chr22 16096059 16096068

Let's test:

subseq(x, 16069784, 16069795)
# A BStringSet instance of length 1
# width seq                          names               
# [1]    12 aaaaaaaaaaaa                 chr22
subseq(x, 16072433, 16072447)
# A BStringSet instance of length 1
# width seq                          names               
# [1]    15 AAAAAAAAAAAAAAA              chr22
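For readers who do not use R, the run-length-encoding idea (`rle` plus `cumsum`) can be sketched in Python with `itertools.groupby`. This is an illustrative analogue, not a translation of the Biostrings code; the function name `rle_hits` and the toy sequence are hypothetical. Like the R version, it returns 1-based, inclusive positions.

```python
from itertools import groupby


def rle_hits(seq, n):
    """Return (start, end) pairs, 1-based and inclusive, for A/C/G/T runs >= n."""
    hits = []
    pos = 0  # number of bases consumed so far (0-based offset of the next run)
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= n and base in "ACGT":
            hits.append((pos + 1, pos + length))
        pos += length
    return hits


toy_seq = "cc" + "a" * 12 + "g" + "T" * 10 + "n"
hits_1based = rle_hits(toy_seq, 10)
print(hits_1based)
# For BED output, subtract 1 from each start and keep the end unchanged.
print([(start - 1, end) for start, end in hits_1based])
```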

ADD COMMENT • link updated 19 months ago by Ram 45k • written 6.5 years ago by zx8754 12k

Yep, see my Biostrings solution above. I edited the original script I posted (written more than 2 years ago as a beginner); it is now about 10 times shorter and parallelized.

ADD REPLY • link 6.5 years ago by ATpoint 90k
