Hi all:
In order to convert BED files to bigBed using the UCSC tool (bedToBigBed), a chrom.sizes
file is needed. How can these sizes be computed without querying the UCSC and Ensembl APIs?
Try samtools
samtools faidx genome.fa
cut -f1,2 genome.fa.fai > genome.size
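As a worked example, here is the same recipe applied to one of the Ensembl FASTA dumps discussed later in this thread (the release number and file name below are illustrative; adjust them to your organism and release):
# Download a genome FASTA from the Ensembl FTP (illustrative path).
wget http://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# samtools faidx cannot read plain gzip, so decompress first (or recompress with bgzip).
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# The .fai index produced here stores the name and length of every sequence.
samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa
# Columns 1 (name) and 2 (length) of the .fai file are exactly a chrom.sizes file.
cut -f1,2 Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai > GRCh38.chrom.sizes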
~$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -N -e 'select chrom,size from chromInfo' > out.txt
$ cat out.txt
chr1 249250621
chr2 243199373
chr3 198022430
chr4 191154276
chr5 180915260
(....)
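The same query should work for any assembly hosted at UCSC by swapping the database name, e.g. for hg38:
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg38 -N -e 'select chrom,size from chromInfo' > hg38.chrom.sizes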
or just
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz" | gunzip -c | cut -f 1,2 > out.txt
These numbers are pre-computed from the FASTA genome files, e.g. for chr1:
$ curl -s "http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/chr1.fa.gz" | gunzip -c | grep -v '^>' | tr -d '\n' | wc -c
249250621
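If you want to compute the sizes for every sequence in a multi-FASTA yourself, without any indexing step, a small awk sketch does it in one pass (genome.fa is a placeholder; sequence names are assumed to be the first word after '>'):
# Print "name<TAB>length" for each sequence in genome.fa.
awk '/^>/ { if (name) print name "\t" len; name = substr($1, 2); len = 0; next }
     { len += length($0) }
     END { if (name) print name "\t" len }' genome.fa > genome.chrom.sizes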
The chrom.sizes file is computed in the following way for all assemblies at UCSC:
faToTwoBit organism.fa organism.2bit
twoBitInfo organism.2bit stdout | sort -k2rn > organism.chrom.sizes
If you know the URL to a 2bit file we've already made, twoBitInfo accepts a URL like so:
twoBitInfo -udcDir=. http://genome-test.cse.ucsc.edu/~hiram/hubs/Plants/ricCom1/ricCom1.2bit stdout | sort -k2nr > ricCom1.chrom.sizes
If you want the chrom.sizes file for a particular assembly, you can download it from a URL like the following: http://hgdownload.cse.ucsc.edu/goldenPath/$db/bigZips/$db.chrom.sizes
where $db is the assembly name like hg38, mm10, anoCar2, panTro5, etc.
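Substituting $db = hg38, for example:
curl -O http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes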
You can find the faToTwoBit and twoBitInfo programs in our list of publicly available utilities in the directory appropriate to your operating system:
http://hgdownload.soe.ucsc.edu/admin/exe/
If you have further questions about the UCSC Genome Browser or our utilities or data, feel free to send an email to one of the mailing lists below:
ChrisL from the UCSC Genome Browser
Try faCount from the UCSC Kent utils. Usage and output look like this:
$ faCount hg38.fa
#seq len A C G T N cpg
chr1 248956422 67070277 48055043 48111528 67244164 18475410 2375159
chr10 133797422 38875926 27639505 27719976 39027555 534460 1388978
chr11 135086622 39286730 27903257 27981801 39361954 552880 1333114
chr11_KI270721v1_random 100316 18375 31042 31012 19887 0 3394
.
.
.
total 3209286105 898285419 623727342 626335137 900967885 159970322 30979743
For your purpose, the first two columns would suffice.
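A sketch of extracting them into a chrom.sizes file (dropping the header and total rows; hg38.fa is a placeholder):
faCount hg38.fa | grep -v '^#' | grep -v '^total' | cut -f1,2 > hg38.chrom.sizes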
Or use pyfaidx, which has a built-in chromsizes output mode:
$ pip install pyfaidx
$ faidx -i chromsizes input.fa > output.chromsizes
What do you mean by "querying"? HTTP? MySQL? This is a small file; what's wrong with having a local copy? Or do you have a local copy of the FASTA sequences?
I would like to do it based on the FASTA files they provide on the FTP site.
The problem is that none of the APIs say which files these numbers are computed from.
Do you know how to do it from the FASTA files Ensembl provides here: https://m.ensembl.org/info/data/ftp/index.html