Hi all:
In order to convert BED files to bigBed using the UCSC tool (bedToBigBed), a chrom.sizes
file is needed. How can these sizes be computed without querying the UCSC and Ensembl APIs?
Try samtools
samtools faidx genome.fa
cut -f1,2 genome.fa.fai > genome.size
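As a worked example, here is the same recipe applied to one of the Ensembl FASTA dumps discussed later in this thread (the release number and file name below are illustrative; adjust them to your organism and release):
# Download a genome FASTA from the Ensembl FTP (illustrative path).
wget http://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# samtools faidx cannot read plain gzip, so decompress first (or recompress with bgzip).
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# The .fai index produced here stores the name and length of every sequence.
samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa
# Columns 1 (name) and 2 (length) of the .fai file are exactly a chrom.sizes file.
cut -f1,2 Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai > GRCh38.chrom.sizes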
~$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -N -e 'select chrom,size from chromInfo' > out.txt
$ cat out.txt
chr1 249250621
chr2 243199373
chr3 198022430
chr4 191154276
chr5 180915260
(....)
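The same query should work for any assembly hosted at UCSC by swapping the database name, e.g. for hg38:
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg38 -N -e 'select chrom,size from chromInfo' > hg38.chrom.sizes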
or just
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz" | gunzip -c | cut -f 1,2 > out.txt
These numbers are pre-computed from the FASTA genome files, e.g. for chr1:
$ curl -s "http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/chr1.fa.gz" | gunzip -c | grep -v '^>' | tr -d '\n' | wc -c
249250621
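If you want to compute the sizes for every sequence in a multi-FASTA yourself, without any indexing step, a small awk sketch does it in one pass (genome.fa is a placeholder; sequence names are assumed to be the first word after '>'):
# Print "name<TAB>length" for each sequence in genome.fa.
awk '/^>/ { if (name) print name "\t" len; name = substr($1, 2); len = 0; next }
     { len += length($0) }
     END { if (name) print name "\t" len }' genome.fa > genome.chrom.sizes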
The chrom.sizes file is computed in the following way for all assemblies at UCSC:
faToTwoBit organism.fa organism.2bit
twoBitInfo organism.2bit stdout | sort -k2rn > organism.chrom.sizes
If you know the URL to a 2bit file we've already made, twoBitInfo accepts a URL like so:
twoBitInfo -udcDir=. http://genome-test.cse.ucsc.edu/~hiram/hubs/Plants/ricCom1/ricCom1.2bit stdout | sort -k2nr > ricCom1.chrom.sizes
If you want the chrom.sizes file for a particular assembly, you can download it from a URL like the following: http://hgdownload.cse.ucsc.edu/goldenPath/$db/bigZips/$db.chrom.sizes
where $db is the assembly name like hg38, mm10, anoCar2, panTro5, etc.
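Substituting $db = hg38, for example:
curl -O http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes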
You can find the faToTwoBit and twoBitInfo programs in our list of publicly available utilities in the directory appropriate to your operating system:
http://hgdownload.soe.ucsc.edu/admin/exe/
If you have further questions about the UCSC Genome Browser or our utilities or data, feel free to send an email to one of the mailing lists below:
ChrisL from the UCSC Genome Browser
Try faCount from the UCSC Kent utils. Usage and output look like this:
$ faCount hg38.fa
#seq len A C G T N cpg
chr1 248956422 67070277 48055043 48111528 67244164 18475410 2375159
chr10 133797422 38875926 27639505 27719976 39027555 534460 1388978
chr11 135086622 39286730 27903257 27981801 39361954 552880 1333114
chr11_KI270721v1_random 100316 18375 31042 31012 19887 0 3394
.
.
.
total 3209286105 898285419 623727342 626335137 900967885 159970322 30979743
For your purpose, the first two columns would suffice.
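A sketch of extracting them into a chrom.sizes file (dropping the header and total rows; hg38.fa is a placeholder):
faCount hg38.fa | grep -v '^#' | grep -v '^total' | cut -f1,2 > hg38.chrom.sizes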
Or use pyfaidx, which has a built-in chromsizes output mode:
$ pip install pyfaidx
$ faidx -i chromsizes input.fa > output.chromsizes
What do you mean by "querying"? HTTP? MySQL? This is a small file; what's wrong with having a local copy? Or do you have a local copy of the FASTA sequences?
I would like to do it based on the FASTA files they provide on the FTP site.
The problem is that none of the APIs say which files these numbers are computed from.
Do you know how to do it from the FASTA files Ensembl provides here: https://m.ensembl.org/info/data/ftp/index.html