Karyotypically Sorted Genome Assembly
13.0 years ago
blackgore ▴ 60

How does GATK define its requirement of ‘karyotypic’ sorting? In the simple case of the UCSC human assembly, it means placing the chromosomes in numerical order 1-22, followed by X, Y and M. But where does one place the additional unlocalised contigs that are present in, for example, the Ensembl assembly? This was brought up as a comment in a related post.

assembly ensembl gatk • 5.2k views
13.0 years ago

The Broad bundles mentioned in your referenced post are already sorted and include all of the unlocalized and unplaced contigs. If you are building your own, you can download the index and examine the ordering directly:

$ wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/1.2/hg19/ucsc.hg19.fasta.fai.gz
$ gunzip ucsc.hg19.fasta.fai.gz
$ cut -f 1 ucsc.hg19.fasta.fai
[...]
chr22
chrX
chrY
chr1_gl000191_random
chr1_gl000192_random
chr4_ctg9_hap1
chr4_gl000193_random
chr4_gl000194_random
chr6_apd_hap1
chr6_cox_hap2
chr6_dbb_hap3
[...]
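
If you do build your own reference, one way to compare it with the bundle is to check that the contigs the two indexes share appear in the same relative order. The following is only a sketch: the function name in_reference_order and the file names in the usage line are made up for illustration, and it compares against the bundle's .fai rather than against any rule that GATK itself documents.

```shell
# in_reference_order REF_FAI OWN_FAI
# Exit 0 if the contigs present in both indexes appear in OWN_FAI in the same
# relative order as in REF_FAI. Otherwise print the first offender and exit 1.
# Contigs that exist only in OWN_FAI are ignored.
in_reference_order() {
    awk -F'\t' '
        NR == FNR { rank[$1] = FNR; next }      # first file: remember each position
        ($1 in rank) {
            if (rank[$1] < last) { print "out of order: " $1; bad = 1; exit }
            last = rank[$1]
        }
        END { exit bad + 0 }
    ' "$1" "$2"
}

# Hypothetical usage:
#   in_reference_order ucsc.hg19.fasta.fai my_reference.fa.fai && echo "order matches"
```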

Sorry, I missed that you were interested in the GRCh37/Ensembl naming. Broad also has a bundle for that; it's definitely worth digging around their FTP site before creating your own: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/1.2/b37/human_g1k_v37.fasta.fai.gz


Thanks Brad for the quick response. I've done as you suggested, and mapped the ordering of the files from Broad to the Ensembl release. There are more contigs listed in the Ensembl file than in the .fai file mentioned above, but it's a great place to start.
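
For anyone repeating this, the mapping could be sketched roughly as below. This is an illustration rather than the exact procedure used: order_by_reference and the file names in the usage line are hypothetical, and putting the contigs missing from the Broad index at the end is an assumption, not a rule taken from GATK.

```shell
# order_by_reference REF_FAI OWN_FAI
# Print the contig names of OWN_FAI: first those also listed in REF_FAI,
# in REF_FAI order, then the remaining ones in their original OWN_FAI order.
order_by_reference() {
    awk -F'\t' '
        NR == FNR { own[$1] = 1; own_order[++n] = $1; next }   # first file read: OWN_FAI
        ($1 in own) { print $1; seen[$1] = 1 }                 # second file read: REF_FAI
        END {
            for (i = 1; i <= n; i++)
                if (!(own_order[i] in seen)) print own_order[i]
        }
    ' "$2" "$1"
}

# Hypothetical usage:
#   order_by_reference human_g1k_v37.fasta.fai my_ensembl.fa.fai > contig_order.txt
```

The resulting list can then drive whatever tool does the actual reordering of the FASTA.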


Thanks Brad, I'd not realised that Broad supplied sets for the Ensembl release too. Thanks for your advice!

11.1 years ago

The reference FASTAs downloadable from Ensembl are not karyotypically sorted, but that can be fixed with a little bit of Perl...

Download and unzip the Build37 (hg19) reference FASTA file from Ensembl release 72:

curl -LO ftp://ftp.ensembl.org/pub/release-72/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.72.dna.primary_assembly.fa.gz

gunzip Homo_sapiens.GRCh37.72.dna.primary_assembly.fa.gz

Use this Perl one-liner that splits the FASTA file per chrom/contig into a temporary folder, and then concatenates them in karyotypic order:

perl -e 'use File::Temp qw/tempdir/; $d=tempdir(CLEANUP=>1); open($in,"<","Homo_sapiens.GRCh37.72.dna.primary_assembly.fa") or die $!; while(<$in>){if(m/^>(\S+)/){open($fh,">","$d/$1.fa") or die $!;} print $fh $_;} close($fh); foreach $c(1..22,"X","Y","MT"){print `cat $d/$c.fa`}; print `cat $d/GL*`' > Homo_sapiens.GRCh37.72.dna.primary_assembly.reordered.fa
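
To confirm the result, it may help to list the sequence names of the reordered file in the order they appear. A small sketch, where fasta_contig_order is a hypothetical helper name and the sequence name is assumed to be the text between ">" and the first whitespace on each header line:

```shell
# fasta_contig_order FASTA
# Print the sequence names of FASTA in file order.
fasta_contig_order() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Hypothetical usage:
#   fasta_contig_order Homo_sapiens.GRCh37.72.dna.primary_assembly.reordered.fa | head -n 30
```

The first 25 names should be 1 to 22, X, Y and MT, followed by the GL contigs.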