Question

Human dna reference file with no prefix 'chr'

10

Entering edit mode

10.1 years ago

mangfu100 ▴ 810

Hi.

I have usually been using hg19 reference file from ucsc.

However, ucsc has a prefix 'chr' in its files.

Of course, I am able to exclude 'chr' prefix by script programming, but I don't want to.

Instead, I need another human reference file which does not contain 'chr' prefix itself.

Do any another reference files exist not ucsc?

next-gen-sequencing • 28k views

ADD COMMENT • link updated 2.5 years ago by Cyriac Kandoth 6.1k • written 10.1 years ago by mangfu100 ▴ 810

0

Entering edit mode

The GATK bundle has what you need

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 10.1 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Is there a reason why you wanna bypass something as simple as a sed 's/chr//g' ucsc.hg19.fa >ucsc.hg19.nochr.fa?

ADD REPLY • link 5.2 years ago by Ram 44k

1

Entering edit mode

I think that just simply excluding of 'chr' would make any problem later for processing.

and I also though that users must match the reference file that was used for mapping program such as bwa with others variation calling program.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 10.0 years ago by mangfu100 ▴ 810

15

Entering edit mode

10.1 years ago

Dan D 7.4k

Why would you go through all the trouble of finding/downloading/cat'ing another reference? Getting rid of that chr designation is dead simple.

For FASTA files:

cat hg19.fa | sed 's/>chr/>/g' > hg19_new.fa

For BED files:

cat myhg19Genes.bed sed 's/^chr//' > myhg19Genes_new.bed

For BAM files:

samtools view -h hg19Alignments.bam | sed 's/chr//g' | samtools view -Shb - -o hg19Alignments_new.bam

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 10.1 years ago by Dan D 7.4k

2

Entering edit mode

For BAM files, it might be faster to modify the header with sed, write that to a file, and then use samtools reheader (there's no need to then parse everything to/from sam). Of course, that doesn't fit as nicely into a single command as your solution :)

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 10.0 years ago by Devon Ryan 104k

1

Entering edit mode

Modifying only the header is also safer, so that you don't risk truncating base quality values that happen to form the word chr (very unlikely, but still possible). But even in the header, you risk changing comment lines in the @CO tag, or file names or commands in the @PG tag, containing words like "chrom", "chromosome", or "jesus christ! why is this so complicated?!"

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 10.0 years ago by Cyriac Kandoth 6.1k

5

Entering edit mode

5.2 years ago

Max Masnick ▴ 50

For hg38/GRCh38, the answer appears to be to use the Ensembl reference sequence. This does not have the chr prefix.

The GATK reference sequence from their Resource Bundle has the chr prefix.

Note that if you are using the Ensembl reference sequence with GATK, you'll need to do some extra steps to generate .dict and .fai files. These are available for the GATK Resource Bundle FASTAs, but not for Ensembl.

Do not try to use sed or other find/replace tools to fix this problem yourself (unless you are extremely bandwidth-constrained or no option to download a properly-formatted reference FASTA). Cyriac Kandoth is absolutely right that this is a can of worms in many cases, and just not worth it when a reference file without chr is readily available.

ADD COMMENT • link 5.2 years ago by Max Masnick ▴ 50

1

Entering edit mode

Just a follow up to this question- I'm trying to transition over to GRCh38 (for everything in my pipeline; align, exome variant call, vep annotation, rna-seq, etc). I'm stuck at the GATK resource bundle (which I use for the BQSR), as it seems they only have hg38 resource bundle released (https://console.cloud.google.com/storage/browser/broad-references/hg38/v0), but this is following the ucsc format (with the "chr" prefix), as you've mentioned. From https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use, Heng suggested to use a no-alt set with hard mask for the PARs. Is this still recommended, and does the ensembl have versions of this (along with possibly also decoys)? I also see that 1000 genomes have a reference as well, that has the PARs and the decoys, but they also have the chr prefix.

ADD REPLY • link 5.1 years ago by leptonsonic ▴ 10

Ram · Accepted Answer · 2014-11-12

20

Entering edit mode

10.1 years ago

Cyriac Kandoth 6.1k

It's never as simple as "remove chr-prefix with a script". hg19 is UCSC's variant of the official GRCh37 assembly. Early releases of GRCh37 like GRCh37-lite, did not use chr-prefixes, but newer releases like GRCh37.p13 adopted the chr-prefix, and use a newer mitochondrial (MT) sequence than hg19 does. Note also how chrM in hg19 is named MT in GRCh37. And all the unplaced contigs have very different names. So simply removing the chr-prefix in hg19 does not make it GRCh37. It makes it a wholly other chromosome naming convention, which is the last thing we need right now, since everyone has agreed to use chr-prefixes in GRCh38/hg38, except Ensembl.

Update (Nov 4, 2016): Here is a UCSC Chain mapping UCSC's hg19 to Ensembl's GRCh37.p13 (no chr-prefix), compatible with tools like CrossMap, Remap, or liftOver. Notice how all chromosomes/contigs except chrM only require renaming. Users of vcf2maf with hg19 VCFs as input, can pass this into the --remap-chain argument.

ADD COMMENT • link 2.5 years ago by Cyriac Kandoth 6.1k

0

Entering edit mode

It's never as simple as "remove chr-prefix with a script". That doesn't make sense, because sometimes it really is that simple. There have been plenty of occasions where a simple find/replace is all I've needed to do to get a BED file working in tandem with a BAM file in a downstream-analysis step (after ensuring I was indeed working from the same reference).

My pedantry aside, the info you provided was very useful.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 10.0 years ago by Dan D 7.4k

0

Entering edit mode

Agreed... it's often simple to just "get something working", night before the lab meeting. ;) But it's a whole other ballgame when you're trying to develop robust bioinformatics pipelines that can handle every pedantic detail.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 10.0 years ago by Cyriac Kandoth 6.1k