"Which human reference genome should I use?" is one of the top questions arising on biostars. On the first look the answer to this seems to be quiet simple: Take the most recent major version from the primary source, which is GRCh38 and you can obtain it on the website of the Genome Reference Consortium.
But before you blindly go on and download the genome file let's clear the following questions:
- What are reasons to take an older release?
- What sequence information are included in the reference fasta?
- What about the latest minor version?
- Which steps are neccessary to prepare the reference fasta for alignment?
1. What are reasons to take an older release?
Before starting check the databases you plan to use in your experiment. Are the data available for the latest assembly? How much effort do I need to put in to move those data to the latest version if they are not available yet?
2. What sequence information are included in the reference fasta?
The files for the latest major release are located here:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38
The sequences are in GCA_000001405.15_GRCh38_genomic.fna.gz
This file contains the following sequence information:
- chromosomes 1-22, X, Y
- sequences that can be asigned to one chromosome, but not to an exact position or orientation (unlocalized sequences)
- sequences that cannot be assigned to any chromosome (unplaced sequences)
- the mitochondrial genome
- sequences that provides an alternate representation of a locus (alernate locus)
(1)-(3) build together the so-called Primary Assembly.
3. What about the latest minor version?
You may realize that there are several minor versions indicate by .p<version-number>
. The latest of these is GRCh38.p12. So why I did not link to this version?
In a minor version the sequences included in the major one stays untouched. Instead there were sequences added called patches (That's where the p
comes from). These patches could be fixes to existing sequences and will be incorporate into the primary assembly in the next major release. Or they represent new sequences, which will be called "alternate loci" in the next major release.
More about patches you can read in this FAQ.
Unless we have very good reasons for it, we only want to use the sequences of the primary assembly and the mitochondrial genome for alignment. Why? Read the next chapter.
4. Which steps are neccessary to prepare the reference fasta for alignment?
The most things I describe in this part are inspired by a blog post by Heng Li (the guy who gave us well known tools like bwa
and samtools
).
The reference sequence I've linked above is not ready-to-use for an alignment, due to this three things:
- It includes alternate representation of loci. These are highly similar to those in the primary assembly. That's why most aligners will not know where to place our reads and will give them a very low mapping quality (which is bad for variant calling).
- On the chromosome Y there is a pseudo-autosomal region (PAR), which is essentially a copy of a region located on chromosome X. Again our aligners will not know where to place the reads.
- The sequences have GenBank Accession Numbers like
CM000663.2
as names instead of1
.
The ambigous mapping described in (1) is (2) is also valid for the patch sequences. This is why we don't need to take the latest minor version for alignment.
So what we have to do is:
- Build a subset of the sequences provided by GRC
- Mask the PAR on chromosome Y
- Rename the sequences
To fulfill these taskes we need some more files
- GCA_000001405.15_GRCh38_assembly_report.txt for building the subset and renaming
- GCA_000001405.15_GRCh38_assembly_structure -> Primary_Assembly -> pseudoautosomal_region -> par_align.gff to mask PAR
Build a subset of the sequences provided by GRC
After downloading the reference sequence unpack it and index with samtools faidx
:
$ samtools faidx GCA_000001405.15_GRCh38_genomic.fna
From GCA_000001405.15_GRCh38_assembly_report.txt
we need to extract the names for the primary assemblies and the mitochondrial genome. A lot of bioinformatic tools expact that sequences are ordered. So we have to sort the current names by the sequence names we want to use later.
$ sort -k1,1V GCA_000001405.15_GRCh38_assembly_report.txt|awk -v FS="\t" '$8 == "Primary Assembly" || $8 == "non-nuclear" {print $5}' > subset_ids.txt
With this list we can build the subset of the sequences:
$ samtools faidx GCA_000001405.15_GRCh38_genomic.fna -r subset_ids.txt -o GRCh38_subset.fa
Mask the PAR on chromosome Y
(EDIT: I had to rewrite this paragraph. In the former version I told to us the par_align.gff
directly for masking. This was wrong, as this files contains the PAR on the X chromosome.)
The par_align.gff
contains the information where the PAR is located on the X chromosome and where its equivalent is located on the Y chromosome. Open the file and look for the Target
keyword in each line. The values are the position on the Y chromosome we like to mask. To do this, we must create a bed
file with theses values (Caution: gff
has 1-based positions, bed
has 0-based). In our case it's looking like this:
$ cat parY.bed
CM000686.2 10000 2781479
CM000686.2 56887902 57217415
For the lazy ones here a one-liner to get the bed
file ;)
$ sed -E 's/.*Target=([^;]+).*/\1/g' par_align.gff|awk -v OFS="\t" '$0 !~ "^#" {print $1, $2-1, $3}' > parY.bed
The masking can be done with bedtools
:
$ bedtools maskfasta -fi GRCh38_subset.fa -bed parY.bed -fo GRCh38_subset_masked.fa
Rename the sequences
Currently the header of each sequence looks like this:
>CM000663.2
What we like to have is this:
>1 CM000663.2 NC_000001.11 chr1
Everything until the first whitespace will be the id. What comes after that is just additional information.
$ awk -v FS="\t" 'NR==FNR {header[">"$5] = ">"$1" "$5" "$7" "$10; next} $0 ~ "^>" {$0 = header[$0]}1' GCA_000001405.15_GRCh38_assembly_report.txt GRCh38_subset_masked.fa > GRCh38_alignment.fa
Indexing
Now GRCh38_alignment.fa
is ready for use in alignment. I would recommend index this file with samtools faidx
, as there are a lot of tools out there which need it. Also you need to index it in a way that is described for the aligner you like use. E.g. for bwa
type: bwa index GRCh38_alignment.fa
fin swimmer
EDIT: I moved my comments to the questions by Bastien Hervé to answers, as I think they are of general interest. So they don't get lost between the comments.
The following paper may be of interest: A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification.
Thanks for the hard work finswimmer !
Would have be cool to have some words about GRC genome, Ensembl genome, NCBI genome... And talk about associated annotations which could be a bit different. Also chromsome names could be different (chr1, 1, NC_000001.11...)
Could I be wrong but, is not it dangerous to delete non-primary sequences from reference genome if you are interested in specific base information, like variant discovery. Because even if a sequence is a patch, it exists. Let's say you sequenced this patch sequence of chr6, if you remove non-primary sequences from your reference genome the read will map on the chr6 itself, which could lead to misscall some variants in the patch region.
Never heard that the sequence on chrY could be a problem, do you have some documentation related to this and what to do with this location ?
Some typo :)
Part 3) The latest of these ist
Part4) Therefor most
Fixed :)
Concerning your questions: I'm short in time now. Will answers later (if no one else does ;) ) For the PAR you could start reading on wikipedia in the meantime.
The "you could start reading on wikipedia" sound like "search on google yourself" to me :(
I don't really need information about the region, I'm just wondering why you mask this part and not all the duplicated regions of the human genome ?
Hello Bastien Hervé ,
I'm very sorry if it sounded that harsh. It wasn't meant this way. I really like your comments.
I especially mask this part because it's quite huge and it included genes (that are also of clinical interests). And in case of a female sample you can be sure that all reads have it's origin from chromosome X. So a low mapping quality because of the duplicate region on chromosome Y would be false.
fin swimmer
Maybe I'm diverging and nitpicking...
Let's imagine you got a female patient XY without SRY gene, it is very rare OK, but could append (here we are talking about gender on biostars :D)... I don't know if a karyotype is created before the acceptation of a patient. If we are not talking about human, I don't know if this event is frequent in mouse by exemple.
For full female patient OK you can mask this part of the Y (if we do not care about genetic anomaly). But for both gender or full man patient pool, if you mask the Y part, reads will map to the X PAR locus whereas half the reads could belongs to X and the other half to Y. If you don't want these reads you have to mask both location (PAR on X and Y).
Anyhow if I remember well, aligners do not output reads mapped on multiple positions, which is the case for reads in PAR locus.