Tutorial:Which human reference genome should I use?
3
55
Entering edit mode
6.1 years ago

"Which human reference genome should I use?" is one of the top questions arising on biostars. On the first look the answer to this seems to be quiet simple: Take the most recent major version from the primary source, which is GRCh38 and you can obtain it on the website of the Genome Reference Consortium.

But before you blindly go on and download the genome file let's clear the following questions:

  1. What are reasons to take an older release?
  2. What sequence information are included in the reference fasta?
  3. What about the latest minor version?
  4. Which steps are neccessary to prepare the reference fasta for alignment?

1. What are reasons to take an older release?

Before starting check the databases you plan to use in your experiment. Are the data available for the latest assembly? How much effort do I need to put in to move those data to the latest version if they are not available yet?

2. What sequence information are included in the reference fasta?

The files for the latest major release are located here: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38

The sequences are in GCA_000001405.15_GRCh38_genomic.fna.gz

This file contains the following sequence information:

  1. chromosomes 1-22, X, Y
  2. sequences that can be asigned to one chromosome, but not to an exact position or orientation (unlocalized sequences)
  3. sequences that cannot be assigned to any chromosome (unplaced sequences)
  4. the mitochondrial genome
  5. sequences that provides an alternate representation of a locus (alernate locus)

(1)-(3) build together the so-called Primary Assembly.

3. What about the latest minor version?

You may realize that there are several minor versions indicate by .p<version-number>. The latest of these is GRCh38.p12. So why I did not link to this version?

In a minor version the sequences included in the major one stays untouched. Instead there were sequences added called patches (That's where the p comes from). These patches could be fixes to existing sequences and will be incorporate into the primary assembly in the next major release. Or they represent new sequences, which will be called "alternate loci" in the next major release.

More about patches you can read in this FAQ.

Unless we have very good reasons for it, we only want to use the sequences of the primary assembly and the mitochondrial genome for alignment. Why? Read the next chapter.

4. Which steps are neccessary to prepare the reference fasta for alignment?

The most things I describe in this part are inspired by a blog post by Heng Li (the guy who gave us well known tools like bwa and samtools).

The reference sequence I've linked above is not ready-to-use for an alignment, due to this three things:

  1. It includes alternate representation of loci. These are highly similar to those in the primary assembly. That's why most aligners will not know where to place our reads and will give them a very low mapping quality (which is bad for variant calling).
  2. On the chromosome Y there is a pseudo-autosomal region (PAR), which is essentially a copy of a region located on chromosome X. Again our aligners will not know where to place the reads.
  3. The sequences have GenBank Accession Numbers like CM000663.2 as names instead of 1.

The ambigous mapping described in (1) is (2) is also valid for the patch sequences. This is why we don't need to take the latest minor version for alignment.

So what we have to do is:

  1. Build a subset of the sequences provided by GRC
  2. Mask the PAR on chromosome Y
  3. Rename the sequences

To fulfill these taskes we need some more files

Build a subset of the sequences provided by GRC

After downloading the reference sequence unpack it and index with samtools faidx:

$ samtools faidx GCA_000001405.15_GRCh38_genomic.fna

From GCA_000001405.15_GRCh38_assembly_report.txt we need to extract the names for the primary assemblies and the mitochondrial genome. A lot of bioinformatic tools expact that sequences are ordered. So we have to sort the current names by the sequence names we want to use later.

$ sort -k1,1V GCA_000001405.15_GRCh38_assembly_report.txt|awk -v FS="\t" '$8 == "Primary Assembly" || $8 == "non-nuclear" {print $5}' > subset_ids.txt

With this list we can build the subset of the sequences:

$ samtools faidx GCA_000001405.15_GRCh38_genomic.fna -r subset_ids.txt -o GRCh38_subset.fa

Mask the PAR on chromosome Y

(EDIT: I had to rewrite this paragraph. In the former version I told to us the par_align.gff directly for masking. This was wrong, as this files contains the PAR on the X chromosome.)

The par_align.gffcontains the information where the PAR is located on the X chromosome and where its equivalent is located on the Y chromosome. Open the file and look for the Target keyword in each line. The values are the position on the Y chromosome we like to mask. To do this, we must create a bed file with theses values (Caution: gff has 1-based positions, bed has 0-based). In our case it's looking like this:

$ cat parY.bed
CM000686.2  10000   2781479
CM000686.2  56887902    57217415

For the lazy ones here a one-liner to get the bedfile ;)

$ sed -E 's/.*Target=([^;]+).*/\1/g' par_align.gff|awk -v OFS="\t" '$0 !~ "^#" {print $1, $2-1, $3}'  > parY.bed

The masking can be done with bedtools:

$ bedtools maskfasta -fi GRCh38_subset.fa -bed parY.bed -fo GRCh38_subset_masked.fa 

Rename the sequences

Currently the header of each sequence looks like this:

>CM000663.2

What we like to have is this:

>1 CM000663.2 NC_000001.11 chr1

Everything until the first whitespace will be the id. What comes after that is just additional information.

$ awk -v FS="\t" 'NR==FNR {header[">"$5] = ">"$1" "$5" "$7" "$10; next} $0 ~ "^>" {$0 = header[$0]}1' GCA_000001405.15_GRCh38_assembly_report.txt GRCh38_subset_masked.fa > GRCh38_alignment.fa

Indexing

Now GRCh38_alignment.fa is ready for use in alignment. I would recommend index this file with samtools faidx, as there are a lot of tools out there which need it. Also you need to index it in a way that is described for the aligner you like use. E.g. for bwa type: bwa index GRCh38_alignment.fa

fin swimmer


EDIT: I moved my comments to the questions by Bastien Hervé to answers, as I think they are of general interest. So they don't get lost between the comments.

assembly genome alignment • 16k views
ADD COMMENT
0
Entering edit mode

Thanks for the hard work finswimmer !

Would have be cool to have some words about GRC genome, Ensembl genome, NCBI genome... And talk about associated annotations which could be a bit different. Also chromsome names could be different (chr1, 1, NC_000001.11...)

Could I be wrong but, is not it dangerous to delete non-primary sequences from reference genome if you are interested in specific base information, like variant discovery. Because even if a sequence is a patch, it exists. Let's say you sequenced this patch sequence of chr6, if you remove non-primary sequences from your reference genome the read will map on the chr6 itself, which could lead to misscall some variants in the patch region.

Never heard that the sequence on chrY could be a problem, do you have some documentation related to this and what to do with this location ?

ADD REPLY
0
Entering edit mode

Some typo :)

Part 3) The latest of these ist

Part4) Therefor most

ADD REPLY
0
Entering edit mode

Fixed :)

Concerning your questions: I'm short in time now. Will answers later (if no one else does ;) ) For the PAR you could start reading on wikipedia in the meantime.

ADD REPLY
0
Entering edit mode

The "you could start reading on wikipedia" sound like "search on google yourself" to me :(

I don't really need information about the region, I'm just wondering why you mask this part and not all the duplicated regions of the human genome ?

ADD REPLY
0
Entering edit mode

Hello Bastien Hervé ,

I'm very sorry if it sounded that harsh. It wasn't meant this way. I really like your comments.

I especially mask this part because it's quite huge and it included genes (that are also of clinical interests). And in case of a female sample you can be sure that all reads have it's origin from chromosome X. So a low mapping quality because of the duplicate region on chromosome Y would be false.

fin swimmer

ADD REPLY
0
Entering edit mode

Maybe I'm diverging and nitpicking...

Let's imagine you got a female patient XY without SRY gene, it is very rare OK, but could append (here we are talking about gender on biostars :D)... I don't know if a karyotype is created before the acceptation of a patient. If we are not talking about human, I don't know if this event is frequent in mouse by exemple.

For full female patient OK you can mask this part of the Y (if we do not care about genetic anomaly). But for both gender or full man patient pool, if you mask the Y part, reads will map to the X PAR locus whereas half the reads could belongs to X and the other half to Y. If you don't want these reads you have to mask both location (PAR on X and Y).

Anyhow if I remember well, aligners do not output reads mapped on multiple positions, which is the case for reads in PAR locus.

ADD REPLY
4
Entering edit mode
6.1 years ago

Why one shouldn't include ALT_Loci or Patch sequences in alignment reference?

As a rule of thumb: If you never heard about ALT_Loci or patch sequences, than it is very likely you don't need them :)

ALT_Loci sequences describes alternative sequence version for a specific region. There seems to be no clear definition, when it becomes an alternative (and get it's own identification number) and when it is just a variant and is therefor just reported in a variant database.

The sequences are often padded by long stretches of bases that are identical with the primary assembly. This is a problem when using a mapper/aligner that is not ALT_Loci aware. Reads will get a very low mapping quality as they can map to the primary assembly and the ALT_Loci. Saying this you are unable to do variant calling for this regions, as reads that can map to multiple position are usualy dropped.

There are ALT aware aligner out there, e.g. bwa. I've found a tutorial here. In the special case, you are interested in a loci, where you know there ALT sequences, one can use this. In the most cases this is not manageable, as there are no mainstream tools (that I be aware) for variant calling that be aware of the alternate loci.

All the things above are valid for patches as well. Here the differences to the primary assembly can be much smaller.

fin swimmer

ADD COMMENT
1
Entering edit mode

For counting I see no problem to delete the ALT_Loci and Patch sequences

For variant calling :

Reads will get a very low mapping quality as they can map to the primary assembly and the ALT_Loci. Saying this you are unable to do variant calling for this regions, as reads that can map to multiple position are usualy dropped.


Read     : AGCGTAC

Primary  : AGCCTAC

Alt_loci : AGCGTAC

If you keep alt_loci, the read will be multi mapped so it will be threw away


Read     : AGCGTAC

Primary  : AGCCTAC

If you delete alt_loci, the read will be align to the primary, leading to a variant C/G, which is not a real mutation (read is coming from Alt_loci).

I prefer throwing a read away than calling a false SNP.

ADD REPLY
1
Entering edit mode

Hello Bastien Hervé ,

what you are missing in your example is: If you're working with with ALT_loci you have to report that you've found such a loci in your sample.

So in your example the read is mapped to an ALT_Loci and have no mismatch with the reference sequence. But in your report you have to say: I've found out that in my sample the regions is represented by in ALT_loci sequence. In comparison to the REF_Loci the difference is in G on position x instead in C with the consequence ...

I've found this whole ALT_loci very interesting and currently I try to find how to use it with bwa. But unfortunately the documentation is very poor. In the end the main problem persist: You do not only need an ALT aware aligner, you also need an ALT aware variant calling and reporting pipeline.

fin swimmer

ADD REPLY
3
Entering edit mode
6.1 years ago

Compare reference genomes from different sources

I've done some comparison between the reference genome from GRC and ensembl. When doing such a comparison these are the points one should consider:

  1. Which sequences are included?
  2. How are the sequence named?
  3. Are any regions hard- (or soft) masked?

The latest reference sequences from ensembl are available here: ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/dna/

The README describes what's the difference between the several files. I choose Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

First thing one should do, is to uncompress and index the file.

$ gunzip -c Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz > Homo_sapiens.GRCh38.dna.primary_assembly.fa
$ samtools index Homo_sapiens.GRCh38.dna.primary_assembly.fa

If you now take a look into the index file, you will find out that

  • It includes the sequences for chromosome 1-22, X, Y, MT, named without the prefix chr, ordered in lexicographically order.
  • These sequences ar followed by unplaced and unlocalized sequences named with their genebank accession number.

To check, wether there are sequence changes between the sequences provided by GRC und ensembl, I first sorted the sequences by sequence length and run than diff on them. The results are:

  • All bases in the ensembl file have capitial letters, whereas in the GRC file repeative sequences are soft-masked (meaning have lowercase).
  • The PAR on chromsome Y is hard-masked (meaning have N's)
  • Semi-ambiguous IUB codes were converted to “N” (these are very few, so this is a very minor issue)

In summary: The Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz provided by ensembl fulfill all criteria for a reference sequence for alignment I described in my initial post.

fin swimmer

ADD COMMENT
0
Entering edit mode

Thanks finswimmer for the add on of the Ensembl reference genome.

The result of the diff command is interesting but i'm still not convinced by the mask of the PAR of the Y chromosome. If you want to do a counting on the PAR of the X chromosome for a male patient your results will be wrong. The real reads from the PAR of the Y chromosome will be force to align on the PAR of the X chromosome, which can lead to false positive counts

Also, I'm interested in your opinion about my second point (deleting non-primary sequences for SNP discovery) :)

ADD REPLY
1
Entering edit mode

Hey,

I don't forget your other questions. I'm working on it to provide the best possible answer ;)

ADD REPLY
3
Entering edit mode
6.1 years ago

Why one must masked the PAR region for alignment?

Let's do some biology to understand this. During meiosis there is a process call crossing over. During this parts of homologues chromosomes are exchange between each other. But what about the case of X and Y chromosome as they don't build a regular pair? Crossing over happens here as well! The regions where this happens are called pseudoautosomal region (PAR). "pseudoautosomal" because these regions are located on the gonosomes, but follow the same mendelian rules like autosoms.

Let's come back to our reference genome. This assumes a haploid genome. We don't have sequences for each single chromosome in our diploid genome, that we usually sequence. After alignment it is impossible (or not easy) to differ from which homologues chromosome a read comes from. The PAR regions behave like homologues chromsomes during meiosis. That's why we have to handle this regions during mapping in the same way, meaning providing just one sequence location. So we hard masked the region on chromosome Y, forcing all reads get mapped to chromsome X and get a good mapping quality, which is neccessary for variant calling.

But how could we differ if a variant within the PAR is located on X or Y? The answer is: We cannot! Even with sanger sequencing this is not possible. And in most cases this will not be neccessary, as variants in these regions behave just like any other variant on the autosoms. If we really need to know the location, we had to sequence the parents and let's hope, that the variant is derived from the mother or the father.

In the blog by Heng Li I've linked to in the OP, Heng Li give the hint, that there are also alpha satellits that behave like PAR, meaning that there are crossing over events between non-homologues chromsomes. I find it quite hard to find information about this. There is a file in the seqs_for_alignment_pipelines.ucsc_ids folder by GRCh38 in which are coordinates for these regions. But I didn't find out where these comes from. So if anybody knows, please tell us.

To summarize: PAR is not a simply duplication of a region. Instead PAR can have multiple locations. To be able to do variant calling, we have to make sure to provide only one location as a reference sequence.

fin swimmer

ADD COMMENT
0
Entering edit mode

Thanks for the detailed answer ! As I can understand masking PAR region of Y chromosome for variant calling does not lead to false detection. For counting, using the fact that these regions are similar, if you do not mask PAR region on Y you will have bad alignment on these regions (multi mapping). If you mask PAR region on Y, you will have the count of PAR X + PAR Y, so the count result will be not relevant... In conclusion in anyway (masked or not) you cannot do a proper counting on these PAR regions. Correct ? Thanks a lot for your time :)

ADD REPLY

Login before adding your answer.

Traffic: 2044 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6