Question

Why the CEU trio's parents NA12891 and NA12892 are not in 1000 Genomes phase 3

9.4 years ago

flow.genet ▴ 20

Hello everyone.

The 1000 Genomes Project released great data. I am trying to understand why the parents NA12891 and NA12892 of the "famous" CEU trio are not in the 1000 Genomes phase 3 panel (not even in the VCF file containing the 31 related individuals that were set apart).

In contrast, their child NA12878 and the full YRI trio (NA19238, NA19239, NA19240) are included in phase 3.

To my understanding, this is because the YRI trio has both low and high coverage data, which is not the case for the CEU parents (high coverage only). Is that correct? And why was this choice made?

Is it "safe" to merge the high coverage data for NA12891 and NA12892 into the phase 3 VCF file?

Thank you

sequencing genome trio 1000Genomes • 7.0k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.4 years ago by flow.genet ▴ 20

Laura, I am sorry that I am seeing your message only now.

Thanks a lot for your thorough answer, it is very enlightening, and for the link to the Platinum Genomes data.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 9.3 years ago by flow.genet ▴ 20

Hello lh3, thanks for your answer.

NA19240 is part of the "related samples" dataset (sample text file) and can be found here, which is not the case for NA12891 and NA12892.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 9.4 years ago by flow.genet ▴ 20

As I remember, these are samples included by mistake. They were identified and removed later. 1000G didn't want to include them in the first place.

ADD REPLY • link 9.4 years ago by lh3 33k

OK, thanks. Then it seems odd to me that they included the parents for one of the trios (YRI) and the child for the other trio (CEU), instead of the parents for both trios.

ADD REPLY • link 9.4 years ago by flow.genet ▴ 20

I have not followed how this decision was made, but it is easy to make a guess. NA12878 was included as a control, so her parents were excluded at an early stage. The YRI trio was included by mistake, so they removed the daughter to retain more data.

ADD REPLY • link 9.4 years ago by lh3 33k

Why would they not want to include them?

ADD REPLY • link 7.8 years ago by jnowacki ▴ 110

See this answer: A: Why the CEU trio's parents NA12891 and NA12892 are not in 1000 Genomes phase 3

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

Ram · Answer 1 · 2015-09-11

9.4 years ago

lh3 33k

No, NA19240 is not part of phase3. 1000G includes unrelated individuals. If the child is included, the parents will be excluded, and vice versa.
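A quick way to check which samples are listed in a given panel is to compare IDs against the panel's sample column. The sketch below is only illustrative: the function name `samples_in_panel` is hypothetical, the panel layout (one header line, sample ID in the first whitespace-separated column) is an assumption, and the panel text is toy data, not the real file.

```python
def samples_in_panel(panel_text, sample_ids):
    """Return {sample_id: True/False} for membership in a panel listing."""
    lines = panel_text.strip().splitlines()
    # Assumption: the first line is a header and the first column is the sample ID.
    listed = {line.split()[0] for line in lines[1:] if line.strip()}
    return {sid: sid in listed for sid in sample_ids}


# Toy panel text in the assumed layout (illustrative only, not the real panel file).
toy_panel = (
    "sample\tpop\tsuper_pop\tgender\n"
    "NA12878\tCEU\tEUR\tfemale\n"
    "NA19238\tYRI\tAFR\tfemale\n"
    "NA19239\tYRI\tAFR\tmale\n"
)

result = samples_in_panel(toy_panel, ["NA12878", "NA12891", "NA12892", "NA19240"])
for sample_id, present in result.items():
    print(sample_id, "listed" if present else "not listed")
```

Running the same check against the actual panel and related-samples lists would show directly which trio members ended up in which set.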

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.4 years ago by lh3 33k

Ram · Answer 2 · 2015-09-12

So this is where a bit of history of the different sequencing projects done for the 1000 Genomes Project is useful.

First, it is important to remember that the project was aiming to cover as much normal variation as possible, so it was almost entirely sequencing unrelated individuals. Cell lines were created for children of trios and duos, but for the most part they have only been genotyped rather than sequenced.

The pilot project used three different sequencing strategies: high coverage (where the CEU and YRI trios were sequenced), low coverage, and exon targeted (this also included both trios).

Phase 1 was just low coverage and whole exome sequencing. Neither trio was sequenced in phase 1; here we were aiming for unrelated individuals (some cryptic relations were discovered after the sequencing was done).

Phase 3 did include 3 trios at the sequencing stage (YRI was one of these). We also included a subsampled version of NA12878 (subsampled from PCR-free high coverage data) for validation purposes. When making the final release, we excluded all samples that were at least 2nd degree relations of other samples in the set; this was to avoid double counting alleles when calculating allele frequencies. As already pointed out, we put these additional samples in the related_samples directory on the FTP site: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/related_samples_vcf/
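To see why related samples distort allele frequencies, here is a minimal sketch with made-up genotypes at a single site. The sample names, genotypes, and the helper `alt_allele_frequency` are all hypothetical and only illustrate the double-counting effect; this is not how the project computed its frequencies.

```python
def alt_allele_frequency(genotypes, samples):
    """Alt allele frequency over the given samples.

    genotypes maps sample ID -> tuple of two alleles (0 = ref, 1 = alt).
    """
    alleles = [allele for sample in samples for allele in genotypes[sample]]
    return sum(alleles) / len(alleles)


# Hypothetical genotypes at one site: a trio plus two unrelated individuals.
genotypes = {
    "mother": (0, 1),
    "father": (0, 0),
    "child": (0, 1),  # the alt allele here is a copy of the mother's
    "unrel_1": (0, 0),
    "unrel_2": (0, 0),
}

unrelated_only = ["mother", "father", "unrel_1", "unrel_2"]
with_child = unrelated_only + ["child"]

af_unrelated = alt_allele_frequency(genotypes, unrelated_only)  # 1 alt out of 8 alleles
af_with_child = alt_allele_frequency(genotypes, with_child)     # 2 alt out of 10 alleles

print(af_unrelated, af_with_child)
```

The child's alt allele is the same inherited copy already counted in the mother, so adding the child raises the estimate without adding an independent observation.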

It is worth noting that it is a bad idea to use the original pilot high coverage data; this was generally sequenced in 2008 and is 36bp data from GAIIs. If you want high quality data for the CEU trio, it is a good idea to look at the Illumina Platinum Genomes data:

https://www.illumina.com/platinumgenomes.html