So this is where a bit of history of the different sequencing projects done for the 1000 Genomes Project is useful.
First, it is important to remember that the project aimed to cover as much normal variation as possible, so it almost entirely sequenced unrelated individuals. Cell lines were created for the children of trios and duos, but for the most part they have only been genotyped rather than sequenced.
The pilot project used three different sequencing strategies: high coverage (where the CEU and YRI trios were sequenced), low coverage, and exon targeted (which also included both trios).
Phase 1 was just low coverage and whole exome sequencing. Neither trio was sequenced in phase 1; here we were aiming for unrelated individuals (although some cryptic relations were discovered after the sequencing was done).
Phase 3 did include 3 trios at the sequencing stage (YRI was one of these). We also included a subsampled version of NA12878 (subsampled from PCR-free high coverage data) for validation purposes. When making the final release we excluded all samples that were at least 2nd degree relations of other samples in the set, to avoid double counting alleles when calculating allele frequencies. As already pointed out, we put these additional samples in the related_samples directory on the FTP site: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/related_samples_vcf/
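To make the double-counting point concrete, here is a minimal sketch of how a child in the sample set can shift an allele frequency estimate. The genotypes, the sample labels, and the helper `alt_allele_frequency` are all made up for illustration; this is not the pipeline used for the release.

```python
from typing import Dict, Iterable, Tuple


def alt_allele_frequency(
    genotypes: Dict[str, Tuple[int, int]],
    exclude: Iterable[str] = (),
) -> float:
    """Alt allele frequency at one biallelic site, skipping samples in `exclude`.

    Genotypes are pairs of 0 (reference) / 1 (alternate) alleles.
    """
    excluded = set(exclude)
    alt_count = 0
    allele_count = 0
    for sample, (a, b) in genotypes.items():
        if sample in excluded:
            continue
        alt_count += a + b
        allele_count += 2
    if allele_count == 0:
        raise ValueError("no samples left after exclusion")
    return alt_count / allele_count


# Toy data (hypothetical, not real 1000 Genomes calls).
genotypes = {
    "parent_1": (0, 1),
    "parent_2": (0, 0),
    "child": (0, 1),      # carries the same alt allele inherited from parent_1
    "unrelated": (0, 0),
}

# Counting the child counts the parent's alt allele a second time.
freq_all = alt_allele_frequency(genotypes)
# Dropping the child leaves only independent chromosomes.
freq_unrelated = alt_allele_frequency(genotypes, exclude={"child"})

print(freq_all, freq_unrelated)
```

In this toy case the estimate drops from 2/8 to 1/6 once the child is removed, because the child's alt allele is a copy of one already counted in a parent.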
It is worth noting that it is a bad idea to use the original pilot high coverage data: it was generally sequenced in 2008 and is 36bp data from GAIIs. If you want high quality data for the CEU trio, it is a good idea to look at the Illumina Platinum Genomes data:
https://www.illumina.com/platinumgenomes.html
Laura, I am sorry that I am seeing your message only now.
Thanks a lot for your thorough answer, it is very enlightening, and for the link to the Platinum Genomes data.
Hello lh3, thanks for your answer.
NA19240 is part of the "related samples" dataset (sample text file) and can be found here, which is not the case for NA12891 and NA12892.
As I remember, these are samples included by mistake. They were identified and removed later. 1000G didn't want to include them in the first place.
Ok, thanks. Then it seems weird to me that they included the parents for one of the trios (YRI) and the child for the other trio (CEU), instead of the parents for both trios.
I have not followed how this decision was made, but it is easy to make a guess. NA12878 was included as a control, so her parents were excluded at an early stage. The YRI trio was included by mistake, so they removed the daughter to retain more data.
Why would they not want to include them?
See this answer: A: Why the CEU trio's parents NA12891 and NA12892 are not in 1000 Genomes phase 3