Question

1000 genomes technical data: match exon capture probes to samples

9.3 years ago

eric.kern13 ▴ 240

I am studying how capture probe properties affect read depth in targeted exon sequencing. I am using data from the 1000 Genomes Project. My specific question: which exome sequencing/exon capture BAM files were generated using Nimblegen probes, and which used Agilent probes?

I've read their FTP tutorial, their paper, and the supplementary info, and I've spent a lot of time on the FTP site. The closest I found was this README, which says which centers use which probe sets. (I can't find which centers made which BAMs, though.)

In general, switching between their paper, supplementary materials, README files, BAS files and tutorials makes it easy to leave gaps. This is a second-priority question, but in the future, is there a single resource that I can go to for technical questions about the 1000 Genomes project?

Thanks for your help.

next-gen • 2.5k views

This FAQ page provides a little more detail on the exome capture than the README you linked to.

Reply written 9.3 years ago by donfreed ★ 1.6k

Ram · Answer 1 · 2015-08-13

9.3 years ago

Adam ★ 1.0k

I suggest you email info@1000genomes.org. They are pretty responsive.

Thanks. I did that first. I wasn't sure how long it would take, so I posted here too.

Reply written 9.3 years ago by eric.kern13 ▴ 240

Ram · Answer 2 · 2015-08-15

In case anyone stumbles on this post, here is the impressively detailed answer from Holly Zheng-Bradley at 1000G. This paper may also be worth a look for people using 1000G data.

The 1000 Genomes Project exome sequence data were created by different sequencing centres using different exome pulldown platforms. Listed below are the centre abbreviations and the pulldown platform each centre used:

-- BGI: NimbleGen v1 2.1M_Human_Exome
-- BI/WUGSC: Agilent SureSelect_All_Exon_V2

To look for exome BAM files made from data created with a specific pulldown platform, you may use our latest sequence index file as a starting point and look for samples that have exome data produced by a specific sequencing centre. If all exome data for a sample were produced by one centre, we know the exome BAM file for that sample is based on data from the pulldown platform used by that centre.

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/20130502.phase3.analysis.sequence.index

Run a command like the one below:
$ less 20130502.phase3.analysis.sequence.index | grep exome | cut -f3,5,6,10,13,26 | sort -u | sort -k4 | less

........
ERR047782 1000 Genomes ACB exome sequencing BGI HG01990 ILLUMINA exome
ERR047783 1000 Genomes ACB exome sequencing BGI HG01990 ILLUMINA exome
ERR047784 1000 Genomes ACB exome sequencing BGI HG01990 ILLUMINA exome
........

Using HG01990 as an example, you can see that sample HG01990 was exome sequenced by BGI only, so the pulldown platform is NimbleGen v1 2.1M_Human_Exome. Of course, you need to make sure HG01990 was not exome sequenced by other centres (which shouldn't happen), because our sample-level BAMs are made by combining all available exome runs.

To get the exome BAM file for HG01990, you look into our alignment index file:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/20130502.phase3.exome.alignment.index
$ grep HG01990 20130502.phase3.exome.alignment.index | cut -f1
data/HG01990/exome_alignment/HG01990.chrom11.ILLUMINA.bwa.ACB.exome.20121211.bam
data/HG01990/exome_alignment/HG01990.chrom20.ILLUMINA.bwa.ACB.exome.20121211.bam
data/HG01990/exome_alignment/HG01990.mapped.ILLUMINA.bwa.ACB.exome.20121211.bam
data/HG01990/exome_alignment/HG01990.unmapped.ILLUMINA.bwa.ACB.exome.20121211.bam
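
For more than a handful of samples, the grep above can be replaced by a single pass over the alignment index. This is a sketch only, assuming column 1 of the index always has the shape data/SAMPLE/exome_alignment/FILE.bam as in the output above; the function name bams_for_samples and the sample list file are hypothetical.

```shell
# Sketch: print the exome BAM paths for every sample ID in a list.
# Assumption: column 1 of the alignment index looks like
# data/<SAMPLE>/exome_alignment/<file>.bam
bams_for_samples() {
    # $1: file with one sample ID per line (only the first tab-separated field is used)
    # $2: local copy of the exome alignment index
    awk -F'\t' '
        NR == FNR { wanted[$1] = 1; next }   # first file: remember the sample IDs
        {
            split($1, part, "/")             # part[2] is the sample directory
            if (part[2] in wanted) print $1
        }
    ' "$1" "$2"
}
```

Called as bams_for_samples samples.txt 20130502.phase3.exome.alignment.index, this prints all BAM paths (per-chromosome, mapped, and unmapped) for the listed samples. Piping the result through grep '\.mapped\.' would keep only the whole-sample mapped files.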
Some additional information: for downstream analysis, instead of using a separate pulldown BED file for each platform, the project used an Exome Project consensus BED file created by taking the union of the capture design files used by the different production centres (BGI, BI, and WUGSC) and CCDS. This version was built on the GRCh37.1 (NCBI HG19) reference sequence.