1, Download the files as VCF.gz (and tab-indices)

Question

Tutorial:Produce PCA bi-plot for 1000 Genomes Phase III in VCF format (old)

63

Entering edit mode

7.7 years ago

Kevin Blighe 89k

NB - Update July 29, 2020 - this thread will no longer be watched and, for all intents and purposes, will now be archived

NB - Version 2 of tutorial can be found here and should be used going forward --> Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

My own process for producing a PCA bi-plot from the 1000 Genomes Phase III is below. In R, you'll have to sort out the plot colours yourself. The PED file contains sample ID to ethnicity mappings.

The protocol changes if you want to merge your own data to the 1000 Genomes. I've done this many times and have developed my own predictive ethnic model (>99% sensitivity).

NB - requires at least plink 1.9

1, Download the files as VCF.gz (and tab-indices)

data no longer available

2, Download 1000 Genomes PED file

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped ;

3, Convert the 1000 Genomes files to BCF

Ensure that multi-allelic calls are split (1st part)
Left-align indels (2nd part)
Set the ID field to a unique value: CHROM:POS:REF:ALT (3rd part)

-I +'%CHROM:%POS:%REF:%ALT' means that unset IDs will be set to CHROM:POS:REF:ALT

-x ID -I +'%CHROM:%POS:%REF:%ALT' first erases the current ID and then sets it to CHROM:POS:REF:ALT

 for chr in {1..22}; do
    bcftools norm -m-any ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \
    bcftools norm --check-ref w -f human_g1k_v37.fasta | \
    bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' > ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;

    bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;
 done

4, Convert the BCF files to PLINK format

for chr in {1..22}; do
    plink --noweb --bcf ALL.chr$chr.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf \
    --keep-allele-order --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed \
    --out ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;
done

5, Exclude variants not on the coding strand

NB - This step is only for microarray studies where the probes may only target one strand or the other (sense or non-sense)

6, Identify and remove any duplicates (by ID and REF / ALT encoding)

If you kept rs IDs in your input files (step 3), then duplicates (yes, they exist in 1000 Genomes and dbSNP!) will be found. If you change the ID to CHROM:POS:REF:ALT, then no duplicates will be found.

mkdir DupsRemoved ;

for chr in {1..22}; do
    plink --noweb --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes \
    --list-duplicate-vars ids-only ;

    plink --noweb --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes \
    --exclude plink.dupvar --make-bed \
    --out DupsRemoved/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;

    rm plink.dupvar ;
done

7, Prune variants from each chromosome

--maf 0.10, only retain SNPs with MAF greater than 10%
--indep [window size] [step size/variant count)] [Variance inflation factor (VIF) threshold]

e.g. indep 50 5 1.5, Generates a list of markers in approx. linkage equilibrium - takes 50 SNPs at a time and then shifts by 5 for the window. VIF (1/(1-r^2)) is the cut-off for linkage disequilibrium

mkdir Pruned ;

for chr in {1..22}; do
    plink --noweb --bfile DupsRemoved/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes \
    --maf 0.10 --indep 50 5 1.5 \
    --out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;

    plink --noweb --bfile DupsRemoved/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes \
    --extract Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.prune.in --make-bed \
    --out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;
done

8, Get a list of all PLINK files

find . -name "*.bim" | grep -e "Pruned" > ForMerge.list ;

sed -i 's/.bim//g' ForMerge.list ;

9, Merge all projects into a single PLINK file

plink --merge-list ForMerge.list --out Merge ;

10, Perform MDS (10 dimensions) and PCA

plink --bfile Merge --cluster --mds-plot 10

plink --bfile Merge --pca

11, Generate plots in R

options(scipen=100, digits=3)

#Read in the eigenvectors
eigen <- data.frame(read.table("plink.eigenvec", header=FALSE, skip=0, sep=" "))
rownames(eigen) <- eigen[,2]
eigen <- eigen[,3:ncol(eigen)]

summary(eigen)

#Determine the proportion of variance of each component
proportionvariances <- ((apply(eigen, 1, sd)^2) / (sum(apply(eigen, 1, sd)^2)))*100

plot(eigen[,1], eigen[,2])

Sample image

[need to manage colours and layout yourself]

1000genomes PLINK PCA VCF • 31k views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 7.7 years ago by Kevin Blighe 89k

1

Entering edit mode

Can you please show, how to match the sample IDs on a vcf file to IDs on a bed file?

bed 55 29 19 40 58 28 42 45
vcf 42_42 58_58 40_40 45_45 55_55 19_19 28_28 29_29

thanks!

ADD REPLY • link 7.7 years ago by ATCG ▴ 400

1

Entering edit mode

That is a great question.

When you convert the VCF or BCF file to a PLINK BED file, PLINK does not always maintain the same ordering of samples as they appear in your VCF/BCF. It is good practice to always control the ordering using the --indiv-sort file parameter. You will also require a sample order file, which is just at list of the sample IDs as you would like them to be ordered:

plink --noweb --bcf MyData.bcf --keep-allele-order --indiv-sort file MySampleOrder.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out MyData ;

...where MySampleOrder.list is just a list of sample IDs in your order of preference:

HG01879
HG01880
HG01881
HG01882
HG01883
HG01888
HG01884
HG01885
HG01956
HG01887
HG02014
HG01889
HG01890
HG01891
HG01894
HG01895
HG01896
HG01897
...
HG02013
HG01985

Further information here: https://www.cog-genomics.org/plink/1.9/data#indiv_sort

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Thank you so much!! I ended up doing this in R, and it was a lot of work. Will follow your recommendation in the future.

ADD REPLY • link 7.7 years ago by ATCG ▴ 400

0

Entering edit mode

Looks nice! Thanks for posting. Could I ask you to please include a PCA figure directly in your post?

ADD REPLY • link 7.7 years ago by Kamil ★ 2.3k

0

Entering edit mode

Thanks for the suggestion. I have now added a sample figure plotting component 1 versus 2 and component 1 versus 3. The user will have to organise the colours and plot layout themselves!

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi, May I ask please, why is it necessary to filter by --maf 0.10, only retain SNPs with MAF greater than 10%? If you could point me to a good literary source I'd be grateful. I'm working with GTEx data and need to have PCs to be used as covariates in linear models. Thanks!

ADD REPLY • link 7.3 years ago by ATCG ▴ 400

1

Entering edit mode

If we just retain variants with low MAF, then every population (and sample) will be different because, by their very nature, low frequency variants are only present in a low number of samples. The algorithm 'fails' to find adequate relationships and then every sample is segregated from every other sample.

We want variants that are 'common' enough such that they are present across all groups, and we are interested in seeing how their genotypes change. Thus, the algorithm can find relationships and the different population groups become grouped together.

ADD REPLY • link 7.3 years ago by Kevin Blighe 89k

0

Entering edit mode

This was very helpful, thank you very much!

ADD REPLY • link 7.3 years ago by ATCG ▴ 400

0

Entering edit mode

How should one make use of the proportionvariances variable here? It seems that each sample has a corresponding value how can one summarize/interpret this result? Is it possible to plot? I have 200 samples.

ADD REPLY • link 7.3 years ago by ATCG ▴ 400

0

Entering edit mode

Thank you for the tutorial and it is really helpful. I would like to merge my data (in plink format) with the 1000 Genomes data and plot PCA. Should I merge the data after the plink format generated for 1000G? I would be grateful for your advice on this.

ADD REPLY • link 7.1 years ago by lw537 ▴ 10

1

Entering edit mode

Hello, you're welcome. Previously, I have merged them after they have been converted to PLINK format and also after I had done some 'pruning' of variants independently on each dataset. I do not see why merging them prior to PLINK would make much difference; however, there would be differences when performing the pruning steps due to the allele frequency cut-offs.

It's certainly not a standardised pipeline, so, there are various ways of doing it.

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

1

Entering edit mode

Thank you very much for your advice.

ADD REPLY • link 7.1 years ago by lw537 ▴ 10

0

Entering edit mode

Thanks for the great tutorial, it was very helpful!

Since I have microarray data (Illumina) that I need to analyze: could you give any more information how you would approach the step Exclude variants not on the coding strand?

ADD REPLY • link 7.1 years ago by Alex • 0

1

Entering edit mode

I see! In that case, what you need to do is the following:

download the probe information for your microarray (you'll find it on the vendor's website, usually in TSV format)
obtain a list (as a single column text file) of probes that are genotyped on the coding strand (+)
use this list to filter include only these probes in your dataset. This can be done with the --extract PLINK command line parameter (see more info Here)
proceed as normal from there

Kevin

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

0

Entering edit mode

Thanks for your quick reply!

I would have one more question if you would humor me - since I am still kind of coping with the basics, i'm afraid it's a little basic.

Why do you constrain the data only to the coding strand? In Illumina GSA this would be entries with "+" in the 'RefStrand' column in the Manifest file? There is also the "IlmnStrand" column, which has TOP, BOT, PLUS, MINUS as values. As far as I understand it, the Illumina annotations for TOP, BOT strand (https://www.illumina.com/documents/products/technotes/technote_topbot.pdf) the RefStrand together with TOP, BOT indicate only with respect to which strand the SNP is given. Wouldn't I be throwing out more than half the SNPs in this way?

ADD REPLY • link 7.1 years ago by Alex • 0

1

Entering edit mode

I don't know of the specifics of the illumina annotation, i.e., which exact column it is - sorry. Each vendor produces different files. The reason why we do this is purely to match with the 1000 Genomes data, whose variants are all reported on the 'forward' strand. For your typical GWAS, you would not obviously throw away half of your dataset.

I put 'forward' in apostrophes here because there is confusion, generally, about how to interpret forward in relation to coding and non-coding.

Here is what it actually says on the 1000 Genomes pages:

All the variants in both our VCF files and on the browser are always reported on the forward strand.

Note the discrepancies, though: https://www.cell.com/trends/genetics/pdf/S0168-9525(12)00070-4.pdf

A useful idea is to try it with the non-coding strand variants included, and I'm sure that the results will be entirely unexpected when you view your PCA bi-plot

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

0

Entering edit mode

Ha, the paper is an amazing reference for this topic, thanks for your help!

By the way, I am using Snakemake to adapt/implement your tutorial above - I don't think you are, but if you're doing these workflows without using any workflow tool, check it out: http://snakemake.readthedocs.io/en/stable/tutorial/basics.html

ADD REPLY • link 7.1 years ago by Alex • 0

0

Entering edit mode

Thanks - I'm a computer scientist and biologist who has transitioned into bioinformatics. I have not used Snakemake but understand that it's popular!

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi again Kevin, I have yet another question: On the final merge step (using data from the 1000g GRCH38 liftover release) I encountered an error due to multiallelic sites

Error: 7650 variants with 3+ alleles present

I tried

filtering to biallelic SNPs only in bcftools before anything in plink (using bcftools view -m2 -M2 -v snps file.vcf.gz)
Trying flip as explained in the plink manual
Trying to exclude those sites also as explained in the plink manual

Only the final step worked to get me a merge.

Has this ever occurred to you? Any idea why the bcf filtering could leave residual multiallelic?

ADD REPLY • link 7.0 years ago by Alex • 0

0

Entering edit mode

Hello again. Yes, this usually occurs. I probably should have added a note to the tutorial but I believe that I was at the limit in terms of characters per single post (6,000).

From my experience, the final step in your list is also the one that works. You could split these multi-allelic sites into multiple records with bcftools norm -m-any, but then you'd have another issue of having duplicate records, provided you used dbSNP rs IDs as the ID field. Note also that the same rs ID can refer to different sites (yes, this is true and dbSNP is aware of it).

A 'better' approach is to set the IF field in your VCF to a truly unique value by first splitting multi-allelic sites and then using bcftools annotate to set a unique value (to the ID field).

ADD REPLY • link 7.0 years ago by Kevin Blighe 89k

0

Entering edit mode

NOTE 22nd July 2018:

This tutorial has been updated (and tested) due to he fact that the 1000 Genomes data is no longer hosted at the University of Washington. The tutorial now downloads data from the EBI, Hinxton, Cambridgeshire, England.

ADD REPLY • link 6.8 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin,

Thanks for wonderful tutorials !! Can you please explain a bit about your tutorial for merging our own data with Thousand Genome Data and doing PCA analysis. Maybe you can explain a bit about your own "have developed my own predictive ethnic model" as well.

Thanks

ADD REPLY • link 6.8 years ago by always_learning ★ 1.1k

0

Entering edit mode

Hey, I provide some further details on the predictive modelling here: A: How to predict individual ethnicity information by using hapmap data It is something that requires refinement.

For merging your own data with the 1000 Genomes (I have done this a few times), you could process your own data as per the above tutorial, and then just merge as normal with all of the 1000 Genomes data. You will likely encounter some small issues along the way.

ADD REPLY • link 6.8 years ago by Kevin Blighe 89k

0

Entering edit mode

I'm getting an error message saying the options entered were unused options. Any idea what went wrong? The command I used and the result are below. Thank you!

$ plink --noweb --bcf ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.bcf --keep-allele-order --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.plinkOut

    @----------------------------------------------------------@
    |        PLINK!       |     v1.07      |   10/Aug/2009     |
    |----------------------------------------------------------|
    |  (C) 2009 Shaun Purcell, GNU General Public License, v2  |
    |----------------------------------------------------------|
    |  For documentation, citation & bug-report instructions:  |
    |        http://pngu.mgh.harvard.edu/purcell/plink/        |
    @----------------------------------------------------------@

    Skipping web check... [ --noweb ] 
    Writing this text to log file [ ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.plinkOut.log ]
    Analysis started: Wed Aug 22 10:11:11 2018

    ** Unused command line option: --bcf 
    ** Unused command line option: ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.bcf 
    ** Unused command line option: --vcf-idspace-to 
    ** Unused command line option: _ 
    ** Unused command line option: --const-fid 
    ** Unused command line option: --allow-extra-chr 
    ** Unused command line option: 0 
    ** Unused command line option: --split-x 
    ** Unused command line option: b37 
    ** Unused command line option: no-fail 

    ERROR: Problem parsing the command line arguments.

ADD REPLY • link 6.7 years ago by mbk0asis ▴ 700

0

Entering edit mode

Oh! You have to use plink v1.9: https://www.cog-genomics.org/plink2

ADD REPLY • link 6.7 years ago by Kevin Blighe 89k

0

Entering edit mode

I thought I should avoid to use plink2 since they said it was in beta testing on the old home page, and missed your note on the top. Thank you!

ADD REPLY • link 6.7 years ago by mbk0asis ▴ 700

0

Entering edit mode

Yes, it is still in development, technically, but I'm aware that the developer has very limited time. I have re-run the above pipeline many times and it is robust. Some parts, like colouring the samples in the final plot, you have to do yourself, e.g., in R Programming.

ADD REPLY • link 6.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin,

In step 3, when running the first three lines of code to produce the bcf file just for one chromosome ex:

./bcftools norm -m -any ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \
./bcftools norm --check-ref w -f human_g1k_v37.fasta | \
./bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' > ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf

the terminal is taking an extremely long time and is not completing the file creation (stops around 21 MB). How long should we expect this function to run on all 22 files? I am afraid I am not calling the bcftools correctly or do not have all the necessary files/plugins in the same path or folder as the files I am using to create the bcf files.

Thank you!

ADD REPLY • link updated 6.7 years ago by andrew.j.skelton73 6.6k • written 6.7 years ago by elayna.kirsch • 0

0

Entering edit mode

Can you share the specifications of your computer? Also, which version of BCFtools?

ADD REPLY • link 6.7 years ago by Kevin Blighe 89k

0

Entering edit mode

My computer is a macOS High Sierra, with a processor of 2.4 GHZ inter core i5 and memory of 8GB 1600 MHz DDR3. When I call BCFtools in the terminal, it says: Version 1.9-37-g90cca6a (using htslib 1.9-20-ge1650a9). I appreciate your help!

ADD REPLY • link 6.7 years ago by elayna.kirsch • 0

0

Entering edit mode

I have 16Gb on my laptop, but I would have thought that 8GB was fine. Can you try to skip the middle part (./bcftools norm --check-ref w -f human_g1k_v37.fasta |), which is not key.

It completes for other chromosomes, I presume?

ADD REPLY • link 6.7 years ago by Kevin Blighe 89k

0

Entering edit mode

I am just trying to run the code on chromosome 1 for now to see if the functions will work/how long it takes. When I remove the middle part and do not yet index the bcf file, it is still getting stuck in creating the file. This is what the terminal looks like:

Laynies-MacBook-Pro:bin layniekirsch$ ./bcftools norm -m-any ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \
> ./bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' > ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf

It remains like this and does not complete the job for hours. It is creating the file but appears to have a hard time completing the job. When downloading BCFtools, the bcftools and htslib are in different folders. I copied/pasted the executables in the same path as my VCF files, but perhaps there are other files that also need to be located in the same folder in order for the function to work properly?

ADD REPLY • link 6.7 years ago by elayna.kirsch • 0

0

Entering edit mode

Okay, it will always take a long time for chr1 - I am currently testing it on my own laptop. In fact, today, I am going to test the entire tutorial again because the original raw data that I was using was moved from the Washington servers. I can already see that this data (from ftp://ftp.1000genomes.ebi.ac.uk/ ) is larger than the original data that I used, likely due to the fact that it's imputed.

If you can see the output BCF file increasing in size, then it (BCFtools) should be working. There will be some output for each chromosome to terminal screen when each is finished.

I am currently processing chr1. I will let you know when it finishes for me and let you know the final expected BCF file size.

ADD REPLY • link 6.7 years ago by Kevin Blighe 89k

1

Entering edit mode

It will be possible to use https://www.cog-genomics.org/plink/2.0/resources#1kg_phase3 as a starting point in a few weeks (currently working on the plink 2.0 multiallelic-split and indel-normalization code), and avoid bcftools entirely; that should greatly accelerate the preprocessing steps.

ADD REPLY • link 6.7 years ago by chrchang523 11k

0

Entering edit mode

Thanks - that's useful to know. Keep me updated so that I can update this tutorial accordingly!

ADD REPLY • link 6.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Yes, the file size is continuously getting larger. Great, I will wait to hear what happens with processing on your end. Thank you for your help.

ADD REPLY • link 6.7 years ago by elayna.kirsch • 0

1

Entering edit mode

The final size for the chr1 BCF should be 1GB, much larger than the data on which I built this original tutorial. I am going to continue processing the data and updating the tutorial code as necessary. The parts likely to change are Steps 6 and 7.

Obviously this also occupies much disk space, so, be sure to remove all intermediary files once finished.

ADD REPLY • link 6.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Okay, it took ~1 day of processing to convert these to BCF format. I'm now proceeding to convert to PLINK format (PLINK). For the purposes of this tutorial, I'm also not now analysing chrX, and updating the code accordingly.

ADD REPLY • link 6.7 years ago by Kevin Blighe 89k

0

Entering edit mode

I had to make quite a few changes with the new imputed data. I have put an entire new tutorial here. All of the code is there: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

ADD REPLY • link 6.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi very very helpful post. I meet a problem with clustering by plink, because plink basically do not work if you have hundreds of thousands individuals is there alternative solution?

ADD REPLY • link 6.4 years ago by alicodendrochit • 0

1

Entering edit mode

You can get a plot in a reasonable amount of time with plink 2.0's "--pca approx", which is aimed at datasets with hundreds of thousands of samples.

ADD REPLY • link 6.4 years ago by chrchang523 11k

0

Entering edit mode

Hello Kevin,

Thanks for the wonderful tutorial. I have one basic question, and I will really appreciate help from your side.

I have multiple vcf files. How can I merge those to create a 'mega' vcf file (vcf matrix file) having only the genotype information, to be used along with the 'merged' VCF file from 1000 genomes.

I would like to use my VCF files for ethnicity prediction.

Thanks a lot.

ADD REPLY • link 5.4 years ago by caspase8mach ▴ 30

1

Entering edit mode

Hey, your VCFs are in the correct format specification for VCF, correct?

I built an ethnicity predictive model previously via PCA that had >98% sensitivity and specificity - it used the PCA eigenvectors from the method that I describe in this tutorial (above): A: How to predict individual ethnicity information by using hapmap data

For predicting ethnicity by eye, as you are trying to do, you can also just merge the 1000 Genomes data with your own data, and then plot a PCA bi-plot together to see where your samples group with those from 1000 Genomes. You basically just follow the same tutorial as above but, prior to doing anything, filter both datasets so that they have common variants, merge them together, and then proceed through the remaining steps.

plink has a --merge-list flag.

Note that the updated tutorial is here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

ADD REPLY • link 5.4 years ago by Kevin Blighe 89k

0

Entering edit mode

Thanks a lot Kevin for the prompt reply.

How can I extract only the genotypes from various vcf files I have, extract common variants and merge those? I want to create a merged VCF file from various individual vcf files, similar to the 1000 Genomes merged VCF file.

ADD REPLY • link 5.4 years ago by caspase8mach ▴ 30

0

Entering edit mode

Hi Kevin,

I have an exome data of african ancestry with just 1 phenotype. And I would like to use exome data from 1000 genome project as a control and identify snps under selection. I realized that the vcf is for whole genome sequence. Is there anyway to preprocess the vcf so that I get just the exome or I have to download the raw exome files and call the snps? Please advice thank you

ADD REPLY • link 5.0 years ago by appiahv ▴ 30

0

Entering edit mode

Hey, the new version of this pipeline is here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

I would download and prepare the 1000 Genomes data as per the tutorial so that you have it for the future. Then, you can heavily filter the number of variants when you come to merging your sample with those of the 1000 Genomes.

ADD REPLY • link 5.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Hey Kevin, I am attempting to download exome samples from 1000 genome phase3. What I realized was that some individuals have multiple files eg. this individual( https://www.internationalgenome.org/data-portal/sample/NA18505 ) ha the following files

SRR096669.filt.fastq.gz
SRR096669_1.filt.fastq.gz
SRR096669_2.filt.fastq.gz
SRR096670.filt.fastq.gz
SRR096670_1.filt.fastq.gz
SRR096670_2.filt.fastq.gz
SRR716648.filt.fastq.gz
SRR716648_1.filt.fastq.gz
SRR716648_2.filt.fastq.gz
SRR716649.filt.fastq.gz
SRR716649_1.filt.fastq.gz
SRR716649_2.filt.fastq.gz

I am interested in paired-end sequence. The question is do I pick a pair and call the variants from or I have to combine all the files.?

ADD REPLY • link updated 5.0 years ago by Kevin Blighe 89k • written 5.0 years ago by appiahv ▴ 30

1

Entering edit mode

I am not sure. Perhaps download all and take a look at the read statistics and quality. Some of these may be 'testing runs' and have low numbers of reads, or represent failed runs.

ADD REPLY • link 5.0 years ago by Kevin Blighe 89k

1

Entering edit mode

Thanks Kevin, in the end I decided to merge them 1.fastq.gz >merged_R1.fastq.gz 2.fastq.gz>merged_R2.fastq.gz I think this will give me high coverage.

ADD REPLY • link 5.0 years ago by appiahv ▴ 30

0

Entering edit mode

Sure, that seems like a reasonable approach.

ADD REPLY • link 5.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Are you searching for novel variants missed by the original research group and publication in Nature? I re-analysed a few samples and was able to uncover all variants using just samtools mpileup. The original authors used 3 less sensitive variant callers in order to increase their chances of finding all true variants.

ADD REPLY • link 5.0 years ago by Kevin Blighe 89k

0

Entering edit mode

I have 14 samples of exome sequence for cancer and I am looking to find snps which are under selection. However I dont have control samples and so I intend to use one population(african ancestry) as control.

I have downloaded the raw fastq exome data (59 samples) from 1000 genome website, created bam files for both my 14 samples and the downloaded 1kg exomes. I am now doing a joint variant calling using gatk and bcftools after which I perform GWAS on the vcf outputs using Plink. My concern is that does using the 1kg data as control a right approach? and also combining all the data to call for snps also a right step?

I have just entered in the world of population genetics and much help and suggestions will be appreciated.

ADD REPLY • link 5.0 years ago by appiahv ▴ 30

0

Entering edit mode

It seems like a reasonable approach

ADD REPLY • link 5.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin,

great code! I was just wondering if you could explain why did you choose for pruning --indep 50 5 1.5 why not for example --indep-pairwise 50 5 0.5? Is there is a guidance somewhere on how one can choose there size of the region and r2? I imputed around 6 million SNPs and I am wondering how to set that parameter.

ADD REPLY • link 4.8 years ago by anamaria ▴ 220

1

Entering edit mode

--indep [window size] [step size/variant count)] [Variance inflation factor (VIF) threshold]

The 1.5 value relates to the VIF, and it should not really change too much, in my view. The two other values are important and will be affected by the genotyping density of the array that one is using. For a high density array, genotypes may be spaced roughly evenly across the genome, while, for others, the spacing may be inconsistent.

I am not aware of any publication where the values of these are specifically defined. A quick search reveals that VIF 1.5 is used frequently, probably in a similar fashion as we may use |log2FC|>2 for differential expression.

So, no right or wrong answer.

ADD REPLY • link 4.8 years ago by Kevin Blighe 89k

0

Entering edit mode

@Kevin Thanks for this post, it's proven to be a really helpful guide. I do have a question about merging array genotype data with WGS...I notice a strange occurrence in my PCA (and admixture plots, and F statistics...) after merging array data with a few existing WGS sets. Basically what's happening is the array samples look to be very unique in all of these analyses, which just isn't possible according to common sense in this project. So, clearly I've done something wrong in my QC/filtering/merging through plink.

My question is about step 5 in your workflow. How do I remove SNPs not on the coding strand and what effect might this cause if they are not removed? I don't think I fully understand the logic of what is happening at this step, and perhaps it's part of my problem. I'm having a hard time finding the right keywords to find answers/suggestions for the issue in my QC workflow so any advice would be really helpful! Thanks!

ADD REPLY • link 3.6 years ago by sara27888 • 0

0

Entering edit mode

Hi Sara, you will have to retrieve the 'manifest' or other annotation file for the array that you are using. The strand information should be located in that file. Such a file would usually be available for download on the respective manufacturer's website.

ADD REPLY • link 3.6 years ago by Kevin Blighe 89k

0

Entering edit mode

Can you help me understand WHY the alleles on the coding strand need to be removed? I'm also not understanding if the coding strand is the same as the top/bottom or forward/reverse or +/-. The manifest file for the Illumina chip I used doesn't indicate in terms of coding/noncoding. It does give information about the : "IlmnStrand," "SourceStrand," and "RefStrand." These are listed as TOP/BOT/PLUS/MINUS and +/-. Is there a tool for removing snps on the coding strand? Or is this something it is typical to write one's own script to do? Thanks!

ADD REPLY • link 3.6 years ago by sara27888 • 0

0

Entering edit mode

Hi Kevin Blighe , Thank you for this great Tutorial, I have followed the same steps but I cannot understand the interpretation of the proportion variances here! I need to get the cumlitave proportion of each PC to determine which ones to plot. Thanks again,,

ADD REPLY • link 2.5 years ago by Mahmoud.Bassyouni • 0