Tutorial:Produce PCA bi-plot for 1000 Genomes Phase III - Version 2
0
95
Entering edit mode
6.3 years ago

Note1 - Previous version: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format (old)

Note2 - this data is for hg19 / GRCh37

Note3 - GRCh38 data is available HERE


The tutorial has been updated based on the 1000 Genomes Phase III imputed genotypes. The original tutorial was performed on non-imputed data held at the University of Washington, which is no longer accessible.

Other changes:

  • tutorial now entirely streamlined - all commands, including in R, are now included
  • duplicate variants are now removed with BCFtools, not PLINK (previous Step 6 removed)
  • now only performs PCA (originally, MDS was also performed but never used)
  • no longer using chrX variants (only autosomal variants)
  • new Step 3, indicating how to download the 1000 Genomes GRCh37 reference build

Program requirements:

  • plink > v1.9
  • BCFtools (tested on v 1.3)

Disk space requirements:

  • downloaded data (VCF.gz and tab-indices), ~ 15.5 GB
  • converted BCF files and their indices, ~14 GB
  • binary PLINK files, ~53 GB
  • pruned PLINK binary files, ~ <1 Gb

1, Download the files as VCF.gz (and tab-indices)

prefix="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr" ;

suffix=".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz" ;

for chr in {1..22}; do
    wget "${prefix}""${chr}""${suffix}" "${prefix}""${chr}""${suffix}".tbi ;
done

2, Download 1000 Genomes PED file

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped ;

3, Download the GRCh37 / hg19 reference genome

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz ;
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai ;
gunzip human_g1k_v37.fasta.gz ;

NB - if wget is not working, try curl:

curl ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz -O human_g1k_v37.fasta.gz

4, Convert the 1000 Genomes files to BCF

  • Ensure that multi-allelic calls are split and that indels are left-aligned compared to reference genome (1st pipe)
  • Sets the ID field to a unique value: CHROM:POS:REF:ALT (2nd pipe)
  • Removes duplicates (3rd pipe)

  • -I +'%CHROM:%POS:%REF:%ALT' means that unset IDs will be set to CHROM:POS:REF:ALT*

  • -x ID -I +'%CHROM:%POS:%REF:%ALT' first erases the current ID and then sets it to CHROM:POS:REF:ALT*
for chr in {1..22}; do
    bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
      ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz | \
      bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \
        bcftools norm -Ob --rm-dup both \
          > ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.bcf ;

    bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.bcf ;
done

5, Convert the BCF files to PLINK format

for chr in {1..22}; do
    plink --noweb \
      --bcf ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.bcf \
      --keep-allele-order \
      --vcf-idspace-to _ \
      --const-fid \
      --allow-extra-chr 0 \
      --split-x b37 no-fail \
      --make-bed \
      --out ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes ;
done

6, Exclude variants not on the coding strand

NB - This step is only for microarray studies where the probes may only target one strand or the other (sense or non-sense)

7, Prune variants from each chromosome

  • --maf 0.10, only retain SNPs with MAF greater than 10%
  • --indep [window size] [step size/variant count)] [Variance inflation factor (VIF) threshold]

e.g. indep 50 5 1.5, Generates a list of markers in approx. linkage equilibrium - takes 50 SNPs at a time and then shifts by 5 for the window. VIF (1/(1-r^2)) is the cut-off for linkage disequilibrium

mkdir Pruned ;

for chr in {1..22}; do
    plink --noweb \
      --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes \
      --maf 0.10 --indep 50 5 1.5 \
      --out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes ;

    plink --noweb \
      --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes \
      --extract Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.prune.in \
      --make-bed \
      --out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes ;
done

8, Get a list of all PLINK files

find . -name "*.bim" | grep -e "Pruned" > ForMerge.list ;
sed -i 's/.bim//g' ForMerge.list ;

9, Merge all projects into a single PLINK file

plink --merge-list ForMerge.list --out Merge ;


NB - if you have your own data that you want to merge with 1000 Genomes

Process the 1000 Genomes data as per this tutorial from Steps 1-9. In this way, you will have already identified the population-specific variants / markers. Then, after Step 9, do

  • find common variants between your dataset and the merged 1000 Genomes dataset (and filter both for these common variants)
  • merge the 1000 Genomes data with your own data
  • proceed to Step 10

Depending on its size, your own dataset may be divided by chromosome; so, you may have to do some pre-processing before aligning to 1000 Genomes. Either way, the population specific markers will be defined by just the 1000 Genomes dataset (Step 7). If your dataset is microarray, you'll have to pre-filter it for coding (plus / +) strand variants (Step 6).



10, Perform PCA

plink --bfile Merge --pca

11, Generate plots in R

R
options(scipen=100, digits=3)

# read in the eigenvectors, produced in PLINK
eigenvec <- read.table('plink.eigenvec', header = FALSE, skip=0, sep = ' ')
rownames(eigenvec) <- eigenvec[,2]
eigenvec <- eigenvec[,3:ncol(eigenvec)]
colnames(eigenvec) <- paste('Principal Component ', c(1:20), sep = '')

# read in the PED data
PED <- read.table('20130606_g1k.ped', header = TRUE, skip = 0, sep = '\t')
PED <- PED[which(PED$Individual.ID %in% rownames(eigenvec)), ]
PED <- PED[match(rownames(eigenvec), PED$Individual.ID),]
all(PED$Individual.ID == rownames(eigenvec)) == TRUE
[1] TRUE

# set colours
require('RColorBrewer')

# from: http://www.internationalgenome.org/category/population/
PED$Population <- factor(PED$Population, levels=c(
  "ACB","ASW","ESN","GWD","LWK","MSL","YRI",
  "CLM","MXL","PEL","PUR",
  "CDX","CHB","CHS","JPT","KHV",
  "CEU","FIN","GBR","IBS","TSI",
  "BEB","GIH","ITU","PJL","STU"))

col <- colorRampPalette(c(
  "yellow","yellow","yellow","yellow","yellow","yellow","yellow",
  "forestgreen","forestgreen","forestgreen","forestgreen",
  "grey","grey","grey","grey","grey",
  "royalblue","royalblue","royalblue","royalblue","royalblue",
  "black","black","black","black","black"))(length(unique(PED$Population)))[factor(PED$Population)]

# generate PCA bi-plots
project.pca <- eigenvec
summary(project.pca)


par(mar = c(5,5,5,5), cex = 2.0,
  cex.main = 7, cex.axis = 2.75, cex.lab = 2.75, mfrow = c(1,2))

plot(project.pca[,1], project.pca[,2],
  type = 'n',
  main = 'A',
  adj = 0.5,
  xlab = 'First component',
  ylab = 'Second component',
  font = 2,
  font.lab = 2)
points(project.pca[,1], project.pca[,2], col = col, pch = 20, cex = 2.25)
legend('bottomright',
  bty = 'n',
  cex = 3.0,
  title = '',
  c('Population 1', 'Population 2', 'Population 3',
    'Population 4', 'Population 5'),
  fill = c('yellow', 'forestgreen', 'grey', 'royalblue', 'black'))

plot(project.pca[,1], project.pca[,3],
  type="n",
  main="B",
  adj=0.5,
  xlab="First component",
  ylab="Third component",
  font=2,
  font.lab=2)
points(project.pca[,1], project.pca[,3], col=col, pch=20, cex=2.25)

biplot-new

Kevin

1000genomes PLINK PCA • 54k views
ADD COMMENT
0
Entering edit mode

Hi Kelvin, thank you very much for your tutorial. It helps me a lot to plot the PCA-bi plot of 1kg Phrase III samples.

My question is, if I would like to plot my own sample's PCA with 1kg sample's PCA (My samples were genotyped by SNP array. There were around 700k SNPs. I want to check whether my samples came from certain ethical group), shall I directly combine the eigenvec calculated from my sample and eigenvec got from your step 10 and then plot? Can the two eigen vectors comparable?

I ask this question because in my understanding, PC is the linear combination of independent variables. In my sample, PCs are the linear combination of the pruned genotyped SNPs whereas in 1kg samples, their PCs are the linear combination of other pruned (and much more) SNPs. I doubt whether I can directly compare the PCs. If I can't compare my sample's PCs with 1kg's in this way, how can I compare the ethical group of my sample to the reference 1kg samples?

Thank you very much for your possible reply in advance.

ADD REPLY
0
Entering edit mode

Hey, my apologies for I seem to have missed this. It is better to combine them within PLINK and then perform PCA on the combined dataset. I have done this many times and it usually works, in the sense that sample ethnicities line up with the expected group from 1000 Genomes.

ADD REPLY
0
Entering edit mode

Hi Kevin,

Does the following command do a liftover of 1000 genomes to Hg19. I need to do QC on 1000 genomes data before using for LD clumping and as I haven't done this before, I'm a bit unsure. I followed your steps until this point and then proceeded to convert to plink, removed SNPs based on call rate and HWE and will then proceed to use them for LD clumping

for chr in {1..22}; do bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \ ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \

bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' |

bcftools norm -Ob --rm-dup both \
> ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;

bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;

done

Thanks

ADD REPLY
0
Entering edit mode

Hey, I am not sure what you mean by 'liftover' in this context (?). If you want to do LD clumping, then you just need to complete steps 1-5. This will take ~1 day, though. The imputed 1000 Genomes data is 'quite' dense.

ADD REPLY
0
Entering edit mode

The commands that you have pasted into your comment do the following:

  • bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta - checks that the REF variants in the 1000 Genomes VCF files corresponds to the base in the reference genome, human_g1k_v37.fasta.
  • bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' - annotates the ID field in the 1000 Genomes VCF files to be, for example, chr1:12345645:A:T
  • bcftools norm -Ob --rm-dup both - the 1000 Genomes data contains duplicate variants, which have to be removed for PLINK
ADD REPLY
0
Entering edit mode

Hi Kelvin, thank you very much for your tutorial. It really helped me a lot.

I am trying to check that the 89 individuals I am working with are of European origin by comparing them with 1000 Genomes dataset. However, the data I have for my samples was obtained from targeted sequencing of 1,293 regions. My main concern is whether I have to extract only SNPs falling inside my targeted regions from the 1000 Genomes dataset before merging with my 89 samples or, whether I should use all the information from 1000 Genomes.

Thank you for your possible reply in advance.

ADD REPLY
0
Entering edit mode

Hey, I previously achieved this by building a predictive model from the 1000 Genomes Data and then applying it to my own data, which was also targeted sequencing. See here: A: How to predict individual ethnicity information by using hapmap data

The other way to do it is filter the datasets so that they have the same starting list of variants, and then proceed with the tutorial from part 7. For this to work, it would help if your variants are all labeled by rs ID or as chr:bp:ref:var

ADD REPLY
0
Entering edit mode

Thanks so much for this tutorial, Kevin !!

ADD REPLY
0
Entering edit mode

¡De nada!, Mr Locuace

ADD REPLY
0
Entering edit mode

Hey Kelvin,

How to plot sub-population ("ACB","ASW","ESN","GWD","LWK","MSL","YRI", "CLM","MXL","PEL","PUR", "CDX","CHB","CHS","JPT","KHV", "CEU","FIN","GBR","IBS","TSI", "BEB","GIH","ITU","PJL","STU") in pca plot instead of "African","Hispanic","East-Asian","Caucasian","South Asian"?

ADD REPLY
1
Entering edit mode

Hey, you will simply have to modify the input colour vector. You will have to take great care to ensure that the colours line up to the data that is being plotted.

ADD REPLY
0
Entering edit mode

Hi Kevin, Just a small typo, it's actually:

gunzip human_g1k_v37.fasta.gz ;

ADD REPLY
0
Entering edit mode

Thanks! / Obrigado!

ADD REPLY
0
Entering edit mode

thanks alot for this tutorial i have a vcf data file (my data)and i want to do pca how do i do that? i apologise for a naive question but i am a beginner and badly struck

ADD REPLY
1
Entering edit mode

With your own data, your first step should be to import it to PLINK format, and, from there, you can perform PCA immediately. In order to obtain 'population informative' variants, though, you will have to do some of the filtering steps as I do in this tutorial. Basically, you can follow the same tutorial as here.

If you want to integrate your data with the 1000 Genomes data, then it can be a bit more difficult.

ADD REPLY
0
Entering edit mode

Thanks alot kevin I dont have any fasta file just a vcf (freebayes) and i want to perform pca to check out similarity of my sample with other ethnic groups do i need to have a fasta file of my data?

ADD REPLY
1
Entering edit mode

Hello. No, you can start with the VCF. Your first task should be to convert this VCF to PLINK format and ensure that you have retained sample information in the PLINK files. Do you know how to do this? Are all of your samples from the same ethnic group?

ADD REPLY
0
Entering edit mode

Hello kevin I am using bcftools to normalize and annotate variants. link to my data a vcf of 10mb When i try to convert the data in bcf like the third step of this wonderful tutorial. I get an error the sequence chrM not found. The error message indicates that there are some variants on the mitochondrial chromosome but that this chromosome is either missing from your reference or annotated differently.(I am using the GRCh37 / hg19 fasta file).I have only one sample till now and the coming samples will have the same ethnicity. I want to perform PCA and see the similarity of my sample with the known ("African","Hispanic","East-Asian","Caucasian","South Asian"). can you please suggest me what to I apolosize for asking a naive question(I have just started out in this field) and my english is a so not very good so thanks alot for cooperating and helping me out.

ADD REPLY
0
Entering edit mode

Hello, you just need to remove the chrM variants from your input file. You can do this indirectly or directly in many different ways, such as: A: How to extract specific chromosome from vcf file

ADD REPLY
0
Entering edit mode

Hello kevin THANKS A LOT !!!

I remove the chroM varians from my data using the grep -w '^#|^#CHROM|^chr[1-22]' my.vcf > my_new.vcf (I hope it keeps the header and the 22 chr I need) as other methods were either giving index error or header error.

but now I get an errormy error

Now I am getting an error that sequence "chr1" not found

link to my data https://drive.google.com/file/d/1jKaJGI0UW3lYGxmhNUdBchZzMrb8ckZ-/view?usp=sharing

I want to get the pca plot of the variants but i am getting badly struck and thanks a lot for helping me

ADD REPLY
0
Entering edit mode

Perhaps the header has been lost. Can you provide the output of

bcftools view -h my_new.vcf

?

For parsing VCFs, I usually use AWK. Using grep, I would have used something like this, I think:

grep -e "^chrM" -v my.vcf
ADD REPLY
0
Entering edit mode

Hello kevin i used an editor named as em editor to change the vcf file since the annotation in the fasta file and my vcf file was different i changed my vcf file accordingly chr1 became 1 chr 2 became 2 and so on and i removed the chrM from my data, by following this method might have preserved the header the result of bcftools view -h my_new.vcf result1 . result2 result3 resluty

ADD REPLY
0
Entering edit mode

It seems that chrM is no longer present in the file. However, your contig names are just 1, 2, 3, 4, but the BCFtools norm command is searching for 'chr1', 'chr2', etc. It is looking for these due to the fact that these are the contig names used in human_g1k_v37.fasta.

To be honest, the use of this part is not necessary: --check-ref w -f human_g1k_v37.fasta You can just exclude it and use:

bcftools norm -m-any data.vcf \
...

What are the contig names in the 1000 Genomes data that downloaded? - you can use the BCFtools view -h command again.

ADD REPLY
0
Entering edit mode

Hello

`   bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
 data.vcf | \

  bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \

    bcftools norm -Ob --rm-dup both \
      > data.bcf ;

` result : chrM not found error After removing the chrM the same command gave :chr1 not found error Then i edited the vcf file changed the chr1 into 1 and so on I now get the error after removing chrM and changing the vcf

I have tried everything i altered my vcf according to the fast but still i am not able to get the bcf file. Is there another way to do ethnicity regagrding pca?

THANKS ALOT ! I wish for your best health keep safe

ADD REPLY
0
Entering edit mode

So I have some (probably hispanic descent) tumor WGS samples in red dots on my PCA plot projected onto 1kg data like in this tutorial (btw thank you Kevin Blighe for pointing me to this tutorial), and I am wondering how closely can we expect unknown samples to follow any given 1kg clusters. My red dots, although we believe they are hispanic, seem to form a group all of their own, which could be because they all have the same cancer, but I would have thought they should cluster more closely with the AMR super-population, no?

Is there a way to verify how accurate my PC projection is?

my example PCA plot

ADD REPLY
0
Entering edit mode

Hey, looks like you at least got through the tutorial successfully - it can be difficult, particularly when overlaying your own samples. From my experience, and I have done this a good few times, one's own samples can align quite well with the 1KG samples, once you have done the preparatory work correctly.

The AMR (Ad Mixed American) group is clearly the most diverse in the cohort, but it is sampling from quite a diverse pool:

  • MXL Mexican Ancestry from Los Angeles USA
  • PUR Puerto Ricans from Puerto Rico
  • CLM Colombians from Medellin, Colombia
  • PEL Peruvians from Lima, Peru

[ https://www.internationalgenome.org/category/population/ ]

There's a potential that those (your) samples could be verging into the African group, too. I have also seen Brazilian samples group in that region, previously.

If I were you, I would check PC1 vs PC3, too, as the AMR group should separate out on its own plane along PC3.

It does genuinely work very well, though. I built a predictive model from the eigenvectors previously, and made predictions on samples of known and unknown ethnicity. The model had >99% sensitivity and specificity - A: How to predict individual ethnicity information by using hapmap data

ADD REPLY
0
Entering edit mode

Hi, Kevin. Do you know why this post is not showing the replies? It says it has 10 replies, but it appears empty on my end. enter image description here

ADD REPLY
0
Entering edit mode

Yes, I deleted the post in order to avoid offending anybody hyper-sensitive to the topic of genetics and ethnicity.

ADD REPLY
0
Entering edit mode

Oh, I see... If you happen to have this at hand, would mind sharing privately? my e-mail is
enter image description here

ADD REPLY
2
Entering edit mode

Não acho que ainda tenho isto cara. Basicamente, faço o PCA como eu faço aqui nesta pagina com o 1000 Genomes Phase III. Depois, uso os eigenvectors em um modelo através de glm de forma glm(ethnicity ~ PC1 + PC2 + PC3 ... PCn), e faço modelagem usando predict() com amostras novas. Para selecionar os eigenvectors, faço o 'stepwise regression' (https://www.statology.org/stepwise-regression-r/). Valeu

ADD REPLY
1
Entering edit mode

Valeu, estava interessado mais no tipo de modelo mesmo, então foi um glm com stepwise. Obrigado :)

ADD REPLY
0
Entering edit mode

hello i want to plot an pca just like u but with my data can you give me some advice on what to do?

ADD REPLY
0
Entering edit mode

Hi, I need to see examples of your data and understand what you currently have. Otherwise, I cannot do anything.

ADD REPLY
0
Entering edit mode

hello kevin i also want to merge my data(data of 10 persons) with the 1000 genome data and want to plot a pca like the one above.the categories will be the existing ethnicity 'Population 1', 'Population 2', 'Population 3', 'Population 4', 'Population 5' and a label "my data" the plot will contatin pca of 1000 genome data along with my points (my samples). similar in the manner the user omg what i am doing has done. what i have to do can you please advise me.

ADD REPLY
0
Entering edit mode

I think that you basically repeat the process (above) independently for each dataset. When you come to step 9, however, there will be a few different steps:

  • Step 9a: merge the 1000 Genomes data as per the tutorial
  • Step 9b: find common variants between your dataset and the merged 1000 Genomes dataset (and filter both for these common variants)
  • Step 9c: merge the 1000 Genomes data with your own data
  • Step 10: perform PCA
ADD REPLY
1
Entering edit mode

I have now added these extra steps to the main tutorial.

ADD REPLY
0
Entering edit mode

hello i get this error at step 4 enter image description here

ADD REPLY
0
Entering edit mode

Hello. Can you confirm that every step prior to Step 4 has completed successfully?

ADD REPLY
0
Entering edit mode

yes every thing else was successfully completed enter image description here

ADD REPLY
0
Entering edit mode

Are you sure? - I see no proof of that in your screenshot. What is the output of

ls -l
ADD REPLY
0
Entering edit mode

hello kevin ! enter image description here enter image description here

is there is some thing wrong with my computer ? i can also screen share with you or send u my data if i will help you to help me i have 10 individual vcf samples and want the pca plot with my data labeled as "our data" thanks alot kevin fr your support and help..i relly appreciate your efforts!! stay safe and stay healthy

ADD REPLY
0
Entering edit mode

It looks like your BCF files have not yet been created. You need to re-run Step 4 (above)

ADD REPLY
0
Entering edit mode

Hello kevin

every step prior to Step 4 has completed successfully as you can see when i run ls -l command all the files are successfully downloaded enter image description here

When i run step 4 i get this error

ADD REPLY
0
Entering edit mode

If you run the commands exactly as they appear in the tutorial, there should not be a problem. Can you confirm the shell that you are using.

Also, next time, can you show the value of:

ls -l | head

...and also paste the full command that you are using. When I say 'paste', can you paste the text here.

ADD REPLY
0
Entering edit mode

thanks a lot Kevin for helping me

ls -l | head gives this output

kk@DESKTOP-0SEI0MU:/mnt/d/bio$ ls -l |head
total 17711460

-r-xr-xr-x 1 kk kk     155869 Jun 21 20:53 20130606_g1k.ped
-r-xr-xr-x 1 kk kk 1216886729 Jun 21 03:36 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
-r-xr-xr-x 1 kk kk     224660 Jun 21 03:36 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi
-r-xr-xr-x 1 kk kk  773788987 Jun 21 11:54 ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
-r-xr-xr-x 1 kk kk     133086 Jun 21 11:54 ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi
-r-xr-xr-x 1 kk kk  767084423 Jun 21 13:52 ALL.chr11.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
-r-xr-xr-x 1 kk kk     133331 Jun 21 13:52 ALL.chr11.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi
-r-xr-xr-x 1 kk kk  740805962 Jun 21 15:14 ALL.chr12.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
-r-xr-xr-x 1 kk kk     133171 Jun 21 15:29 ALL.chr12.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi
ADD REPLY
0
Entering edit mode

Hi! So, you still need to re-run Step 4. It has evidently not completed successfully due to the fact that there are no BCF files.

ADD REPLY
0
Entering edit mode

hello kevin i apologies for taking your time but i followed every thing you can see that all files were downloaded every thing worked till step 3 but command at step 4 was not able to execute you can check as i have attached screen shot please let me know what i can do.

Thank alot kevin

stay safe

ADD REPLY
0
Entering edit mode

I am in good health. I trust that you are, too. Espero que estés bien.

If you run this, is there a different error message?

bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
  ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \
  bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \
    bcftools norm -Ob --rm-dup both \
      > ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;
ADD REPLY
0
Entering edit mode

Was this resolved?

ADD REPLY
0
Entering edit mode

Hello 17bcs030,

Please refrain from using instant messaging/SMS jargon such as "fr", "u", etc. on professional and scientific forums. Write in complete sentences and use professional language.

ADD REPLY
0
Entering edit mode

I will try not to use these jargons i will use complete sentences and use professional language only, My sincere apologies

ADD REPLY
0
Entering edit mode

Hi Kevin, thank you for your tutorial. I want to merge my own data with 1kg and plot PCA-bi plot, I have some questions for these: Q1:In "step7, Prune variants from each chromosome", why don't you exclude long high-LD regions? Isn't that necessary?

Q2: 1)a. Prune 1kg and my own dataset respectively ; b. Merge the two datasets 2) a. Merge the two datasets; b. Prune them Is there a difference between the two strategies?

Q3: Does 1kg data need to be quality control? e.g.: --geno --mind --maf

Q4: I‘m not very clear about this command: --indep 50 5 1.5 , if I removed same individuals or variants from my data, Do I need to regenerate a *prune.in file ? Thank you for your time in helping me with these questions.

ADD REPLY
0
Entering edit mode

Hey,

Q1:In "step7, Prune variants from each chromosome", why don't you exclude long high-LD regions? Isn't that necessary?

You can do that. Actually, pruning for LD is performed via the --indep flag (see your question 4)

Q2: 1)a. Prune 1kg and my own dataset respectively ; b. Merge the two datasets 2) a. Merge the two datasets; b. Prune them Is there a difference between the two strategies?

Yes, there is a difference, and one will obtain different results depending on the order of these operations. I have not tested which order is preferential over the other.

Q3: Does 1kg data need to be quality control? e.g.: --geno --mind --maf

It is already of good quality from source. However, in this tutorial, I provide a normalisation step to split multi-allelic calls in step 4.

Q4: I‘m not very clear about this command: --indep 50 5 1.5 , if I removed same individuals or variants from my data, Do I need to regenerate a *prune.in file ? Thank you for your time in helping me with these questions.

This is pruning based on LD. More information here: https://www.cog-genomics.org/plink/1.9/ld

It would be altered if samples were removed - yes.

ADD REPLY
0
Entering edit mode

Hi Kevin, thanks for your explanation. I was reading this comment and tried myself. For the Q2 if I prune 1kg and my dataset respectively first, I will have much fewer SNPs overlapped when merging. So I guess we should merge first and then prune? Can I directly merge my unpruned dataset to pruned 1kg (this way still only pruned SNPs are used)?

ADD REPLY
0
Entering edit mode

Hi Kevin Blighe, thanks for this incredible tutorial! I was wondering: having obtained the plink.eigenval and plink.eigenvec files using plink's --pca function, is there an objective way of identifying how many population covariates should be included in downstream genetic analyses (e.g. generation of polygenic risk scores or eQTL work)? I am following this other post, but I am not sure I understand what's this objective way. Should I analyse PC 1 by 2, 2 by 3, 3 by 4, 4 by 5 and so on, until I find a normal distribution? Is there a "numeric" score per PC that could be useful to make this more objective (e.g., include only principal components as covariates if they are associated with eigenvalues > 1.5?). Any clarification on this would be much appreciated!

Also, plink automatically calculates 20 PCs. If I ask plink to calculate 10 PCs, will the first few PCs be different between the two different runs (when I asked plink to calculate 10 vs when it calculated 20)? I'll be testing this later on so I can answer this one in a while - I am assuming that values will change.

Finally, if I have a variable like "genotype chip name" (samples from different institutions were genotyped using a different SNP chips), is this something I can include as a factor in the calculation of my PCs? (Or is this something that will drive the PC values I obtain anyway, so no need to do this?)

Muito obrigado, caríssimo!! And I am sorry if these questions are too basic.

ADD REPLY
1
Entering edit mode

Boa tarde! Some people just inspect the bi-plots and make a decision about PCs to include (as covariates) in that way. In my opinion, however, it is better to regress the PCs against population, and then retain the statistically significant PCs as covariates. For example:

glm(population ~ PC1, ...)
glm(population ~ PC2, ...)
...
glm(population ~ PC[x], ...)

, or:

lm(PC1 ~ population, ...)
lm(PC2 ~ population, ...)
...
lm(PC[x] ~ population, ...)

The values to use for the PCs should be in the eigenvec file (?). The PCs to retain are those that pass p<0.05.

Also, plink automatically calculates 20 PCs. If I ask plink to calculate 10 PCs, will the first few PCs be different between the two different runs (when I asked plink to calculate 10 vs when it calculated 20)? I'll be testing this later on so I can answer this one in a while - I am assuming that values will change.

I am not sure, but I I think that the first 10 will always be the same.

Finally, if I have a variable like "genotype chip name" (samples from different institutions were genotyped using a different SNP chips), is this something I can include as a factor in the calculation of my PCs? (Or is this something that will drive the PC values I obtain anyway, so no need to do this?)

This is likely a batch covariate, yes, and it may be altering the PCA calculations. The biggest worry is that the different chips / arrays may be genotyping different SNPs, or different strands (coding / non-coding; plus / minus) of the same SNP. Some upstream pre-processing (before PCA) should be performed to ensure alignment across the different chips, I think.

ADD REPLY
0
Entering edit mode

Hello Kevin, I´m sorry for bothering you. I´m trying to convert 1000 genomes vcf file (ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes) to plink format and merge it with my in-house dataset (already in plink format) so I can perform admixture analysis and maybe PCA also. I would need to follow all the steps you detailed here? What would be the problem if I just do the conversion of the vcf with plink --vcf ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes --out cromX1KG ? I would be very gratefull with your clarification!! Iriel

ADD REPLY
0
Entering edit mode

Hey, no problem. In this tutorial, I do not focus on any allosome (chrX or Y); however, the data is available at the FTP sites:

Yes, you can follow some of the steps here in order to merge your data with the 1000 Genomes chrX data, primarily Steps 4 + 5, and then follow advice Here

ADD REPLY
1
Entering edit mode

Thanks for the answer! I will try to follow your instructions. Iriel

ADD REPLY
0
Entering edit mode

De nada! There is technically no right or wrong way in which to merge the datasets. The 'merge' can occur at different stages of processing. Also, you may not want to do, for example, Step 7, from my tutorial above.

ADD REPLY
0
Entering edit mode

Hi Kevin, I have tried steps 4 and 5 of your tutorial on chromosome X vcf from 1000 genomes, then I filtrate my in house dataset of chromosome X SNPs to only keep SNPs on the + strand (according to Illumina microarray files). I also found the common SNPs using the intersect function from R and extracted those common SNPs from each dataset.

The problem comes out when I try to merge both datasets with plink

plink --bfile chrXKG2_int --bmerge cromXplus_int.bed cromXplus_int.bim cromXplus_int.fam --make-bed --out cromXmerged

Plink throws an error saying that there are several variants with 3+ alleles present, that this could be due to strand inconsistency.

I think I should do the same selection of + strand genotyped SNPs in 1000 genomes dataset, do you have any command to do that?

Thanks in advance for your answer!

ADD REPLY
1
Entering edit mode

Yes, you would need to filter the 1000 Genomes dataset for these coding (+) strand variants too. I believe the PLINK flag to do this is --exclude

ADD REPLY
0
Entering edit mode

Hola kevin gracias por un tutorial tan maravilloso

I have my few sample files and I want to merge them with the 1000 genome and want to perform the PCA.

I followed your great tutorial but I get an error at step 4. I am using bcftools 1.3 and I have performed all steps 1,2,3.

Here you can see I have downloaded all the 1000 genome files along with the reference fasta

Files Files

When I run the Step 4 I get this error

ERROR ERROR

When I run this:

 bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
   ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \
   bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \
     bcftools norm -Ob --rm-dup both \
      ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;

bcf

I get bcf for chr1 only that too incomplete(few bytes in size), if I replace chr1 with chr2 I again get the same segmentation error

I am using a laptop with 16 GB of ram. I am badly struck. thanks a lot for your help

Mantente seguro

DIOS TE BENDIGA, REZO POR TU BUENA SALUD

ADD REPLY
0
Entering edit mode

Please do not paste screenshots of plain text content - it is counterproductive. Use GitHub gists to show us the actual text content instead. See this post on how to use GitHub gists: A: How to Use Biostars Part-3: Formatting Text and Using GitHub Gists

ADD REPLY
0
Entering edit mode

greetings Thanks alot for your valuable suggestion i will make sure to follow the method you have prescribed

ADD REPLY
0
Entering edit mode

Hi, you seem to be using two accounts: krishnakatyal5121 and 17bcs030

Please do not do this. Stick to one account. Send an email to the administrator (admin at biostars dot org) and request that your accounts be merged.

ADD REPLY
0
Entering edit mode

Greetings I have mailed the administrator,If I have caused any incoviniance I ask for apologies. Thanks alot , I request if you could help me with my problem.stay safe

ADD REPLY
0
Entering edit mode

Greetings,

I am a complete beginner. I was trying to perform step 1, but got stuck. Get the following output where it says, NO SUCH FILE, this is for all chromosomes...Hope someone will explain what i have done wrong.

Thanks
Lax

--2021-04-21 17:09:08--  ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
           => 'ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'
Resolving ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)... 193.62.193.140
Connecting to ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)|193.62.193.140|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /vol1/ftp/release/20130502 ... done.
==> SIZE ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz ... done.
==> PASV ... done.    ==> RETR ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz ...
**No such file** 'ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'.
ADD REPLY
1
Entering edit mode

Hi Lax,

I just tried this and got the same error as you when typing in the tutorial commands as written. If you go to the ftp url from the prefix in your browser, (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/), you'll see that the file names are slightly different: version b instead of version a as written in the original tutorial (e.g. ...integrated_v5b.20130502...). When I updated the suffix text, the download worked for me.

Rewritten step 1 with updated file name suffix:

prefix="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr" ;

suffix=".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz" ;

for chr in {1..22}; do
    wget "${prefix}""${chr}""${suffix}" "${prefix}""${chr}""${suffix}".tbi ;
done

Hope this helps!

ADD REPLY
0
Entering edit mode

Hi Kevin I have used the --make-pgen flag which give me .pgen, psam, and pvar, after that I used pca command and got .eigenvec and .eigenval. I want to draw a plot using R but the script here used PED. how can I plot in R for Psam, pvar and pgen?

ADD REPLY
0
Entering edit mode

Hi, --make-pgen just creates a binary dataset. Can you not recode this as plain text PED? Sorry, I am tired and will check back in the morning

ADD REPLY
0
Entering edit mode
> R
Error: object 'R' not found
> 
> options(scipen=100, digits=3)
> 
> # read in the eigenvectors, produced in PLINK
> eigenvec <- read.table('plink2.eigenvec', header = FALSE, skip=0, sep = ' ')
> rownames(eigenvec) <- eigenvec[,2]
Error in `[.data.frame`(eigenvec, , 2) : undefined columns selected
> eigenvec <- eigenvec[,3:ncol(eigenvec)]
Error in `[.data.frame`(eigenvec, , 3:ncol(eigenvec)) : 
  undefined columns selected
> colnames(eigenvec) <- paste('Principal Component ', c(1:20), sep = '')
Error in names(x) <- value : 
  'names' attribute [20] must be the same length as the vector [1]
ADD REPLY
0
Entering edit mode

It comes across as a bit rude to paste some errors and provide no commentary - sorry.

By having R there, I am invoking the R environment for you, i.e., you type R at your command prompt in order to load R.

Regarding the other errors, please show the first few lines of eigenvec

ADD REPLY
0
Entering edit mode

sorry, my understanding is very low in bioinformatics and I want to plot the PCA graph using Rstudio. i have used plink for the generation of pca. i have virus vcf file of SNPs which I converted to .psam, pvar, and pgen. after that i use PCA command and finally got two file eigenvec and eigenv. after that i put the script in Rstudio that you provide in the tutorial.

(R

options(scipen=100, digits=3)

# read in the eigenvectors, produced in PLINK
eigenvec <- read.table('plink.eigenvec', header = FALSE, skip=0, sep = ' ')
rownames(eigenvec) <- eigenvec[,2]
eigenvec <- eigenvec[,3:ncol(eigenvec)]
colnames(eigenvec) <- paste('Principal Component ', c(1:20), sep = '') 

I HAVE to try it with both typing the R and without typing the R in the command prompt still show error.

ADD REPLY
0
Entering edit mode

I only added the R there to indicate that you then need to move from the shell command prompt to the R environment. In your case, you just need to move to RStudio.

https://pasteboard.co/K6q0NJX.png

ADD REPLY
0
Entering edit mode

which library is need to plot PCA graph in R. ggplot, tydiverse

ADD REPLY
0
Entering edit mode

You just need base R functions. I am guessing that your input data is different, for some reason. How did you generate the plink.eigenvec file? - please show the first few lines from it.

ADD REPLY
0
Entering edit mode

below is the Ist lines of plink.eigenvec

#IID    PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
MH218731.1  0.0152208   0.00944359  0.0103379   -0.016281   -0.0104166  -0.0147989  -0.00897963 0.0352926   -0.00974202 -0.218918
MG572182.1  0.0377849   -0.0284037  0.000547742 0.00247451  0.00140346  0.0019857   0.0104293   -0.0535947  -0.0190876  -0.0980317
MG746041.1  0.047474    -0.0575027  -0.00575743 0.0124768   0.00614564  0.011023    0.00851316  -0.0434487  -0.0448973  -0.000333792
MG746042.1  0.047592    -0.0572615  -0.00527609 0.0126947   0.00621977  0.0110911   0.00909679  -0.041873   -0.0446051  0.000632583
MG746043.1  0.0473333   -0.0576048  -0.00583835 0.0118375   0.00520263  0.0110696   0.00909136  -0.0422423  -0.0435948  -0.000432457
MG746044.1  0.0473676   -0.0576595  -0.00571874 0.0119169   0.00565738  0.010653    0.00851585  -0.0424485  -0.0444913  -0.000936355
MG746045.1  0.0473575   -0.057696   -0.0058114  0.0120366   0.00569223  0.0107174   0.0087301   -0.0424384  -0.0439578  0.000366145
MH671553.1  0.0350993   -0.0216172  0.00691768  -0.00643424 -0.00312974 -0.0278186  -0.0265101  0.163726    -0.100952   0.0419335
MG746040.1  0.0471743   -0.058322   -0.00612036 0.0125193   0.00663086  0.0109957   0.00841344  -0.0417326  -0.0416117  0.00165737
MG746010.1  0.0471363   -0.0583 -0.00623652 0.0125998   0.00715682  0.0109305   0.00864048  -0.0419027  -0.0419498  0.00148289
MG746005.1  0.0473502   -0.0577398  -0.00630305 0.0123974   0.00685714  0.0109009   0.00909979  -0.0418438  -0.0434613  0.00155532
MG746006.1  0.0475522   -0.0575318  -0.00610206 0.0122633   0.00640293  0.0114584   0.00826131  -0.0416395  -0.0444442  0.000140273
MG746007.1  0.0475734   -0.0572951  -0.00571316 0.0122144   0.00648051  0.0105819   0.00822079  -0.041815   -0.0413793  0.00127995
MG746008.1  0.0477568   -0.0571485  -0.00535316 0.0124548   0.00599396  0.0105851   0.00793466  -0.0419781  -0.0430056  -0.0027565
ADD REPLY
0
Entering edit mode

Hi, I got an error when running ‘gunzip human_g1k_v37.fasta.gz’ in step 3. The error message is 'unexpected end of file'. Is the file damaged? Thank you.

ADD REPLY
0
Entering edit mode

Can you confirm that the download completed?

ADD REPLY
0
Entering edit mode

It shows that it is done. Is the compressed file 115 KB? By the way, when I used a decompress software, although it showed that the file cannot be extracted properly, there was an output file. In the output file, there are many rows consisting of N after the first row, and then comes T, A, C, G. Is it normal? I am afraid that the file has some differences compared to the original one.

ADD REPLY
0
Entering edit mode

No, human_g1k_v37.fasta.gz is just a gunzipped version of the hg19 reference genome; so, ~900MB in size. Can you see it by going here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/

For the record, I have an encoding algorithm that can 'zip' this genome to below that of 2bit.

ADD REPLY
0
Entering edit mode

Yes, I can see the file. But when I downloaded it, the download just finished when the file size is 115 KB.

ADD REPLY
0
Entering edit mode

Please check available disk space, etc. Apologies if you have already done this.

ADD REPLY
0
Entering edit mode

Hi Charles, I am having the same issue. Do you manage to solve it?

gunzip human_g1k_v37.fasta.gz

gzip: human_g1k_v37.fasta.gz: unexpected end of file

ADD REPLY
0
Entering edit mode

Can you please show the complete output after you have run:

wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz ;
ADD REPLY
0
Entering edit mode

wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz --2021-06-21 10:34:13-- http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz Resolving ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)... 193.62.197.77 Connecting to ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)|193.62.197.77|:80... connected. HTTP request sent, awaiting response... 200 OK Length: unspecified [application/octet-stream] Saving to: ‘human_g1k_v37.fasta.gz’

[                     <=>               ] 3,424,591    257KB/s   in 13s    

2021-06-21 10:34:26 (263 KB/s) - ‘human_g1k_v37.fasta.gz’ saved [3424591]

This seems to be a Ubuntu issue. I used Mac to do wget and gunzip. It output

gunzip: human_g1k_v37.fasta.gz: trailing garbage ignored

but also the human_g1k_v37.fasta. I can proceed with the fasta without problem.

ADD REPLY
0
Entering edit mode

Okay, that is strange. I just tried wget (Ubuntu 16.04) and my connection was refused. I am now trying with curl and it seems okay.

curl http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz -O human_g1k_v37.fasta.gz
ADD REPLY
0
Entering edit mode

Hi Kevin, thank you for this tutorial! I am following along to combine my own microarray data with 1000G for downstream analyses.

Would it be possible for you to expand on the reason for Step 6 (excluding variants not on the coding strand). I'm wondering why this step is necessary if I have aligned both the microarray data and 1000G data to the same reference genome.

Thank you in advance for explaining further.

ADD REPLY
0
Entering edit mode

Hi Olivia, we need Step 6 when using array data due to the fact that array probes can target the sense (coding) or non-sense (non-coding) strands, whereas NGS (used by 1000 Genomes) only sequences the coding strand. Integrating array data with the 1000 Genomes dataset can be tricky - I hope that you manage to do it.

ADD REPLY
0
Entering edit mode

Hi Kevin,

Thank you for such a comprehensive tutorial. I am using 1K data in build 38. While using R to draw the plots, I do not have any individuals left in the PED file. It comes as NA instead of TRUE. I have tried to search a different pedigree file but have not found one. Any suggestions are appreciated.

ADD REPLY
0
Entering edit mode

Hi, there's not much that I can do to help, right now. You have not shown any code or examples of input data.

ADD REPLY
0
Entering edit mode

Hi Kevin,

I am using the R script you have shown above to draw the plots.

#Read the eigenvec file
eigenvec <- read.table('plink.eigenvec', header = FALSE, skip=0, sep = ' ')
rownames(eigenvec) <- eigenvec[,2]
eigenvec <- eigenvec[,3:ncol(eigenvec)]

#Read the PED file
colnames(eigenvec) <- paste('Principal Component ', c(1:20), sep = '')
PED <- read.table('20130606_g1k.ped', header = TRUE, skip = 0, sep = '\t')
PED <- PED[which(PED$Individual.ID %in% rownames(eigenvec)), ]
PED <- PED[match(rownames(eigenvec), PED$Individual.ID),]
all(PED$Individual.ID == rownames(eigenvec)) == TRUE
[1] NA

Below is the eigenvec file:

   Principal Component 1 Principal Component 2 Principal Component 3 Principal Component 4
82008            0.00664539            0.00406668            0.00301638          -0.003486120
82009            0.00713539            0.00526112            0.00220775          -0.000691655
82010            0.00689200            0.00476822            0.00343645          -0.001918870
82013            0.00674223            0.00397111            0.00457880          -0.002049680             
ADD REPLY
0
Entering edit mode

Well, the problem is in the code chunk, #Read the PED file. Can you re-check that you have copied my code directly? I can already see that one of your lines is misplaced (out of place).

You will know that it is functioning correctly when this line returns TRUE:

all(PED$Individual.ID == rownames(eigenvec)) == TRUE
ADD REPLY
0
Entering edit mode

Thank you. That is true. However, the code is the same as yours and I still get NA. Manually comparing, I see that there are no common individuals in these two files (the ids are different in the fam file of 1K in build 38 and the ped file). That is why, I was wondering whether there is another PED file for build 38 that I am not able find.

ADD REPLY
0
Entering edit mode

Please use ADD REPLY when responding to existing posts. Do not add comments as new answers to this thread.

ADD REPLY
0
Entering edit mode

I have not looked in depth at the hg38 data, unfortunately. The participant IDs should be the same [I would hope...]. Can you show an example of how the IDs appear in plink.eigenvec?

ADD REPLY
0
Entering edit mode

Hi Kevin,

Thank you so much for this nice explanation !! I followed your steps for 1000 genomes GRCh38 aligned data and got exactly the same nice PCA plot. Wonderful!!

ADD REPLY
0
Entering edit mode

I am doing the same thing as you, my data is also GRCH38, can you please provide your code conveniently? Or where can I find the population information of GRCH38?I can't find the ped file

ADD REPLY
0
Entering edit mode

Hi Kelvin my data version is GRCH38,but I see your Note 3 GRCh38 data is available HERE,where is the 1000 Genomes PED file,I do not find it thank you

ADD REPLY
0
Entering edit mode

Hi, the PED file should be independent of the reference genome build.

ADD REPLY
0
Entering edit mode

Hi, Thanks for the useful and applicable tutorial, I do not have access to input for "eigenvec" file .

Could you please show us an input file for the "11, Generate plots in R" part. I used .pca.evec file from eigensoft, the input is like:

>   #eigvals:    13.726    10.668 
>                 AFF1     0.0425      0.0347             Case
>                 AFF2     0.0409      0.0163             Case
>                 AFF3     0.0372      0.0287             Case
>                 AFM1     0.0394      0.0451             Case
>                 AFM2     0.0511      0.0139             Case
>                 AFM3     0.0186      0.0308             Case
>                 AFS1     0.0377      0.0415             Case
>                 AFS2     0.0519      0.0632             Case
>                 AFS3    0.0702      0.0160             Case

But the codes that you mentioned above are not ok, I would be grateful if there are some advice?

ADD REPLY
0
Entering edit mode

Hi, the eigenvec file should have columns:

FID IID PC1 PC2 ... PC[n]

However, the header line is actually missing.

ADD REPLY
0
Entering edit mode

Thanks, I am sorry for subsequent comments, I'm trying the below codes

*setwd("D:/DD/Mydata/PCA/RUNPCA/plot)
eigenvec <- read.table('eigenvec.txt', header = FALSE, skip=0, sep = ' ')
rownames(eigenvec) <- eigenvec[,2]
eigenvec <- eigenvec[,3:ncol(eigenvec)]
colnames(eigenvec) <- paste('Principal Component ', c(1:2), sep = '')
# set colours
library(RColorBrewer)
eigenvec <- factor(eigenvec, levels=c("AFF","ALL","LLT"))
col <- colorRampPalette(c(
  "yellow","grey","black"))(length(unique(eigenvec)))[factor(eigenvec)]* 
ADD REPLY
0
Entering edit mode

Just one question. I need the variants to have either the rsID (when it was in the VCF) or the chr:bp:ref:alt (when no rsID was present in the VCF).

Like this:

chr2:60020:T:TA
chr2:60026:T:C
rs123746
chr2:60034:AC:A
rs66749230

So, would this work in the command 'chr%CHROM\:%POS\:%REF\:%ALT'?

ADD REPLY
0
Entering edit mode

Hi Kevin, thanks for the wonderful tutorial. I managed to create a very similar plot using 1000GP3 hg19 data, but I get a strange split in the data PC1 vs PC3 plot (not particularly noticeable in the PC1 vs PC2 plot) for each population when generating the same plot using hg38 data is this normal, or do you think I am doing something wrong?

1000GP3 hg19 dataset: 466436 variants and 2504 people pass filters and QC. enter image description here

1000GP3 hg38 dataset: 422864 variants and 2548 people pass filters and QC.enter image description here

ADD REPLY
0
Entering edit mode

Hi, there. I have made the same plot with you. Have you got any ideas?

ADD REPLY
0
Entering edit mode

Hi Kevin Need help regarding "NB - if you have your own data that you want to merge with 1000 Genomes" I do have my own data I am confused at step 9b) i.e. "Step 9b: find common variants between your dataset and the merged 1000 Genomes dataset (and filter both for these common variants)" By filtering did you mean removing the common variants between 1000 genome and my own data? If yes, why are we removing them?

ADD REPLY
1
Entering edit mode

It means restricting both datasets to only the overlapping variants. This is a quick-and-dirty way of integrating; the other alternative would be to run the data through haplotype-based imputation - either imputing your data into 1000G, or imputing both using an even larger and more diverse reference panel. This does take a long time, however.

ADD REPLY
0
Entering edit mode

I am facing an issue when I merged my dataset with 1000 genome, the genotyping percentage of my dataset is 18.3% and thus, while running PCA an error is thrown "Error: Failed to extract eigenvector(s) from GRM". Is it not possible to integrate my own data with low genotyping quality.

ADD REPLY

Login before adding your answer.

Traffic: 1950 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6