Interpreting plink snp tagging output
1
1
Entering edit mode
5.7 years ago
RNAseqer ▴ 280

Hello all,

I was a little confused by this plink output im producing. It was my understanding that --indep-pairwise analysis using a cutoff value for r2 would identify all the SNPs in a .ped file with an r2 correlation value of >.7

$ ./plink --file input.plink --indep-pairwise 225 23 .7 --out plinkoutput1

SNPs with a r2 >.7 would have their IDs saved to a file 'prune.in' while those with a r2 < .7 would be saved to a second file called 'prune.out'. Now, if you want to get the tagging SNPs and a list of all the SNPs they tag, you run something along the lines of

$ plink --file input.plink --show-tags my_tags.txt --list-all --tag-r2 .7 --out mytags_and_what_they_tag

So it makes sense that you could use the prune.in list of SNPs for the list of tags right?

$ plink --file input.plink --show-tags input.prune.in --list-all --tag-r2 .7 --out mytags_and_what_they_tag

and in the output every SNP should appear as either a tag for something, or something that is being tagged by another snp. But when I look at the file:

    SNP  CHR         BP NTAG       LEFT      RIGHT   KBSPAN TAGS
 rs79687004    2    5501148    0    5501148    5501148        0 NONE
  rs1404257    2    5503200    1    5498227    5503200    4.973 rs10929518
 rs61268182    2    5504924    2    5491746    5504924   13.178 rs72776977|rs955146
 rs17356301    2    5506274    0    5506274    5506274        0 NONE

I see things like rs79687004. In its 'TAGS' column it has NONE. So its not serving as a tagging SNP for anything. Fair enough, but when I search for it in the file this is the ONLY instance of 'rs79687004'. Its not being listed in some other SNP's TAGS column either. Heres where I am confused: if everything in prune.in is a SNP that correlated to something else at an r2 > .7, wouldnt every single SNP either be identified as tagging something or tagged by something in this output? Ie shouldnt a SNP that is not tagging anything, in addition to having its on line in the output ALSO appear in some other line as a tagged SNP?

The only thing I can think as an explanation is that my scanning window used during --pairwise-indep was 225 SNPs, and this file was a collection of some distant loci in a single chromosome, so maybe there were instances where the --indep-pairwise window fell over a VERY distantly located set of SNPs and found they were correlated, so they went into the 'prune.in' file. But then, during the --show-tags analysis there is a 240kb default 'window' where SNPs further apart than that are not examined for tagging. but I find it hard to swallow that I would get so many lines where a SNP is not tagging OR tagged just because some comparisons won't be made the second time around in --show-tags.

What am I missing here? I've seen other peoples output from the same process posted with "None" in the TAGS column, but no one seems to bat an eye. I feel like I am overlooking something obvious here and I was hoping you could point it out to me.

Many thanks in advance!

plink SNP • 3.1k views
ADD COMMENT
3
Entering edit mode
5.7 years ago

It's the .prune.out SNPs which are guaranteed to have r^2 > .7 with another SNP here. A .prune.in SNPs might tag other SNPs, or it might just tag itself.

ADD COMMENT
0
Entering edit mode

Aah. Thank you! That is extremely helpful.

While I'm at it can I ask you a follow up question? If I am trying to create a list of tagging SNPs with which to work, and thus a minimal SNP dataset running

$ plink --file input.plink --show-tags input.prune.out--list-all --tag-r2 .7 --out mytags_and_what_they_tag

on the 'prune.out' file would provide me with a list of SNPs and what they tag? Is there any criteria that plink uses to determine which is the tag and which is tagged? Or would I have to use something like Haploview's tagger for that? It really doesnt matter which one I use as a tagging SNP, though it would be nice to have some criteria such as 'lowest MAF' or 'highest MAF' to determine it.

I see that in 2014 it another biostar user's question on tagging SNP selection was answered by suggesting they use the extract function:

plink --bfile Affy550K --indep-pairwise 100 10 0.8
plink --bfile Affy550K --extract plink.prune.in --make-bed --out Affy550KPruned

To automatically chose the SNP with the higher MAF, but that may not be a viable solution: I've already filtered on MAF > 5%...if that complicates things. Also, converting to a bed file causes some trouble as I need a file I can easily parse in perl when this is done, and the '--recode vcf' function in plink 1.9 (Correction: this version is actually 1.07, please see comment below) doesnt seem to be working, giving an error message "trouble parsing command line". So parsing the '--show-tags' file may still be the best option, it would be nice if there was some way to kill two birds with one stone with '--show-tags'.

ADD REPLY
1
Entering edit mode
  1. --indep and --indep-pairwise preferentially keep higher-MAF SNPs. (You can change this to an arbitrary priority order by using --read-freq to feed plink fake MAFs.)
  2. Can you post your failing "--recode vcf" command line, and the full .log file?
ADD REPLY
0
Entering edit mode

Thank you again! Regarding the --recode problem, it turns out this version of plink is the old version, not 1.9. I need to strip the old software off of my computer to avoid this in the future.

Regarding the calculations of R2 that produce prune.in and prune out files, is there any difference in the output save speed between version 1.9 and 1.07?

ADD REPLY
1
Entering edit mode

There should be no difference, other than very rare instances of MAFs being considered to be "tied" by one version and microscopically different by the other.

ADD REPLY
0
Entering edit mode

Good to know. By the way, if plink is making choices based on MAF, is there an easy way to get it to print out this value? I'd like to get the SNP ID, its locus, and MAF all in one file if possible.

ADD REPLY
1
Entering edit mode

With plink 2.0, "plink2 --bfile ... --freq cols=+pos" will do the trick. (plink 1.9 --freq does not provide a way to add the POS column.)

ADD REPLY
0
Entering edit mode

Wonderful! Thank you again!

One last question... if do to an error in parsing a vcf I have some duplicate SNPs, I know an easy solution is to simply use --exclude duplicate_list.txt in plink 1.07 as plink 1.07 will simply ignore all but one of those instances of the SNP when it does it thing, but in version 1.9 both instances (or more I assume) will be ignored if its given a list of SNPs with the --exclude function.

What is the situation with plink 2.0? Is there an easy way in either 1.9 or 2.0 to exclude all but one instance of a duplicate SNP at the command line? These programs are so much faster I'd really like to use them with a file that has some duplicates I need ignored.

Many thanks for all of your help by the way, it really has been tremendous!

ADD REPLY
1
Entering edit mode

"plink2 --bfile ... --rm-dup --make-bed --out ..."

This also verifies that genotypes are identical between the duplicate-ID variants (and errors out when this isn't true). "plink2 --help rm-dup" lists other options for handling genotype mismatches.

ADD REPLY
0
Entering edit mode

That worked! Thank you for all of your help. You have been tremendous.

ADD REPLY

Login before adding your answer.

Traffic: 2065 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6