Hello all,
I was a little confused by this plink output im producing. It was my understanding that --indep-pairwise analysis using a cutoff value for r2 would identify all the SNPs in a .ped file with an r2 correlation value of >.7
$ ./plink --file input.plink --indep-pairwise 225 23 .7 --out plinkoutput1
SNPs with a r2 >.7 would have their IDs saved to a file 'prune.in' while those with a r2 < .7 would be saved to a second file called 'prune.out'. Now, if you want to get the tagging SNPs and a list of all the SNPs they tag, you run something along the lines of
$ plink --file input.plink --show-tags my_tags.txt --list-all --tag-r2 .7 --out mytags_and_what_they_tag
So it makes sense that you could use the prune.in list of SNPs for the list of tags right?
$ plink --file input.plink --show-tags input.prune.in --list-all --tag-r2 .7 --out mytags_and_what_they_tag
and in the output every SNP should appear as either a tag for something, or something that is being tagged by another snp. But when I look at the file:
SNP CHR BP NTAG LEFT RIGHT KBSPAN TAGS
rs79687004 2 5501148 0 5501148 5501148 0 NONE
rs1404257 2 5503200 1 5498227 5503200 4.973 rs10929518
rs61268182 2 5504924 2 5491746 5504924 13.178 rs72776977|rs955146
rs17356301 2 5506274 0 5506274 5506274 0 NONE
I see things like rs79687004. In its 'TAGS' column it has NONE. So its not serving as a tagging SNP for anything. Fair enough, but when I search for it in the file this is the ONLY instance of 'rs79687004'. Its not being listed in some other SNP's TAGS column either. Heres where I am confused: if everything in prune.in is a SNP that correlated to something else at an r2 > .7, wouldnt every single SNP either be identified as tagging something or tagged by something in this output? Ie shouldnt a SNP that is not tagging anything, in addition to having its on line in the output ALSO appear in some other line as a tagged SNP?
The only thing I can think as an explanation is that my scanning window used during --pairwise-indep was 225 SNPs, and this file was a collection of some distant loci in a single chromosome, so maybe there were instances where the --indep-pairwise window fell over a VERY distantly located set of SNPs and found they were correlated, so they went into the 'prune.in' file. But then, during the --show-tags analysis there is a 240kb default 'window' where SNPs further apart than that are not examined for tagging. but I find it hard to swallow that I would get so many lines where a SNP is not tagging OR tagged just because some comparisons won't be made the second time around in --show-tags.
What am I missing here? I've seen other peoples output from the same process posted with "None" in the TAGS column, but no one seems to bat an eye. I feel like I am overlooking something obvious here and I was hoping you could point it out to me.
Many thanks in advance!
Aah. Thank you! That is extremely helpful.
While I'm at it can I ask you a follow up question? If I am trying to create a list of tagging SNPs with which to work, and thus a minimal SNP dataset running
on the 'prune.out' file would provide me with a list of SNPs and what they tag? Is there any criteria that plink uses to determine which is the tag and which is tagged? Or would I have to use something like Haploview's tagger for that? It really doesnt matter which one I use as a tagging SNP, though it would be nice to have some criteria such as 'lowest MAF' or 'highest MAF' to determine it.
I see that in 2014 it another biostar user's question on tagging SNP selection was answered by suggesting they use the extract function:
To automatically chose the SNP with the higher MAF, but that may not be a viable solution: I've already filtered on MAF > 5%...if that complicates things. Also, converting to a bed file causes some trouble as I need a file I can easily parse in perl when this is done, and the '--recode vcf' function in plink 1.9 (Correction: this version is actually 1.07, please see comment below) doesnt seem to be working, giving an error message "trouble parsing command line". So parsing the '--show-tags' file may still be the best option, it would be nice if there was some way to kill two birds with one stone with '--show-tags'.
Thank you again! Regarding the --recode problem, it turns out this version of plink is the old version, not 1.9. I need to strip the old software off of my computer to avoid this in the future.
Regarding the calculations of R2 that produce prune.in and prune out files, is there any difference in the output save speed between version 1.9 and 1.07?
There should be no difference, other than very rare instances of MAFs being considered to be "tied" by one version and microscopically different by the other.
Good to know. By the way, if plink is making choices based on MAF, is there an easy way to get it to print out this value? I'd like to get the SNP ID, its locus, and MAF all in one file if possible.
With plink 2.0, "plink2 --bfile ... --freq cols=+pos" will do the trick. (plink 1.9 --freq does not provide a way to add the POS column.)
Wonderful! Thank you again!
One last question... if do to an error in parsing a vcf I have some duplicate SNPs, I know an easy solution is to simply use --exclude duplicate_list.txt in plink 1.07 as plink 1.07 will simply ignore all but one of those instances of the SNP when it does it thing, but in version 1.9 both instances (or more I assume) will be ignored if its given a list of SNPs with the --exclude function.
What is the situation with plink 2.0? Is there an easy way in either 1.9 or 2.0 to exclude all but one instance of a duplicate SNP at the command line? These programs are so much faster I'd really like to use them with a file that has some duplicates I need ignored.
Many thanks for all of your help by the way, it really has been tremendous!
"plink2 --bfile ... --rm-dup --make-bed --out ..."
This also verifies that genotypes are identical between the duplicate-ID variants (and errors out when this isn't true). "plink2 --help rm-dup" lists other options for handling genotype mismatches.
That worked! Thank you for all of your help. You have been tremendous.