Question

Using Plink to remove individuals in a downloadable dataset

0

8.3 years ago

cinnie83 • 0

Hi there,

My goal is to perform a GWAS on RILs, which have all the genotypic information available online in tgeno, bed, bim, fam and vcf formats. I am trying to reduce the size of these files so that they cover only a subset (~50) of these RILs rather than the full panel of 205.

I created a ped file and then ran this on the command line: plink --file dgrp --keep AllLinesExp.txt

The AllLinesExp.txt file is a list of all the RIL lines (Family ID/Sample ID) that I will be conducting an experiment on. About an hour into the analysis I received the output, with these two annoying lines midway through:

Reading individuals to keep [ AllLinesExp.txt ] ... 0 read

205 individuals removed with --keep option

I tried a txt file that just had Family IDs, e.g.:

line_21

line_40

line_65

and then I tried a txt file with Family ID/Sample ID, e.g.:

line_21/line_21

line_40/line_40

line_65/line_65

Both of these txt files yield 0 individuals read, so all 205 are removed and the analysis stops.

Please help :)

Plink • 5.2k views

6.7 years ago by cinnie83 • 0

1

In your Family ID/Sample ID file, did you have a literal '/' between the IDs, or did you have a space or tab? (Space or tab should work; I'd need more information to figure out what went wrong in that case.)
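Whether the list will match can be checked before committing to a long run. Here is a minimal sketch in Python, assuming the list is meant to be whitespace-delimited with the Family ID in the first column and the Sample ID in the second, and that the IDs are compared against the first two columns of the .fam file. The function name `count_keep_matches` and the sample rows are made up for illustration.

```python
def count_keep_matches(fam_lines, keep_lines):
    """Return how many keep-list lines name a sample present in the .fam data."""
    # The first two whitespace-delimited fields of a .fam line are FID and IID.
    known = set()
    for line in fam_lines:
        fields = line.split()
        if len(fields) >= 2:
            known.add((fields[0], fields[1]))

    matched = 0
    for line in keep_lines:
        fields = line.split()
        # "line_21/line_21" contains no whitespace, so it yields a single
        # field and can never match an (FID, IID) pair.
        if len(fields) >= 2 and (fields[0], fields[1]) in known:
            matched += 1
    return matched


# Hypothetical .fam rows in the style of the IDs quoted in this thread.
fam = ["line_21 line_21 0 0 0 -9", "line_40 line_40 0 0 0 -9"]

print(count_keep_matches(fam, ["line_21/line_21", "line_40/line_40"]))    # 0
print(count_keep_matches(fam, ["line_21\tline_21", "line_40\tline_40"]))  # 2
```

A count of 0 here would correspond to the "0 read" message in the question.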

8.3 years ago by chrchang523 11k

0

I had a slash... Am going to try a space or tab now :)

8.3 years ago by cinnie83 • 0

0

With a space, only 1 RIL was read, but a tab worked and read in my subset of 55 RILs! Thank you!!

8.3 years ago by cinnie83 • 0

0

Hi Cinnie83, I am trying to remove some patients from my data, using the original PED file. I created a .txt file with the family IDs and patient IDs that I want to remove, in two columns, but it still doesn't work. The analysis seems to run to the end of the process (creating temporary files), and then this message appears: Error: duplicates ID.

My command is: $ ./plink --file name --remove IDlist.txt --out subset2 --make-bed

And my IDlist.txt is:

1 2204 2 1146

I know I have a few duplicates, but I don't understand why their presence prevents the removal from working.

How did you sort out your problem? Do you mind explaining here?

6.7 years ago by Ginevra ▴ 10

0

Hi sorry apparently I replied in a new convo down there ˯˯˯

6.7 years ago by cinnie83 • 0

0

In addition to my reply below, I had a quick look around biostars and it seems that if the IDs in the VCF file contain any hyphen characters, they must match the text file exactly or it won't work. Also, if using a Mac, you can select which encoding is used for the text file (unsure about PC). Try changing the encoding from the default when you save the file, e.g. to Western (Windows), as that's worked for someone else on here.

6.7 years ago by cinnie83 • 0

score 0 · Answer 1 · 2018-03-12

Hello and welcome to the wonderful world of Plink!

I had a bit of a different issue than you, with 0 samples being read in rather than duplicates being found...
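For the duplicates case, it may help to see exactly which Family ID / Sample ID pairs occur more than once before trying to remove anything. Here is a minimal sketch in Python, assuming the pair is taken from the first two whitespace-delimited columns of each .ped or .fam line. The function name `duplicated_ids` and the sample rows are invented for illustration and are not taken from the real data.

```python
from collections import Counter


def duplicated_ids(lines):
    """Return (FID, IID) pairs that appear on more than one line, in first-seen order."""
    counts = Counter()
    for line in lines:
        # Only the first two fields matter; maxsplit avoids splitting the
        # (possibly very long) genotype part of a .ped line.
        fields = line.split(None, 2)
        if len(fields) >= 2:
            counts[(fields[0], fields[1])] += 1
    return [pair for pair, n in counts.items() if n > 1]


# Hypothetical rows: the pair ("fam1", "ind1") occurs twice.
rows = [
    "fam1 ind1 0 0 1 -9",
    "fam2 ind2 0 0 2 -9",
    "fam1 ind1 0 0 1 -9",
]
print(duplicated_ids(rows))  # [('fam1', 'ind1')]
```

To run it on a real file, pass an open file handle instead of the list, since iterating over a file yields its lines.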

However, here is the process that worked for me in the end: I used the original FAM file to produce the Family ID and Sample ID columns, so that they were exactly the same in the Plink file and in the text file used to subset the PED file. In mine, the IDs were called 'line_21', 'line_40' etc., and the Family ID and Sample ID were identical, so my text file looked like this (first 2 lines listed):

line_21 line_21

line_40 line_40

... etc listing more lines. It is supposed to work whether you use a space or a tab between your columns - mine only worked using a tab.
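Copying the IDs out of the FAM file can also be scripted, which avoids retyping them. Here is a minimal sketch in Python, assuming the .fam file is whitespace-delimited with the Family ID and Sample ID in its first two columns. The function names `build_keep_lines` and `write_keep_file` and the sample rows are illustrative only.

```python
def build_keep_lines(fam_lines, wanted_sample_ids):
    """Return 'FID<TAB>IID' strings for the wanted samples, copied verbatim from .fam rows."""
    wanted = set(wanted_sample_ids)
    keep = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[1] in wanted:
            # Tab-delimited, because that is what worked in this thread.
            keep.append(fields[0] + "\t" + fields[1])
    return keep


def write_keep_file(path, keep_lines):
    """Write the list as plain ASCII with Unix line endings and no header row."""
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        for line in keep_lines:
            handle.write(line + "\n")


# Hypothetical .fam rows in the style of the IDs above.
fam = [
    "line_21 line_21 0 0 0 -9",
    "line_40 line_40 0 0 0 -9",
    "line_65 line_65 0 0 0 -9",
]
print(build_keep_lines(fam, ["line_21", "line_65"]))
# ['line_21\tline_21', 'line_65\tline_65']
```

Because the IDs are copied from the .fam rows rather than typed by hand, any hyphens or underscores in them come through unchanged.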

There were no column names, just 2 columns (Family ID, Sample ID) listing 55 samples. Should the first line of your IDlist.txt file instead say:

2204 1146

? Or were you just putting in 1 and 2 for my benefit to know they were separate IDs? Or are they part of the ID (should be joined by underscore instead of a space)?

You might have some hidden characters or stray spaces in that .txt file - if you are using a Mac I highly recommend TextWrangler, a really simple text editor that lets you see any hidden characters in your file.
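Hidden characters can also be looked for with a short script. Here is a minimal sketch in Python that reads the file as raw bytes and reports a few usual suspects. The list of checks is my own guess at what is worth ruling out, not something Plink documents, and the function name `find_hidden_characters` is made up.

```python
def find_hidden_characters(data: bytes):
    """Return human-readable warnings about bytes that may be worth ruling out."""
    warnings = []
    if data.startswith(b"\xef\xbb\xbf"):
        warnings.append("UTF-8 byte order mark at start of file")
    if data.startswith(b"\xff\xfe") or data.startswith(b"\xfe\xff"):
        warnings.append("UTF-16 byte order mark at start of file")
    if b"\r\n" in data:
        warnings.append("Windows (CRLF) line endings")
    elif b"\r" in data:
        warnings.append("bare carriage returns (classic Mac line endings)")
    # Any byte above 127 is outside plain ASCII, e.g. a non-breaking space
    # or a typographic dash pasted in from a word processor.
    if any(byte > 127 for byte in data):
        warnings.append("non-ASCII bytes present")
    return warnings


print(find_hidden_characters(b"line_21\tline_21\nline_40\tline_40\n"))
# []
print(find_hidden_characters(b"line_21\tline_21\r\nline_40\tline_40\r\n"))
# ['Windows (CRLF) line endings']
```

For a real file, the bytes can be obtained with `open("IDlist.txt", "rb").read()`.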

If you are not using a Mac - I know that on a PC I weirdly had to save my text file firstly as a .txt in Word, and then reopen it in Notepad, resave in that program as a .txt - and then it worked.
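The re-saving trick can be approximated in a script that rewrites the list in one plain form. Here is a minimal sketch in Python, assuming the file is UTF-8 (with or without a byte order mark) and that every non-empty line should end up as tab-separated fields with Unix line endings. The function name `normalise_id_file` is illustrative, and this is not a description of what Word or Notepad actually do on saving.

```python
def normalise_id_file(data: bytes) -> bytes:
    """Return the ID list as plain ASCII: tab-delimited fields, LF line endings."""
    # "utf-8-sig" drops a leading UTF-8 byte order mark if one is present.
    text = data.decode("utf-8-sig")
    # Turn non-breaking spaces into ordinary spaces before splitting.
    text = text.replace("\u00a0", " ")
    rows = []
    # splitlines() handles LF, CRLF and bare CR line endings alike.
    for line in text.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            rows.append("\t".join(fields))
    if not rows:
        return b""
    # Encoding as ASCII raises UnicodeEncodeError if an unexpected character
    # is still present, which is safer than writing it out silently.
    return ("\n".join(rows) + "\n").encode("ascii")


messy = b"\xef\xbb\xbfline_21 line_21\r\nline_40\xc2\xa0line_40\r\n"
print(normalise_id_file(messy))
# b'line_21\tline_21\nline_40\tline_40\n'
```

Writing the returned bytes with `open(path, "wb")` keeps the line endings exactly as produced.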

I had so many teething problems getting my GWAS to run. By the time I got it to work, I had run the analysis MANY times with dummy files (a much smaller subset of the data, so it ran quickly). I had ONE dummy .txt file that worked, and in the end I used that specific file by duplicating it and then manually typing in my samples. Plink seems to be ultra-niggly about the txt files used to pare down large genetic files.

I hope this is some help!