Question

Subsetting individuals in PLINK when IDs contain underscores "_"

7.6 years ago

GabrielMontenegro ▴ 680

I am working with a VCF dataset and many individuals have IDs including an underscore.

I want to subset them using plink, but I keep getting this error:

Error: More than two instances of '_' in sample ID.

Is there a way for plink to ignore the underscores in those IDs and treat them as a single ID?

Thanks

This is an example of the IDs in my VCF file:

S_Eskimo_Sireniki-1.Sir26

plink • 4.0k views

updated 7.6 years ago by pfs ▴ 280 • written 7.6 years ago by GabrielMontenegro ▴ 680

score 0 · Answer 1 · 2017-09-26

The PLINK documentation recommends using a character other than an underscore to separate the parts of a compound ID. Why not just use sed to change the underscores in your sample IDs to another character?

The passage below is from the PLINK documentation (https://www.cog-genomics.org/plink2/input):

The family and within-family IDs default to 'FAM001' and 'ID001' respectively if you don't provide them. Due to how the PLINK 1 binary fileset format is defined, they cannot contain spaces. Since some PLINK commands merge the family ID and within-family ID with an underscore in their reports, we recommend using another character (such as '~') to separate compound name components. (If you don't have to distinguish between e.g. 'Mac Donald' and 'MacDonald', upper CamelCase will also do.)
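The sed suggestion could look something like the sketch below. It assumes an uncompressed VCF and only touches the #CHROM header line, which is where the sample IDs are listed, so underscores in variant IDs or INFO fields stay as they are. The function name rename_vcf_samples and the file names in the comments are made up for illustration, and '~' is just the replacement character the PLINK documentation mentions.

```shell
# Hypothetical helper: reads a VCF on stdin, writes it to stdout with every
# underscore on the #CHROM header line replaced by '~'.
# The fixed column names on that line (#CHROM, POS, ID, ...) contain no
# underscores, so only the sample IDs are affected.
rename_vcf_samples() {
    sed '/^#CHROM/ s/_/~/g'
}

# Example usage (input.vcf and renamed.vcf are placeholder names):
#   rename_vcf_samples < input.vcf > renamed.vcf
# For a gzipped VCF, something like:
#   gunzip -c input.vcf.gz | rename_vcf_samples > renamed.vcf
```

With this, an ID such as S_Eskimo_Sireniki-1.Sir26 would become S~Eskimo~Sireniki-1.Sir26, and any list of individuals used for subsetting would need the renamed IDs too. The PLINK 1.9 documentation also describes --double-id and --const-fid for converting VCF sample IDs, which may make the renaming unnecessary; check the documentation for your version.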