2.8 years ago
Zhitian Wu
Hi, I am using PLINK to perform quality control of my genotype data. All of the genotypes are homozygous but some are labelled as missing.
This is what my data looks like:
SNP1 A T C G C N
But PLINK requires the genotype to be bi-allelic, so I want it to look like this:
SNP1 A A T T C C G G C C N N
There are more than 10 million SNPs, so I wonder what the most efficient way to do this is. So far, I only know a little about sed and regex, and this is my code:
sed -i 's/\([ATCGN]\)\>/\1\t\1/g' chr05.tped
This would change the N in SNP1 too. For this example, I could come up with this:
What is the field separator between SNP1 and bases? Do not use -i when you are not sure of the code.
Thanks for your reply. The content of this file is actually quite simple, so I added a word anchor \> to avoid the expression matching the first column (SNP id). The field separator is TAB; I typed 3 more spaces between SNP1 and bases to make it look nicer here.
I can understand your expression, but I think adding a TAB and the same letter would be faster than replacing the original letter. Is it possible to do this without replacement?
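One possible way to avoid regex matching altogether is to work on fields rather than characters. The following is only a sketch, assuming the file is TAB-separated and that allele columns start at column 2, as in the example above. The function name double_alleles is made up for illustration. In a standard PLINK .tped the first four columns are chromosome, SNP id, genetic distance and position, so the loop would have to start at 5 there.

```shell
# Hypothetical helper: repeat every allele column once, leaving the
# leading SNP id column untouched. Reads the named files, or stdin
# when no file is given, and writes to stdout (no in-place editing).
double_alleles() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             # Assumption: alleles start at column 2; use 5 for a full .tped
             for (i = 2; i <= NF; i++) $i = $i OFS $i
             print
         }' "$@"
}

# Demo on the single line from the question
printf 'SNP1\tA\tT\tC\tG\tC\tN\n' | double_alleles
```

Because the result goes to stdout, it can be checked on a few lines first (for example with head) and then redirected to a new file, instead of overwriting the original with -i.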
Do not post images of the data.