Hello! I need some help. I have a VCF file that was generated using a reference genome where the chromosomes are named with Roman numerals: chrI, chrII... chrM, chrV, etc. This means they are sorted alphabetically rather than numerically, so my chromosomes have a silly order, with chrM listed in the middle, for example (why, lord!).
I've tried renaming them using only digits and letters (1, 2, 3... M, X, Y) with bcftools annotate before I generate my bfiles using plink. The issue is that because my chrM is listed somewhere in the middle, when I try to make the bfiles, my BIM file stops when it reaches the M. This is the command I used:
bcftools norm -Ou -m -any $file.vcf.gz |
bcftools norm -Ou -f $ref |
bcftools annotate -Ob -x ID \
-I +'%CHROM:%POS:%REF:%ALT' |
plink --bcf /dev/stdin \
--keep-allele-order \
--const-fid \
--allow-extra-chr \
--make-bed \
--chr-set 24 \
--out $file
# I also tried --output-chr M
Is there a simple way to address this in the plink command? I'm also trying to figure out a way to sort my VCF so that chrM is listed last, but so far it has been a struggle and I must be thinking about this wrong! Ugh D:
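For the renaming step described above, a minimal sketch of building the map file that bcftools annotate --rename-chrs expects (one "old new" pair per line). The function name make_rename_map and the Roman numeral list are hypothetical; the list has to be extended and adjusted to match the actual assembly.

```shell
# Hypothetical helper: print an "old<TAB>new" rename map, one pair per line.
# The numeral list below is only an illustration; extend it for your assembly.
make_rename_map() {
    i=1
    for roman in I II III IV V VI VII VIII IX X; do
        printf 'chr%s\t%d\n' "$roman" "$i"
        i=$((i + 1))
    done
    # Mitochondrial contig, renamed to the single letter used in the post.
    printf 'chrM\tM\n'
}

make_rename_map
```

The output can be redirected to a file (say rename_chrs.txt) and passed to bcftools annotate --rename-chrs rename_chrs.txt. Renaming alone does not reorder the records in the VCF.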
Hi @chrchang523, thanks for the reply! I'm using plink 1.9 but the variants are not being sorted. At least what I have noticed is that in the BIM file the program gives the chrM the last number, but the order remains the same and the file ends there, so I only have: 1 (chrI), 2 (chrII), 3 (chrIII), 4 (chrIV), 9 (chrIX), M (chrM). My species has 24 chromosomes total (including chrM).
Please post or send me a VCF file that illustrates what you're talking about, along with the plink .log file.
Hi, @chrchang523. I went to check my VCF file and I've noticed that after my chrM I had renamed my chrUn as 0 while my reference file had it named U. I think that instead of skipping this part, the whole thing just stopped there, so I'm re-running this to check. I'm running into --memory issues now, so when I'm done I'll get back here to clarify if the issue still persists.
OK, so it was my fault! I had chrUn renamed differently in my VCF and REF files, so I think that is solved. However, now I'm trying to update the FIDs using the --update-ids command and I'm getting the error:
Invalid chromosome code '28' on line 40749796 of .bim file.
Which is my chrM. The weird thing is that I did use --chr-set 24. Hmmmm... now I think I understand the instructions for --chr-set. So I define only the number of autosomes, and the program will automatically recognize the rest as X, Y and M? And will it treat my data as human, even though I've defined a different set? So, how would you advise I treat my chrUn (unassigned)? Currently it's named just U. Should I assign it a number and treat it as an autosome?
That is what the --allow-extra-chr flag is for.
Awesome! Thank you so much for the help. When I defined --chr-set 20 (20 autosomes, excluding chrUn) and --allow-extra-chr, I was able to run the --update-ids command with no error.
Actually (sorry @chrchang523, this seems like a never-ending issue!), I've just checked my BIM output and there seems to be an issue with chrX. The output is like so:
When I renamed my chromosomes, I did have a chr21 and a chrX. It seems they are being conflated. Is there a way to prevent this?
That's due to your incorrect use of --chr-set 20.
Hmmmmmmmmmmmm... It's because of the way this genome was defined: chr19 is the chrX. So I renamed chrXIX to X, and therefore the numbering skips 19 altogether. I was counting 20 in total. Thanks so much!
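For anyone landing here later, a small sketch of the numbering behind both symptoms in this thread. In plink 1.9, --chr-set N treats 1..N as autosomes and assigns X, Y, XY and MT the codes N+1 to N+4. The function name chr_set_codes is hypothetical and only reproduces that arithmetic.

```shell
# Hypothetical helper: print the numeric codes plink 1.9 gives the special
# chromosomes for a given --chr-set autosome count (X, Y, XY, MT = N+1..N+4).
chr_set_codes() {
    n=$1
    printf 'X=%d Y=%d XY=%d MT=%d\n' $((n + 1)) $((n + 2)) $((n + 3)) $((n + 4))
}

chr_set_codes 24   # MT comes out as 28, the code in the error message
chr_set_codes 20   # X comes out as 21, which collides with a contig named 21
```

With --chr-set 24 the mitochondrial code is 28, which a later plink run would reject if it did not repeat the same --chr-set (that reading of the error is an assumption). With --chr-set 20, a chromosome named 21 and chrX end up sharing one code, which matches the conflation described above.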