Hello - please could you help me? I can only do very basic bioinformatics!
I have been given a multi sample VCF file on a linux server.
Samples
There are 5236 samples / patients.
Question
I am interested in the variants of 38 patients. I have the patients' ID numbers. How do I extract them? Do I use bcftools? (That's installed.) As it is on a secure server, I don't have permission to install software.
Thank you!
Thank you 4galaxy77 - that is really kind. I tried searching a lot yesterday but didn't get far. I suppose it's the novice not knowing what to look for. The shouting is more my desperation in trying to understand bioinformatics, not aimed at anyone. Apologies.
If the VCF file has the patient IDs as its sample IDs, you should be able to use
bcftools view -Ov -S <ids_file_with_one_id_per_line> input.vcf > output.vcf
If your patient IDs don't match the sample IDs in the VCF, you'll need to find the sample IDs that correspond to your patient IDs and then do the above.
To view all the sample IDs in your VCF file, use:
bcftools query -l input.vcf
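Before running the `-S` extraction, it can help to check that every patient ID on your list really is a sample ID in the VCF. As far as I know, `bcftools view -S` stops with an error when the list contains an ID that is not in the header (unless `--force-samples` is given). The sketch below is only an illustration: `toy.vcf`, `ids.txt` and the `P00x` IDs are made-up stand-ins, and it reads the sample names straight from the `#CHROM` header line of an uncompressed VCF with standard tools.

```shell
# Toy stand-ins (hypothetical file names and IDs) so the sketch runs on its own.
printf '##fileformat=VCFv4.2\n' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tP001\tP002\tP003\n' >> toy.vcf
printf '1\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1\n' >> toy.vcf
printf 'P001\nP003\nP999\n' > ids.txt

# Sample IDs are columns 10 onward of the #CHROM header line.
# With bcftools, `bcftools query -l toy.vcf` should give the same list.
awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' toy.vcf \
  | sort -u > vcf_samples.txt

# IDs on your list that are absent from the VCF.
sort -u ids.txt | comm -23 - vcf_samples.txt > missing_ids.txt

# IDs on your list that are present in the VCF.
sort -u ids.txt | comm -12 - vcf_samples.txt > found_ids.txt

cat missing_ids.txt
```

If `missing_ids.txt` is empty, the original list can go to `-S` as it is. Otherwise, the IDs in it are the ones that need mapping to the VCF's own sample IDs.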
I can't thank you enough!!!!!! thank you so so so much. This worked like a dream for me. Such a relief!!! Julia :)
Could I ask for help again? I now need to do it the other way around. I have variants that I am interested in, and I need to extract from the multi sample VCF file the patient IDs that these variants match to. Should I just create a list_of_variants.txt? What would the format for that txt file be (position ref alt, one per line)? Thank you again, Julia
You should look at the -R and -T options. Start small - use a file with 3-5 loci. Once you get that working, expand to your full set of loci.
Thank you very much for that. I will give that a go. Julia
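On the file format for `-R` and `-T`: both accept a tab-delimited file with the chromosome in the first column and the 1-based position in the second. The sketch below turns a small variant list into such a file. The layout of `list_of_variants.txt` (chromosome, position, ref, alt separated by spaces) and all file names are assumptions for illustration only.

```shell
# Hypothetical variant list: chromosome, position, ref, alt, separated by spaces.
printf '1 1000 A G\n2 2500 C T\n1 1000 A G\n' > list_of_variants.txt

# Keep CHROM and POS, tab-separated, sorted and with duplicates removed.
awk '{ print $1 "\t" $2 }' list_of_variants.txt | sort -u -k1,1 -k2,2n > targets.tsv
cat targets.tsv

# Then, on the real data (not run here):
#   bcftools view -T targets.tsv input.vcf > subset.vcf
# -R jumps to the positions via the index, so it needs a bgzipped, indexed file:
#   bcftools view -R targets.tsv input.vcf.gz > subset.vcf
```

Two things to watch: the chromosome names must be written the same way as in the VCF (`1` versus `chr1`), and these options select by position only, so it is worth checking REF and ALT in the output against your list.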
Thank you Ram for your help, that has worked for me. Could I please ask for some further advice? The file itself has the results of all the >5000 patients for the specific "position ref alt" I am looking at, and then lists all the patients, one per tab-separated column, with their GT at the position. I would prefer not to filter the het/hom manually. I have tried various commands suggested on here but with no success. I have tried (no change to the file when I look at the output; all GT values, whether 0/0, 0/1 or 1/1, are still included):
I also tried this (but this removes all my GT information, including my info on position, reference etc ...).
Any other command I could use?
many thanks, Julia
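One possible way to keep only the patients who carry a variant (0/1 or 1/1) is to print one line per variant and sample, and drop the 0/0 and missing genotypes while doing so. The sketch below does this with `awk` on a made-up uncompressed VCF. `toy.vcf`, `carriers.tsv` and the sample IDs are hypothetical, and it assumes every record has a GT entry in its FORMAT column.

```shell
# Toy VCF (hypothetical IDs and genotypes) so the sketch runs on its own.
printf '##fileformat=VCFv4.2\n' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tP001\tP002\tP003\n' >> toy.vcf
printf '1\t1000\t.\tA\tG\t.\tPASS\t.\tGT:DP\t0/0:10\t0/1:12\t1/1:9\n' >> toy.vcf
printf '2\t2500\t.\tC\tT\t.\tPASS\t.\tGT:DP\t./.:0\t0/0:8\t1|0:7\n' >> toy.vcf

# One output line per (variant, sample) whose genotype has a non-reference allele.
awk -F'\t' '
  /^##/     { next }                                            # skip meta lines
  /^#CHROM/ { for (i = 10; i <= NF; i++) sample[i] = $i; next } # remember sample names
  {
    n = split($9, fmt, ":"); gt_idx = 0
    for (k = 1; k <= n; k++) if (fmt[k] == "GT") gt_idx = k     # locate GT in FORMAT
    if (gt_idx == 0) next
    for (i = 10; i <= NF; i++) {
      split($i, fields, ":")
      gt = fields[gt_idx]
      if (gt ~ /[1-9]/) print $1, $2, $4, $5, sample[i], gt     # any ALT allele present
    }
  }
' OFS='\t' toy.vcf > carriers.tsv
cat carriers.tsv

# A bcftools route that should give a similar table (not run here; please check
# it against the manual for your installed version):
#   bcftools query -i 'GT="alt"' -f '[%CHROM\t%POS\t%REF\t%ALT\t%SAMPLE\t%GT\n]' subset.vcf
```

The columns of `carriers.tsv` are chromosome, position, ref, alt, sample ID and genotype, so the patient IDs for a given variant are simply the fifth column of its rows.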
I would keep exploring bcftools - unfortunately, I cannot help you more than this as we'd need quite a bit of back and forth. You're on the right track though. Just keep in mind that at some point, you may want to get data in a tabular format using bcftools query and then move to R to make calculations easier. Counting, grouping etc become a lot easier when you're working with statistical/data management software.
Hello Ram, that is really helpful. Thank you so much. Ok, I need to familiarise myself with R as well then, makes sense. I'll persist with bcftools. I need to grow my confidence in using these softwares. Julia
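As a rough illustration of the tabular route: a command such as `bcftools query -f '[%CHROM\t%POS\t%REF\t%ALT\t%SAMPLE\t%GT\n]' subset.vcf > genotypes.tsv` writes one row per variant and sample, and simple counting can then be done before ever opening R. The sketch below uses a made-up `genotypes.tsv` in that layout (file names, IDs and genotypes are hypothetical) and tallies genotype classes per variant with `awk`.

```shell
# Toy long-format table: CHROM, POS, REF, ALT, SAMPLE, GT (hypothetical values).
printf '1\t1000\tA\tG\tP001\t0/0\n'  > genotypes.tsv
printf '1\t1000\tA\tG\tP002\t0/1\n' >> genotypes.tsv
printf '1\t1000\tA\tG\tP003\t1/1\n' >> genotypes.tsv
printf '2\t2500\tC\tT\tP001\t./.\n' >> genotypes.tsv
printf '2\t2500\tC\tT\tP002\t0/0\n' >> genotypes.tsv
printf '2\t2500\tC\tT\tP003\t1|0\n' >> genotypes.tsv

# Output columns: CHROM, POS, REF, ALT, n_hom_ref, n_het, n_hom_alt, n_missing
awk -F'\t' '
  {
    key = $1 "\t" $2 "\t" $3 "\t" $4
    n = split($6, al, "[/|]")                    # alleles of this genotype
    cls = "hom_alt"
    for (i = 1; i <= n; i++) {
      if (al[i] == ".") { cls = "missing"; break }
      if (al[i] != al[1]) cls = "het"            # two different alleles
    }
    if (cls == "hom_alt" && al[1] == "0") cls = "hom_ref"
    seen[key] = 1
    count[key, cls]++
  }
  END {
    for (key in seen)
      print key, count[key, "hom_ref"] + 0, count[key, "het"] + 0, \
                 count[key, "hom_alt"] + 0, count[key, "missing"] + 0
  }
' OFS='\t' genotypes.tsv | sort -k1,1 -k2,2n > genotype_counts.tsv
cat genotype_counts.tsv
```

The same long table loads directly into R as a data frame, which is where grouping and further statistics become more comfortable.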