Entering edit mode
3.6 years ago
shivangi.agarwal800
▴
120
Hello all,
I have two compare two excel files containing variants in sample 1 and sample 2. For this, which columns we have to take into consideration to find match and mismatch between two files.
Kindly suggest.
Files contain columns as :
Locus Genotype Filter Ref Observed Allele Type No Call Reason Genes Location Length Copy Number CytoBand Info Variant ID Variant Name % Frequency Strand Exon Transcript Coding Amino Acid Change Variant Effect PhyloP SIFT Grantham PolyPhen FATHMM PFAM dbSNP DGV MAF EMAF AMAF GMAF UCSC Common SNPs ExAC LAF ExAC EAAF ExAC OAF ExAC EFAF ExAC SAAF ExAC ENFAF ExAC AAF ExAC GAF COSMIC OMIM Gene Ontology DRA DrugBank ClinVar Allele Coverage Allele Ratio CNV Confidence p-value Phred QUAL Score Coverage CNV Precision Ref+/Ref-/Var+/Var- Homopolymer Length Tiles Subset of
It sounds like you might want
bcftools isec
; https://samtools.github.io/bcftools/bcftools.html#isecI have recently been using it to compare variants between multiple samples. In particular, the
sites.txt
file that gets output has the list of all variant regions specified in the input regions list, and details which samples had that mutation. It also outputs the variants unique to each sample.Note that it seems like some older versions of
bcftools
did not produce thesites.txt
when less than 3 samples were being compared, though that seems to be updated in newer versions.Also note that
bcftools isec
requires .vcf input files, so convert your files into that formatWill the bcftool accept input files in xlsx format..?
It would be best to use @steve's solution with VCF files if you have those. Excel files can't be used with
bcftools
(or many other standard VCF tools).If you have xlsx files, i would suggest to use MS access for such complex joins. Unless you post content of the xlsx file here (2-3 lines, not the images) and explain what you are trying to achieve, it is difficult to address the issue.
@steve unfortunately following bit in original posting makes it so that your solution can't be applied in this case (unless OP has VCF files available). I will move your answer to a comment for now.
well the solution is to convert the files to vcf format of course :)
data scrubbing and conversion is pretty par for the course with bioinformatics
Than you so much Steve.
I tried with .VCF files but I am not getting sites.txt file
bcftools 1.12 Using htslib 1.12 Copyright (C) 2021 Genome Research Ltd. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.
without posting input files, it is difficult to address the issue. Take two small vcf files (subset from your vcf files) first, with overlapping records and see if isec works. Then extend it to entire vcf files.
As mentioned by many people here, BCFtools isec is a good way, but you need a VCF file for that.
There is another way which can utilise your excel file. You just need to create an additional column and make a custom variant specific ID in the form- Locus OR Location:Ref:Oberved Allele. You can use the CONCAT() of excel to make this ID.
I can see that there is a column names Variant ID as well which may just contain a rsid (reference SNP ID number). The issue with it is that not all variants will have a rsid (especially if its a private variant). In those cases, a custom ID is helpful.
You make the custom ID for both of your excel files and just compare the ID column between the two files (venny can be helpful here). You can make an extremely specific ID as well by adding Gene name/Exon, etc.
This can also be easily achieved using R.
To add to prasundutta87's comment, if you can create an ID out of CHROM, POS, REF and ALT, you won't need any other field. The combination of these 4 will always be unique for normalized (multi-allelics decomposed) VCF data.
I agree. Adding information like gene names/trascript ID/exon, etc. is helpful when you have two files with variants annotated by two different variant annotation software. However, in this case, the basic custom ID described above will be enough..
Hello Guys,
You are right but I am able to successfully run bcftools, there is no issue but only problem is I am not getting sites.txt which was expected to be generated with version 1.12 as mentioned in the above replies.
Check out the bcftools manual (https://samtools.github.io/bcftools/bcftools.html#isec) in order to see if you are running the command correctly or it is giving output in some other format..it is also possible that there may not be any common sites. You can try checking if you find common sites using a custom ID as well..