Is it Possible to "Merge" data from Replicates within a VCF SNP File
2
0
Entering edit mode
6 months ago

I have VCF SNP dataset that contains 55K markers and 623 samples. We included in our analysis replicates of some samples that are coded with the exact same name. We also know there are replicates with different names as they came from separate sources.

Is it possible to fill in all the "missing" data in one replicate of the sample using the other replicate as the reference. I would want to only fill in missing data and not change anything else.

I have been able to write some R code that can do this to files in STRUCTURE format as that format is amenable to import into R.

I am wondering if it is possible to do something similar directly to the VCF file or to all the .bed,.fam,.bim files generated from PLINK from a VCF.

I have looked into VCFtools and bcfTools but the merge funtions and collate functions of those programs are for multiple file sets. That is not what I am trying to do. I am trying to alter data within a single file.

Here is my R code if this makes it any clearer what exatcly I am trying to do.

My DF is R_Test_DFSg60

##Replacing 0s in row 371 with the values from row 573 if not = 0, In STRUCTURE format
for (col in 2:ncol(R_Test_DFSg60)) {

   #Check if the value in row 371 of the current column is "0"

   if (R_Test_DFSg60[371, col] == 0) {

     # Check if the value in row 573 of the current column is not "0"
     if (R_Test_DFSg60[573, col] != 0) {

       # Check if the values in row 371 and row 573 of the current column are different
       if (R_Test_DFSg60[371, col] != R_Test_DFSg60[573, col]) {

         # Replace the value in row 371 of the current column with the value from row 573
         R_Test_DFSg60[371, col] <- R_Test_DFSg60[573, col]
       }
     }
   }
 }

#Then delete row 573, removing it from the dataset
snp vcf genomics plink • 1.1k views
ADD COMMENT
0
Entering edit mode
6 months ago

use bcftools merge with --force-samples and then use jvarkit https://jvarkit.readthedocs.io/en/latest/VcfFilterJdk/ to replace the no-call with another called genotype for the same sample.

bcftools merge --force-samples --file-list files.list |  java -jar  jvarkit.jar vcffilterjdk -f biostar.code 

with biostar.code:

final UnaryOperator<String> toSample = S->{
    int colon = S.indexOf(":");
    return colon==-1?S:S.substring(colon+1);
    };
final List<Genotype> genotypes = new ArrayList<>(variant.getNSamples());
for(Genotype gt1: variant.getGenotypes()) {
    if(!gt1.isNoCall()) {
        genotypes.add(gt1);
        }
    else
        {
        Genotype newGt = gt1;
        final String sn1 = toSample.apply(gt1.getSampleName());
        for(Genotype gt2: variant.getGenotypes()) {
            if(gt2.isNoCall()) continue;
            final String sn2 = toSample.apply(gt2.getSampleName());
            if(!sn1.equals(sn2)) continue;
            newGt = new GenotypeBuilder(gt2).name(gt1.getSampleName()).make();
            break;
            }
        genotypes.add(newGt);
        }
    }

return new VariantContextBuilder(variant).genotypes(genotypes).make();

and then downstream, you can use bcftools view --samples-file to remove the duplicate samples.

ADD COMMENT
0
Entering edit mode

Thank you Pierre, I am going to take some time to get my head around this and report back if it works.

ADD REPLY
0
Entering edit mode

Pierre,

I am a novice at all of this in general and I have never used Java-based programming before. I may not be able to properly wrangle this with my ability. I think I am mostly struggling with what would need to be coded to my dataset here and how to run this loop on my set in general. Thank you again for the help.

ADD REPLY
0
Entering edit mode

I think I am mostly struggling with what would need to be coded to my dataset here and how to run this loop on my set in general.

i don't understand your question. You 'just' have to execute this command line:

bcftools merge --force-samples --file-list files.list |  java -jar  jvarkit.jar vcffilterjdk -f biostar.code > out.vcf
ADD REPLY
0
Entering edit mode

Hi Pierre,

I am attempting to run this code and am running into this error:

(jvar) $ jvarkit vcffilterjdk  -f biostar.code Dart_VCF2_Original.vcf > Fileted_Dart2.vcf

[SEVERE][VcfFilterJdk]Cannot read non-existent file: file:///local/home/User/VCF_Filtering/biostar.code
htsjdk.samtools.SAMException: Cannot read non-existent file: file:///local/home/User/VCF_Filtering/biostar.code
        at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:498)
        at com.github.lindenb.jvarkit.io.IOUtils.slurpPath(IOUtils.java:868)
        at com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk.run(VcfFilterJdk.java:534)
        at com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk.doWork(VcfFilterJdk.java:800)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:819)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:982)
        at com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk.main(VcfFilterJdk.java:817)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at com.github.lindenb.jvarkit.tools.jvarkit.JvarkitCentral$Command.execute(JvarkitCentral.java:290)
        at com.github.lindenb.jvarkit.tools.jvarkit.JvarkitCentral.run(JvarkitCentral.java:777)
        at com.github.lindenb.jvarkit.tools.jvarkit.JvarkitCentral.main(JvarkitCentral.java:788)
[INFO][Launcher]vcffilterjdk Exited with failure (-1)

If I try to run it without the file input i get an error for not being able to find the #CHROM header.

Any more assistance would be greatly appreciated.

ADD REPLY
0
Entering edit mode

Cannot read non-existent file: file:///local/home/User/VCF_Filtering/biostar.code

your file doesn't exist, is at the wrong place etc.

ADD REPLY
0
Entering edit mode
5 months ago

another answer: I just wrote a program that only keep one genotype when the VCF comes from bcftools merge --force-samples

$ bcftools merge --force-samples src/test/resources/rotavirus_rf.vcf.gz src/test/resources/rotavirus_rf.freebayes.vcf.gz  |\
    grep "#CHROM" -A 3
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  S1  S2  S3  S4  S5  2:S5    2:S2    2:S4    2:S3    2:S1
RF01    243 .   A   C   0   .   DPB=28;EPPR=3.37221;GTI=0;MQMR=60;NS=5;NUMALT=1;ODDS=7.64661;PAIREDR=1;PQR=0;PRO=0;QR=408;RO=24;RPPR=3.37221;SRF=24;SRP=55.1256;SRR=0;DP=28;AB=0.5;ABP=3.0103;AF=0.1;AO=4;CIGAR=1X;DPRA=1.66667;EPP=11.6962;LEN=1;MEANALT=1;MQM=60;PAIRED=1;PAO=0;PQA=0;QA=68;RPL=4;RPP=11.6962;RPR=0;RUN=1;SAF=4;SAP=11.6962;SAR=0;TYPE=snp;AN=10;AC=1 GT:AD:AO:DP:PL:QA:QR:RO ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   0/0:4,0:0:4:0,12,63:0:68:4  0/0:7,1:1:8:0,7,92:17:119:7 0/1:2,2:2:4:20,0,20:34:34:2 0/0:7,1:1:8:0,7,92:17:119:7 0/0:4,0:0:4:0,12,63:0:68:4
RF01    280 .   A   C   0   .   DPB=24;EPPR=4.58955;GTI=0;MQMR=60;NS=5;NUMALT=1;ODDS=6.90616;PAIREDR=1;PQR=0;PRO=0;QR=374;RO=22;RPPR=4.58955;SRF=22;SRP=50.7827;SRR=0;DP=24;AB=0.4;ABP=3.44459;AF=0.1;AO=2;CIGAR=1X;DPRA=1.05263;EPP=3.0103;LEN=1;MEANALT=1;MQM=60;PAIRED=1;PAO=0;PQA=0;QA=34;RPL=1;RPP=3.0103;RPR=1;RUN=1;SAF=2;SAP=7.35324;SAR=0;TYPE=snp;AN=10;AC=1  GT:AD:AO:DP:PL:QA:QR:RO ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   0/1:3,2:2:5:17,0,33:34:51:3 0/0:7,0:0:7:0,21,109:0:119:7    0/0:2,0:0:2:0,6,32:0:34:2   0/0:7,0:0:7:0,21,109:0:119:7    0/0:3,0:0:3:0,9,48:0:51:3
RF01    351 .   T   A   0   .   DPB=25;EPPR=3.94093;GTI=1;MQMR=60;NS=5;NUMALT=1;ODDS=8.27411;PAIREDR=1;PQR=0;PRO=0;QR=357;RO=21;RPPR=3.94093;SRF=21;SRP=48.6112;SRR=0;DP=25;AB=0.4;ABP=3.87889;AF=0.2;AO=4;CIGAR=1X;DPRA=1;EPP=11.6962;LEN=1;MEANALT=1;MQM=60;PAIRED=1;PAO=0;PQA=0;QA=68;RPL=0;RPP=11.6962;RPR=4;RUN=1;SAF=4;SAP=11.6962;SAR=0;TYPE=snp;AN=10;AC=2  GT:AD:AO:DP:PL:QA:QR:RO ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   0/0:7,0:0:7:0,21,109:0:119:7    0/1:3,2:2:5:17,0,33:34:51:3 0/0:4,0:0:4:0,12,63:0:68:4  0/1:3,2:2:5:17,0,33:34:51:3 0/0:4,0:0:4:0,12,63:0:68:4


$ bcftools merge --force-samples src/test/resources/rotavirus_rf.vcf.gz src/test/resources/rotavirus_rf.freebayes.vcf.gz  |\
    java -jar dist/jvarkit.jar bcftoolsmergebest |\
    grep "#CHROM" -A 3
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  S3  S4  S5  S1  S2
RF01    243 .   A   C   0   .   AB=0.5;ABP=3.0103;AC=1;AF=0.1;AN=10;AO=4;CIGAR=1X;DP=28;DPB=28;DPRA=1.66667;EPP=11.6962;EPPR=3.37221;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=5;NUMALT=1;ODDS=7.64661;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=68;QR=408;RO=24;RPL=4;RPP=11.6962;RPPR=3.37221;RPR=0;RUN=1;SAF=4;SAP=11.6962;SAR=0;SRF=24;SRP=55.1256;SRR=0;TYPE=snp GT:AD:AO:DP:PL:QA:QR:RO 0/0:7,1:1:8:0,7,92:17:119:7 0/1:2,2:2:4:20,0,20:34:34:2 0/0:4,0:0:4:0,12,63:0:68:4  ./. ./.
RF01    280 .   A   C   0   .   AB=0.4;ABP=3.44459;AC=1;AF=0.1;AN=10;AO=2;CIGAR=1X;DP=24;DPB=24;DPRA=1.05263;EPP=3.0103;EPPR=4.58955;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=5;NUMALT=1;ODDS=6.90616;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=34;QR=374;RO=22;RPL=1;RPP=3.0103;RPPR=4.58955;RPR=1;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=22;SRP=50.7827;SRR=0;TYPE=snp  GT:AD:AO:DP:PL:QA:QR:RO 0/0:7,0:0:7:0,21,109:0:119:7    0/0:2,0:0:2:0,6,32:0:34:2   0/1:3,2:2:5:17,0,33:34:51:3 ./. ./.
RF01    351 .   T   A   0   .   AB=0.4;ABP=3.87889;AC=2;AF=0.2;AN=10;AO=4;CIGAR=1X;DP=25;DPB=25;DPRA=1;EPP=11.6962;EPPR=3.94093;GTI=1;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=5;NUMALT=1;ODDS=8.27411;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=68;QR=357;RO=21;RPL=0;RPP=11.6962;RPPR=3.94093;RPR=4;RUN=1;SAF=4;SAP=11.6962;SAR=0;SRF=21;SRP=48.6112;SRR=0;TYPE=snp  GT:AD:AO:DP:PL:QA:QR:RO 0/1:3,2:2:5:17,0,33:34:51:3 0/0:4,0:0:4:0,12,63:0:68:4  0/0:7,0:0:7:0,21,109:0:119:7    ./. ./.
ADD COMMENT

Login before adding your answer.

Traffic: 2353 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6