Removing contamination with SNP tools
1
1
Entering edit mode
3.0 years ago
alebaars_98 ▴ 10

Hello everyone,

Currently, I'm working on a ChIPseq dataset where I will analyze chromatin marks on transposons and genes in a fungus. Unfortunately, I got some contamination in my data from a closely related species. Because they are so similar, removing contamination based on alignment quality is very unlikely to work since the differences are so small. The differences mostly consist of a single or a few nucleotide(s). With that in mind, we realized that searching for these locations could be treated as looking for a single nucleotide polymorphism. The problem here is that while there are many good tools to find SNPs, I cannot find anything that could remove reads containing one from my BAM file. Does anyone know of a tool that could do this? Or alternatively, is there another way to tackle this problem?

Thanks in advance for any help here. It's been nagging for a while.

filtering SNP BAM ChIPseq contamination • 1.4k views
ADD COMMENT
1
Entering edit mode
3.0 years ago

I quickly wrote http://lindenb.github.io/jvarkit/Biostar9501110.html

usage:

java -jar dist/biostar9501110.jar --inverse -V index.vcf.gz input.bam

It should work for simple SNV, I didn't test for indels. Tell me if you think it's ok.

ADD COMMENT
0
Entering edit mode

Hi Pierre,

Thank you for providing the script. I installed it and ran the test command which works fine, but on my own data, I am experiencing an issue:

alejandro@wildtype2:~:java -jar biostar9501110.jar --bamcompression 9 --inverse --samoutputformat BAM -R data/ref/unmasked/Verticillium_longisporum_VlPD589_HiC-improved_chromosomes.fasta -V data/SNPs/PD589/H3K27M3_A.vcf.gz data/alignment/masked/bwa/PD589/PD589_H3K27M3_A.dupl_rm.bam > data/alignment/masked/bwa/SNPs_filtered/PD589/PD589_H3K27M3_A.filtered.bam
[SEVERE][MultiBamLauncher]contig is null
java.lang.IllegalArgumentException: contig is null
        at com.github.lindenb.jvarkit.samtools.util.SimpleInterval.<init>(SimpleInterval.java:76)
        at com.github.lindenb.jvarkit.variant.vcf.BufferedVCFReader.query(BufferedVCFReader.java:122)
        at htsjdk.variant.vcf.VCFReader.query(VCFReader.java:60)
        at com.github.lindenb.jvarkit.tools.biostar.Biostar9501110.findVariants(Biostar9501110.java:171)
        at com.github.lindenb.jvarkit.tools.biostar.Biostar9501110.lambda$createSAMRecordFunction$1(Biostar9501110.java:201)
        at com.github.lindenb.jvarkit.jcommander.OnePassBamLauncher.scanIterator(OnePassBamLauncher.java:142)
        at com.github.lindenb.jvarkit.jcommander.OnePassBamLauncher.processInput(OnePassBamLauncher.java:153)
        at com.github.lindenb.jvarkit.jcommander.MultiBamLauncher.doWork(MultiBamLauncher.java:245)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:796)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:959)
        at com.github.lindenb.jvarkit.tools.biostar.Biostar9501110.main(Biostar9501110.java:207)
[INFO][Launcher]biostar9501110 Exited with failure (-1)

It looks like something is wrong with my input. I'll describe how I got it, and let me know if there's anything I should do differently.

The BAM files were obtained by aligning with BWA-MEM and removing the duplicates with picard MarkDuplicates. They are sorted and indexed. The VCF files were generated with bcftools mpileup for specific regions of interest. They were bgzipped (with index created) ad then indexed using tabix. No other parameters were tweaked.

ADD REPLY
0
Entering edit mode

how _strange_ ... please, what is the output of

samtools idxstats data/alignment/masked/bwa/PD589/PD589_H3K27M3_A.dupl_rm.bam

and

tabix -l data/SNPs/PD589/H3K27M3_A.vcf.gz

please.

Please, use https://github.com/lindenb/jvarkit/issues for other questions.

ADD REPLY
0
Entering edit mode

ah no ! got it ! I forgot to test if the read was unmapped and all the reads are mapped in my test file. Give me a few minutes...

ADD REPLY
0
Entering edit mode

... and done ! I fixed the bug, can you please update the code and tell me if it works ?

ADD REPLY
0
Entering edit mode

It works now. To test the result, I quickly generated a new vcf file from the resulting BAM file. It does still contain indels, but they are rare and should not affect the results too much. All SNPs are gone. Thank you very much for the tool.

ADD REPLY

Login before adding your answer.

Traffic: 1984 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6