Tool: Converting Nebula Genomics Data to 23andMe Format
5 months ago
Guillermo • 0

Hi. I've created a bash script as a guide to convert genetic data from Nebula Genomics to 23andMe format.

This script outlines the necessary steps, each of which should be executed and reviewed before proceeding to the next.

Check it out here:

This is version 0.1, and it has not been tested. It's a starting point for the community to refine.

The primary focus is on achieving the highest-quality conversion possible; performance improvements that don't compromise quality are welcome.

Your feedback and suggestions are greatly appreciated!

23andMe Nebula • 625 views
5 months ago
Michael 55k

Hi,

Thank you for your contribution; here is your free code review:

  • This scenario is well suited to a Snakemake workflow, and I think you could check whether it's worth writing one.
  • The script should have better separation of concerns (analysis vs. installing software and dependencies).
  • I am very skeptical about scripts installing software via sudo and apt; note that not everyone is running Debian.
  • Leave the decision of how to install software to the users.
  • All the software you are installing is available via conda. I recommend providing the dependencies as a conda env export in a YAML file, or simply integrating that into the workflow (a conda one-liner is included in the sketch at the end of this review).
  • Your WF is a basic variant-calling pipeline, and there are many of these already; only the last step is specific:

    plink --file plink --recode 23 --out 23andme  # this is the specific code

  • Provide filenames as parameters on the command line.

    # Step 3: Decompress FASTQ
    gunzip -c $nebula_fastq_1 > nebula_fastq_1.fq
    gunzip -c $nebula_fastq_2 > nebula_fastq_2.fq

This step is neither recommended nor required; remove it. Most modern aligners read gzipped FASTQ directly.
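As a sketch of the alternative, assuming bwa mem is the aligner (the full script isn't shown here) and reusing the genome.fa and sorted.bam names from the step below, the gzipped FASTQ files can be passed straight in:

    # bwa mem (like most modern aligners) reads gzipped FASTQ directly,
    # so the decompression step can be dropped entirely:
    bwa index genome.fa
    bwa mem genome.fa "$nebula_fastq_1" "$nebula_fastq_2" | samtools sort -o sorted.bam -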

    # Step 6: Generate standard VCF and gVCF
    # Standard VCF
    samtools mpileup -uf genome.fa sorted.bam | bcftools call -mv -Ov -o variants.vcf
    # gVCF
    gatk HaplotypeCaller -R genome.fa -I sorted.bam -O output.g.vcf.gz -ERC GVCF

It is not clear why you are running two variant callers but only use the output of the first. I'd stick with the GATK best-practices workflows or use DeepVariant. GATK workflows include marking duplicates and base-quality recalibration (at least for human data), as well as variant filtration steps.
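To illustrate that direction (and the conda point above), here is a rough, untested bash sketch; the package names, the dbsnp.vcf.gz known-sites file, and the filter threshold are assumptions to adapt, not a definitive implementation:

    # Hypothetical conda environment providing all the tools (package names assumed):
    conda create -n nebula2me -c bioconda -c conda-forge bwa samtools gatk4 plink

    # Mark duplicates and recalibrate base qualities (GATK best practices):
    gatk MarkDuplicates -I sorted.bam -O dedup.bam -M dup_metrics.txt
    gatk BaseRecalibrator -I dedup.bam -R genome.fa \
        --known-sites dbsnp.vcf.gz -O recal.table    # dbsnp.vcf.gz: assumed known-sites VCF
    gatk ApplyBQSR -I dedup.bam -R genome.fa --bqsr-recal-file recal.table -O recal.bam

    # Call and hard-filter variants:
    gatk HaplotypeCaller -R genome.fa -I recal.bam -O variants.vcf.gz
    gatk VariantFiltration -R genome.fa -V variants.vcf.gz \
        --filter-name QD2 --filter-expression "QD < 2.0" \
        -O filtered.vcf.gz                           # threshold is illustrative only

    # The one format-specific step: export to 23andMe layout with plink 1.9:
    plink --vcf filtered.vcf.gz --snps-only --recode 23 --out 23andme

Note that --recode 23 expects a single sample and writes the variant ID column as the rsID, so the VCF would need dbSNP annotation (e.g., via HaplotypeCaller's --dbsnp option) for the output to carry proper rs numbers.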

