Tool: Converting Nebula Genomics Data to 23andMe Format
12 months ago
Guillermo • 0

Hi. I've created a bash script as a guide to convert genetic data from Nebula Genomics to 23andMe format.

This script outlines the necessary steps, each of which should be executed and reviewed before proceeding to the next.

Check it out here:

#!/bin/bash
# Author: Guillermo Céspedes Tabárez
# Version: 0.1 (Development - Not Yet Tested)
# Description: Conversion of Nebula Genomics FASTQ to 23andMe format
# This Bash script automates the conversion of genetic data from Nebula Genomics' FASTQ format to the 23andMe format, which is accepted by various genetic analysis services.
# Note: The primary goal is to achieve the highest quality conversion possible, even if it is slow.
# Any performance improvements without compromising quality are welcome.
## Script Steps:
# 1. Install bioinformatics tools.
# 2. Download necessary files, including the human reference genome (GRCh37).
# 3. Decompress Nebula Genomics FASTQ files.
# 4. Convert FASTQ files to BAM format.
# 5. Sort and index the BAM file.
# 6. Generate standard VCF and gVCF files.
# 7. Filter variants (optional).
# 8. Convert data to 23andMe format.
##################################################
# Nebula FASTQ
nebula_fastq_1="nebula_sample_1.fq.gz"
nebula_fastq_2="nebula_sample_2.fq.gz"
# Human reference genome (GRCh37)
ref_genome_url="https://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz"
# Set the number of threads for bwa
num_threads=$(nproc)
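# Suggested addition (not in the original v0.1 flow): stop at the first failed command so each step can be reviewed before moving on to the next
set -euo pipefail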
# + Step 1: Install essential tools and libraries for compiling bioinformatics tools
sudo apt-get update
# unzip is needed for the gatk/plink archives, the -dev libraries for building samtools/bcftools (HTSlib), and Java 17 for GATK 4.5
sudo apt-get install -y wget gzip unzip git build-essential zlib1g-dev libbz2-dev liblzma-dev libncurses-dev libcurl4-openssl-dev openjdk-17-jre
# - Install bwa
git clone https://github.com/lh3/bwa.git
cd bwa
make
sudo cp bwa /usr/local/bin
cd ..
# - Install hisat2 (optional: only needed for the alternative alignment commands in Step 4)
git clone https://github.com/DaehwanKimLab/hisat2.git
cd hisat2
make
cd ..
# - Install bowtie2
# TODO
# - Install samtools
# Note: building from a plain git checkout also needs autotools and an HTSlib source tree; the release tarballs from htslib.org are simpler to build
git clone https://github.com/samtools/samtools.git
cd samtools
make
sudo cp samtools /usr/local/bin
cd ..
# - Install gatk
# Download the latest version from the GATK website: https://software.broadinstitute.org/gatk/download/
# For example:
wget https://github.com/broadinstitute/gatk/releases/download/4.5.0.0/gatk-4.5.0.0.zip
unzip gatk-4.5.0.0.zip
# The gatk wrapper script looks for its jar next to itself, so put the unpacked directory on PATH instead of copying only the script
export PATH="$PWD/gatk-4.5.0.0:$PATH"
# - Install bcftools
# Same note as for samtools: a git checkout needs autotools and an HTSlib source tree
git clone https://github.com/samtools/bcftools.git
cd bcftools
make
sudo cp bcftools /usr/local/bin
cd ..
# - Install plink 1.9
# Download the latest version from the PLINK website: https://www.cog-genomics.org/plink/
# For example:
wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20231211.zip
unzip plink_linux_x86_64_20231211.zip
sudo cp plink /usr/local/bin
# Step 2: Download the human reference genome
wget -O ref_genome.fa.gz $ref_genome_url
# Decompress the human reference genome (keep the compressed copy)
gunzip -k ref_genome.fa.gz
# Index the reference: bwa needs its own index, and GATK needs the .fai index plus a sequence dictionary
bwa index ref_genome.fa
samtools faidx ref_genome.fa
gatk CreateSequenceDictionary -R ref_genome.fa
# Step 3: Decompress FASTQ
gunzip -c $nebula_fastq_1 > nebula_fastq_1.fq
gunzip -c $nebula_fastq_2 > nebula_fastq_2.fq
# Step 4: Convert FASTQ to BAM
# GATK requires read-group information, so add a placeholder @RG line here (adjust ID/SM to your sample)
bwa mem -t $num_threads -R '@RG\tID:nebula\tSM:sample\tLB:lib1\tPL:ILLUMINA' ref_genome.fa nebula_fastq_1.fq nebula_fastq_2.fq | samtools view -b - > aligned.bam
# Alt: hisat2-build ref_genome.fa hisat2_genome_index
# Alt: hisat2 -p $num_threads -x hisat2_genome_index -1 nebula_fastq_1.fq -2 nebula_fastq_2.fq -S aligned_reads.sam
# Alt: Convert SAM to BAM
# Alt: samtools view -bS aligned_reads.sam > aligned_reads.bam
# Step 5: Sort and index BAM file
samtools sort aligned.bam -o sorted.bam
samtools index sorted.bam
# + Step 6: Generate standard VCF and gVCF
# - Standard VCF
# (VCF calling moved from samtools mpileup to bcftools mpileup)
bcftools mpileup -f ref_genome.fa sorted.bam | bcftools call -mv -Ov -o variants.vcf
# - gVCF
gatk HaplotypeCaller -R ref_genome.fa -I sorted.bam -O output.g.vcf.gz -ERC GVCF
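# Sketch (not part of the original flow): if you prefer the GATK calls over the bcftools calls above,
# the gVCF can be genotyped into a regular VCF like this:
# gatk GenotypeGVCFs -R ref_genome.fa -V output.g.vcf.gz -O gatk_variants.vcf.gz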
# Step 7: Filter variants (optional) and convert to PLINK format
# (a standalone "vcf-to-plink" tool does not exist; plink reads VCF directly with --vcf)
bcftools filter -i 'QUAL>20' variants.vcf -Ov -o filtered.vcf
plink --vcf filtered.vcf --recode --out plink
# Step 8: Convert to 23andMe format
plink --file plink --recode 23 --out 23andme
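# Quick sanity check (suggestion): the output should be a tab-separated rsid/chromosome/position/genotype file
# head 23andme.txt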

This is version 0.1, and it has not been tested. It's a starting point for the community to refine.

The primary focus is on achieving the highest quality conversion possible, and any performance improvements without compromising quality are welcome.

Your feedback and suggestions are greatly appreciated!

12 months ago
Michael 55k

Hi,

Thank you for your contribution; here is your free code review:

  • This scenario is sort of ideal for a snakemake workflow, and I think you should check whether it's worth writing one.
  • The script should have better separation of concerns (analysis vs. installing software and dependencies).
  • I am very skeptical about scripts installing things via sudo and apt; note that not everyone is running Debian.
  • Leave the decision of how to install software to the users.
  • All the software you are installing is available via conda. I recommend providing the dependencies as a conda env export in a YAML file, or simply integrating that into the workflow (see the sketch right below this list).
  • Your workflow is a basic variant-calling pipeline and there are many of these already; only the last step is specific: plink --file plink --recode 23 --out 23andme
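
On the conda point: something along these lines would cover everything the script installs by hand (package names as they appear on bioconda/conda-forge; the environment name is arbitrary and pinning exact versions is up to you):

    conda create -n nebula2x23 -c conda-forge -c bioconda bwa hisat2 samtools bcftools gatk4 plink
    conda activate nebula2x23
    # freeze the environment so others can reproduce it
    conda env export > environment.yml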

  • Provide the filenames as parameters on the command line instead of hard-coding them (see the sketch after the quoted code below).

    # Step 3: Decompress FASTQ
    gunzip -c $nebula_fastq_1 > nebula_fastq_1.fq
    gunzip -c $nebula_fastq_2 > nebula_fastq_2.fq

This decompression step is neither recommended nor required; remove it (bwa mem reads gzipped FASTQ directly).
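
For the filename parameters, a rough sketch of what I mean (the script name and variable names are just placeholders):

    #!/bin/bash
    # Usage: ./nebula_to_23andme.sh reads_1.fq.gz reads_2.fq.gz ref_genome.fa
    nebula_fastq_1="$1"
    nebula_fastq_2="$2"
    ref_genome="$3"
    # bwa mem accepts the .fq.gz files as-is, so no decompression step is needed
    bwa mem -t "$(nproc)" "$ref_genome" "$nebula_fastq_1" "$nebula_fastq_2" | samtools view -b - > aligned.bam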

    # + Step 6: Generate standard VCF and gVCF
    # - Standard VCF
    bcftools mpileup -f ref_genome.fa sorted.bam | bcftools call -mv -Ov -o variants.vcf
    # - gVCF
    gatk HaplotypeCaller -R ref_genome.fa -I sorted.bam -O output.g.vcf.gz -ERC GVCF

It is not clear why you are running two variant callers but only using the output of the first. I'd stick with the GATK best-practices workflows or use DeepVariant. The GATK workflows include marking duplicates and base-quality recalibration (at least for human data) as well as variant-filtration steps.
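
Roughly, those extra best-practices steps look like this (just a sketch; the known-sites file name is a placeholder, you would need to download the matching GRCh37/b37 resource VCF and its index yourself):

    # mark PCR/optical duplicates
    gatk MarkDuplicates -I sorted.bam -O dedup.bam -M dup_metrics.txt
    # base-quality score recalibration (needs a known-sites VCF for GRCh37/b37)
    gatk BaseRecalibrator -I dedup.bam -R ref_genome.fa --known-sites known_sites_b37.vcf.gz -O recal.table
    gatk ApplyBQSR -I dedup.bam -R ref_genome.fa --bqsr-recal-file recal.table -O recal.bam
    # then call variants on the recalibrated BAM
    gatk HaplotypeCaller -R ref_genome.fa -I recal.bam -O output.g.vcf.gz -ERC GVCF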

