Tool: Converting Nebula Genomics Data to 23andMe Format
12 months ago
Guillermo • 0

Hi. I've created a bash script as a guide to convert genetic data from Nebula Genomics to 23andMe format.

This script outlines the necessary steps, each of which should be executed and reviewed before proceeding to the next.

Check it out here:

#!/bin/bash
# Author: Guillermo Céspedes Tabárez
# Version: 0.1 (Development - Not Yet Tested)
# Description: Conversion of Nebula Genomics FASTQ to 23andMe format
# This Bash script automates the conversion of genetic data from Nebula Genomics' FASTQ format to the 23andMe format, which is accepted by various genetic analysis services.
# Note: The primary goal is to achieve the highest quality conversion possible, even if it is slow.
# Any performance improvements without compromising quality are welcome.
## Script Steps:
# 1. Install bioinformatics tools.
# 2. Download necessary files, including the human reference genome (GRCh37).
# 3. Decompress Nebula Genomics FASTQ files.
# 4. Convert FASTQ files to BAM format.
# 5. Sort and index the BAM file.
# 6. Generate standard VCF and gVCF files.
# 7. Filter variants (optional).
# 8. Convert data to 23andMe format.
##################################################
# Nebula FASTQ
nebula_fastq_1="nebula_sample_1.fq.gz"
nebula_fastq_2="nebula_sample_2.fq.gz"
# Human reference genome (GRCh37)
ref_genome_url="https://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz"
# Set the number of threads for bwa
num_threads=$(nproc)
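# Suggested addition (not in the original v0.1 flow): stop at the first failed command so each step can be reviewed before moving on to the next
set -euo pipefail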
# + Step 1: Install essential tools and libraries for compiling bioinformatics tools
sudo apt-get update
# unzip is needed for the gatk/plink archives, the -dev libraries for building samtools/bcftools (HTSlib), and Java 17 for GATK 4.5
sudo apt-get install -y wget gzip unzip git build-essential zlib1g-dev libbz2-dev liblzma-dev libncurses-dev libcurl4-openssl-dev openjdk-17-jre
# - Install bwa
git clone https://github.com/lh3/bwa.git
cd bwa
make
sudo cp bwa /usr/local/bin
cd ..
# - Install hisat2 (optional: only needed for the alternative alignment commands in Step 4)
git clone https://github.com/DaehwanKimLab/hisat2.git
cd hisat2
make
cd ..
# - Install bowtie2
# TODO
# - Install samtools
# Note: building from a plain git checkout also needs autotools and an HTSlib source tree; the release tarballs from htslib.org are simpler to build
git clone https://github.com/samtools/samtools.git
cd samtools
make
sudo cp samtools /usr/local/bin
cd ..
# - Install gatk
# Download the latest version from the GATK website: https://software.broadinstitute.org/gatk/download/
# For example:
wget https://github.com/broadinstitute/gatk/releases/download/4.5.0.0/gatk-4.5.0.0.zip
unzip gatk-4.5.0.0.zip
# The gatk wrapper script looks for its jar next to itself, so put the unpacked directory on PATH instead of copying only the script
export PATH="$PWD/gatk-4.5.0.0:$PATH"
# - Install bcftools
# Same note as for samtools: a git checkout needs autotools and an HTSlib source tree
git clone https://github.com/samtools/bcftools.git
cd bcftools
make
sudo cp bcftools /usr/local/bin
cd ..
# - Install plink 1.9
# Download the latest version from the PLINK website: https://www.cog-genomics.org/plink/
# For example:
wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20231211.zip
unzip plink_linux_x86_64_20231211.zip
sudo cp plink /usr/local/bin
# Step 2: Download the human reference genome
wget -O ref_genome.fa.gz $ref_genome_url
# Decompress the human reference genome (keep the compressed copy)
gunzip -k ref_genome.fa.gz
# Index the reference: bwa needs its own index, and GATK needs the .fai index plus a sequence dictionary
bwa index ref_genome.fa
samtools faidx ref_genome.fa
gatk CreateSequenceDictionary -R ref_genome.fa
# Step 3: Decompress FASTQ
gunzip -c $nebula_fastq_1 > nebula_fastq_1.fq
gunzip -c $nebula_fastq_2 > nebula_fastq_2.fq
# Step 4: Convert FASTQ to BAM
# GATK requires read-group information, so add a placeholder @RG line here (adjust ID/SM to your sample)
bwa mem -t $num_threads -R '@RG\tID:nebula\tSM:sample\tLB:lib1\tPL:ILLUMINA' ref_genome.fa nebula_fastq_1.fq nebula_fastq_2.fq | samtools view -b - > aligned.bam
# Alt: hisat2-build ref_genome.fa hisat2_genome_index
# Alt: hisat2 -p $num_threads -x hisat2_genome_index -1 nebula_fastq_1.fq -2 nebula_fastq_2.fq -S aligned_reads.sam
# Alt: Convert SAM to BAM
# Alt: samtools view -bS aligned_reads.sam > aligned_reads.bam
# Step 5: Sort and index BAM file
samtools sort aligned.bam -o sorted.bam
samtools index sorted.bam
# + Step 6: Generate standard VCF and gVCF
# - Standard VCF
# (VCF calling moved from samtools mpileup to bcftools mpileup)
bcftools mpileup -f ref_genome.fa sorted.bam | bcftools call -mv -Ov -o variants.vcf
# - gVCF
gatk HaplotypeCaller -R ref_genome.fa -I sorted.bam -O output.g.vcf.gz -ERC GVCF
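# Sketch (not part of the original flow): if you prefer the GATK calls over the bcftools calls above,
# the gVCF can be genotyped into a regular VCF like this:
# gatk GenotypeGVCFs -R ref_genome.fa -V output.g.vcf.gz -O gatk_variants.vcf.gz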
# Step 7: Filter variants (optional) and convert to PLINK format
# (a standalone "vcf-to-plink" tool does not exist; plink reads VCF directly with --vcf)
bcftools filter -i 'QUAL>20' variants.vcf -Ov -o filtered.vcf
plink --vcf filtered.vcf --recode --out plink
# Step 8: Convert to 23andMe format
plink --file plink --recode 23 --out 23andme
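# Quick sanity check (suggestion): the output should be a tab-separated rsid/chromosome/position/genotype file
# head 23andme.txt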

This is version 0.1, and it has not been tested. It's a starting point for the community to refine.

The primary focus is on achieving the highest quality conversion possible, and any performance improvements without compromising quality are welcome.

Your feedback and suggestions are greatly appreciated!

12 months ago
Michael 55k

Hi,

Thank you for your contribution; here is your free code review:

  • This scenario is sort of ideal for a snakemake workflow, and I think you should check whether it's worth writing one.
  • The script should have better separation of concerns (analysis vs. installing software and dependencies).
  • I am very skeptical about scripts installing things via sudo and apt; note that not everyone is running Debian.
  • Leave the decision of how to install software to the users.
  • All the software you are installing is available via conda. I recommend providing the dependencies as a conda env export in a YAML file, or simply integrating that into the workflow (see the sketch right below this list).
  • Your workflow is a basic variant-calling pipeline and there are many of these already; only the last step is specific: plink --file plink --recode 23 --out 23andme
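
On the conda point: something along these lines would cover everything the script installs by hand (package names as they appear on bioconda/conda-forge; the environment name is arbitrary and pinning exact versions is up to you):

    conda create -n nebula2x23 -c conda-forge -c bioconda bwa hisat2 samtools bcftools gatk4 plink
    conda activate nebula2x23
    # freeze the environment so others can reproduce it
    conda env export > environment.yml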

  • Provide the filenames as parameters on the command line instead of hard-coding them (see the sketch after the quoted code below).

    # Step 3: Decompress FASTQ
    gunzip -c $nebula_fastq_1 > nebula_fastq_1.fq
    gunzip -c $nebula_fastq_2 > nebula_fastq_2.fq

This decompression step is neither recommended nor required; remove it (bwa mem reads gzipped FASTQ directly).
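
For the filename parameters, a rough sketch of what I mean (the script name and variable names are just placeholders):

    #!/bin/bash
    # Usage: ./nebula_to_23andme.sh reads_1.fq.gz reads_2.fq.gz ref_genome.fa
    nebula_fastq_1="$1"
    nebula_fastq_2="$2"
    ref_genome="$3"
    # bwa mem accepts the .fq.gz files as-is, so no decompression step is needed
    bwa mem -t "$(nproc)" "$ref_genome" "$nebula_fastq_1" "$nebula_fastq_2" | samtools view -b - > aligned.bam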

    # + Step 6: Generate standard VCF and gVCF
    # - Standard VCF
    bcftools mpileup -f ref_genome.fa sorted.bam | bcftools call -mv -Ov -o variants.vcf
    # - gVCF
    gatk HaplotypeCaller -R ref_genome.fa -I sorted.bam -O output.g.vcf.gz -ERC GVCF

It is not clear why you are running two variant callers but only using the output of the first. I'd stick with the GATK best-practices workflows or use DeepVariant. The GATK workflows include marking duplicates and base-quality recalibration (at least for human data) as well as variant-filtration steps.
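
Roughly, those extra best-practices steps look like this (just a sketch; the known-sites file name is a placeholder, you would need to download the matching GRCh37/b37 resource VCF and its index yourself):

    # mark PCR/optical duplicates
    gatk MarkDuplicates -I sorted.bam -O dedup.bam -M dup_metrics.txt
    # base-quality score recalibration (needs a known-sites VCF for GRCh37/b37)
    gatk BaseRecalibrator -I dedup.bam -R ref_genome.fa --known-sites known_sites_b37.vcf.gz -O recal.table
    gatk ApplyBQSR -I dedup.bam -R ref_genome.fa --bqsr-recal-file recal.table -O recal.bam
    # then call variants on the recalibrated BAM
    gatk HaplotypeCaller -R ref_genome.fa -I recal.bam -O output.g.vcf.gz -ERC GVCF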

