Hi. I've created a bash script as a guide to convert genetic data from Nebula Genomics to 23andMe format.
This script outlines the necessary steps, each of which should be executed and reviewed before proceeding to the next.
Check it out here:
#!/bin/bash
# Author: Guillermo Céspedes Tabárez
# Version: 0.1 (Development - Not Yet Tested)
# Description: Conversion of Nebula Genomics FASTQ to 23andMe format
# This Bash script automates the conversion process of genetic data in Nebula Genomics' FASTQ format to 23andMe format, which is compatible with various genetic analysis services.
# Note: The primary goal is to achieve the highest quality conversion possible, even if it is slow.
# Any performance improvements without compromising quality are welcome.
## Script Steps:
# 1. Install bioinformatics tools.
# 2. Download necessary files, including the human reference genome (GRCh37).
# 3. Decompress Nebula Genomics FASTQ files.
# 4. Convert FASTQ files to BAM format.
# 5. Sort and index the BAM file.
# 6. Generate standard VCF and gVCF files.
# 7. Filter variants (optional).
# 8. Convert data to 23andMe format.
##################################################
# Nebula FASTQ
nebula_fastq_1="nebula_sample_1.fq.gz"
nebula_fastq_2="nebula_sample_2.fq.gz"
# Human reference genome (GRCh37)
ref_genome_url="https://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz"
# Set the number of threads for bwa
num_threads=$(nproc)
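# + Step 1: Install essential tools and libraries for compiling bioinformatics tools
sudo apt-get update
# unzip is needed for the GATK and PLINK archives; the extra -dev packages are what the
# samtools/htslib/bcftools builds typically need; GATK needs Java (4.5 expects Java 17) and Python
sudo apt-get install -y wget gzip unzip git build-essential zlib1g-dev libbz2-dev liblzma-dev libncurses-dev libcurl4-openssl-dev default-jre python3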
# - Install bwa
git clone https://github.com/lh3/bwa.git
# Build in a subshell so the working directory is unchanged for the next steps
(cd bwa && make)
sudo cp bwa/bwa /usr/local/bin
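# - Install hisat2 (only needed for the alternative alignment in Step 4)
git clone https://github.com/DaehwanKimLab/hisat2.git
(cd hisat2 && make)
# The hisat2 wrapper scripts call binaries in their own directory, so add it to PATH instead of copying
export PATH="$PWD/hisat2:$PATH"
# - Install bowtie2
# TODO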
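# - Install samtools
# A git checkout of samtools builds against an htslib checkout in the sibling directory
[ -d htslib ] || git clone --recurse-submodules https://github.com/samtools/htslib.git
git clone https://github.com/samtools/samtools.git
(cd samtools && make)
sudo cp samtools/samtools /usr/local/bin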
# - Install gatk
# Download the latest version from the GATK website: https://software.broadinstitute.org/gatk/download/
# For example:
wget https://github.com/broadinstitute/gatk/releases/download/4.5.0.0/gatk-4.5.0.0.zip
unzip gatk-4.5.0.0.zip
# The gatk wrapper needs the jar files that ship next to it, so add the directory to PATH instead of copying the wrapper alone
export PATH="$PWD/gatk-4.5.0.0:$PATH"
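# - Install bcftools
# GitHub no longer serves the unauthenticated git:// protocol, so clone over https
# Like samtools, a git checkout of bcftools builds against ../htslib
[ -d htslib ] || git clone --recurse-submodules https://github.com/samtools/htslib.git
git clone https://github.com/samtools/bcftools.git
(cd bcftools && make)
sudo cp bcftools/bcftools /usr/local/bin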
# - Install plink 1.9
# Download the latest version from the PLINK website: https://www.cog-genomics.org/plink/
# For example:
wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20231211.zip
unzip plink_linux_x86_64_20231211.zip
sudo cp plink /usr/local/bin
# Step 2: Download the human reference genome
wget -O ref_genome.fa.gz "$ref_genome_url"
# Decompress the human reference genome (-k keeps the .gz; the output is ref_genome.fa)
gunzip -k ref_genome.fa.gz
# Step 3: Decompress FASTQ
gunzip -c "$nebula_fastq_1" > nebula_fastq_1.fq
gunzip -c "$nebula_fastq_2" > nebula_fastq_2.fq
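# Step 4: Convert FASTQ to BAM
# bwa mem needs a BWA index of the reference (built once; slow for a whole genome)
bwa index ref_genome.fa
# GATK HaplotypeCaller (Step 6) requires a read group with a sample name; the ID/SM values below are placeholders
bwa mem -t "$num_threads" -R '@RG\tID:nebula\tSM:nebula_sample' ref_genome.fa nebula_fastq_1.fq nebula_fastq_2.fq | samtools view -b -o aligned.bam -
# Alt: hisat2-build ref_genome.fa hisat2_genome_index
# Alt: hisat2 -p $num_threads -x hisat2_genome_index -1 nebula_fastq_1.fq -2 nebula_fastq_2.fq -S aligned_reads.sam
# Alt: Convert SAM to BAM
# Alt: samtools view -bS aligned_reads.sam > aligned_reads.bam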
# Step 5: Sort and index BAM file
samtools sort -o sorted.bam aligned.bam
samtools index sorted.bam
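# + Step 6: Generate standard VCF and gVCF
# - Index the reference downloaded in Step 2 (GATK requires the .fai and .dict files)
samtools faidx ref_genome.fa
gatk CreateSequenceDictionary -R ref_genome.fa
# - Standard VCF (bcftools mpileup replaces the old `samtools mpileup -u` BCF output)
bcftools mpileup -Ou -f ref_genome.fa sorted.bam | bcftools call -mv -Ov -o variants.vcf
# - gVCF
gatk HaplotypeCaller -R ref_genome.fa -I sorted.bam -O output.g.vcf.gz -ERC GVCF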
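# Step 7: Filter variants (optional)
bcftools filter -i 'QUAL>20' -Ov -o filtered.vcf variants.vcf
# Step 8: Convert to 23andMe format
# PLINK 1.9 reads VCF directly with --vcf; --recode 23 only works on a single sample.
# If PLINK rejects the unplaced contigs in the Ensembl reference, adding
# `--allow-extra-chr --chr 1-22 X Y MT` is one option to try.
# Note: variants.vcf contains only variant sites with "." IDs, so rsIDs still have to be annotated before services that look up rsIDs can use the output.
plink --vcf filtered.vcf --recode 23 --out 23andme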
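Because Step 1 installs several tools by hand, it may help to confirm they are all reachable before starting the long alignment step. The following is only a minimal sketch: `require_tools` is a hypothetical helper name, not part of the script above, and it checks nothing beyond whether each command is found on `PATH`.

```shell
#!/bin/bash
# Hypothetical helper: print every tool that is not found on PATH
# and return non-zero if at least one is missing.
require_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the script above, a call such as `require_tools bwa samtools bcftools gatk plink || exit 1` placed before Step 2 would stop early instead of failing hours into the run.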
This is version 0.1, and it has not been tested. It's a starting point for the community to refine.
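Since the script is untested and each step is meant to be reviewed before moving on, one possible way to enforce that is a small wrapper that stops on the first failure and waits for confirmation. This is a sketch under assumptions: `run_step` and the `AUTO_CONFIRM` variable are hypothetical names introduced here for illustration.

```shell
#!/bin/bash
# Hypothetical wrapper: run one command, stop on failure,
# and pause for manual review unless AUTO_CONFIRM=1 is set.
run_step() {
    local description="$1"
    shift
    echo "==> ${description}"
    if ! "$@"; then
        echo "Step failed: ${description}" >&2
        return 1
    fi
    if [ "${AUTO_CONFIRM:-0}" != "1" ]; then
        local reply
        # Wait for the user to inspect the output before continuing
        read -r -p "Review the output, then press Enter to continue (Ctrl+C to abort) " reply
    fi
    return 0
}
```

A step would then be written as, for example, `run_step "Index BAM" samtools index sorted.bam || exit 1`. Steps that use a pipe would need to be wrapped in a function first, because `run_step` only runs a single command.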
The primary focus is on achieving the highest quality conversion possible, and any performance improvements without compromising quality are welcome.
Your feedback and suggestions are greatly appreciated!
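For anyone testing the script, a quick sanity check of the final file may be a useful first piece of feedback. The sketch below assumes the usual 23andMe raw-data layout (comment lines starting with `#`, then tab-separated rsid, chromosome, position, genotype); `check_23andme_format` is a hypothetical helper, and it checks only the layout, not whether the genotypes are correct.

```shell
#!/bin/bash
# Hypothetical helper: verify that every non-comment line has four
# tab-separated fields, a numeric position, and a 1-2 character genotype.
check_23andme_format() {
    local file="$1"
    awk -F'\t' '
        BEGIN { bad = 0 }
        { sub(/\r$/, "") }          # tolerate Windows line endings
        /^#/ { next }               # skip header comments
        NF != 4 || $3 !~ /^[0-9]+$/ || $4 !~ /^[ACGTDI-][ACGTDI-]?$/ {
            printf "line %d looks malformed: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
        }
        END { exit bad }
    ' "$file"
}
```

Running `check_23andme_format 23andme.txt` on the PLINK output (the file name depends on the `--out` prefix) returns non-zero and lists the offending lines if the layout is off.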