Advice for analyzing SnpEff results
0
0
Entering edit mode
5 months ago
ekirsch • 0

Hello!

I am hoping to get some recommendations for how to analyze the large output files I got after running SnpEff to compare genetic load across populations. After running SnpEff, I used SnpSift to create three different vcf files called high_impact_variants.vcf.gz, low_impact_variants.vcf.gz, and moderate_impact_variants.vcf.gz. I then used the following script to create txt files containing the number of mutations of each impact class for each sample ID.

# Define paths
INPUT_DIR="/project/sedmands_1143/ekirsch/savannahsparrow/savannahsparrow/41-Passerculus/41-Passerculus/data"
OUTPUT_DIR="/project/sedmands_1143/ekirsch/savannahsparrow/savannahsparrow/41-Passerculus/41-Passerculus/output"
# Define VCF files
vcf_files=("low_impact_variants.vcf.gz" "moderate_impact_variants.vcf.gz" "high_impact_variants.vcf.gz")
# Extract sample IDs and counts of each type of mutation
for vcf_file in "${vcf_files[@]}"; do
    base_name=$(basename "$vcf_file" .vcf.gz)
    output_file="${OUTPUT_DIR}/${base_name}_sample_counts.txt"
    echo -e "Sample\tCount" > "$output_file"

    # Extract sample IDs and genotype information
    bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/ANN[\t%SAMPLE:%GT]\n' "$INPUT_DIR/$vcf_file" | \
    awk -F'\t' '
    {
        for (i=6; i<=NF; i++) {
            split($i, a, ":");
            sample=a[1];
            gt=a[2];
            if (gt != "./.") {
                count[sample]++;
            }
        }
    }
    END {
        for (sample in count) {
            print sample "\t" count[sample];
        }
    }' >> "$output_file"
done

Now I am trying to normalize the results by the number of alternative alleles using this script:

OUTPUT_DIR="/project/sedmands_1143/ekirsch/savannahsparrow/savannahsparrow/41-Passerculus/41-Passerculus/output"
BASE_PATH="/project/sedmands_1143/ekirsch/savannahsparrow/savannahsparrow/41-Passerculus/41-Passerculus"
# Define VCF files
vcf_files=("low_impact_variants.vcf.gz" "moderate_impact_variants.vcf.gz" "high_impact_variants.vcf.gz")
# Read allele counts into a single associative array
declare -A alt_allele_count
for subspecies in "alaudinus" "brooksi" "beldingi" "nevadensis" "rostratus"; do
    allele_count_file="${BASE_PATH}/${subspecies}_allele_counts.txt"
    if [[ -f "$allele_count_file" ]]; then
        while IFS=$'\t' read -r chrom pos total_alleles ref_alleles alt_alleles; do
            key="${chrom}_${pos}"
            alt_allele_count["$key"]=$alt_alleles
        done < "$allele_count_file"
    else
        echo "Allele count file for $subspecies does not exist."
    fi
done
# Normalize counts using precomputed allele counts
for vcf_file in "${vcf_files[@]}"; do
    base_name=$(basename "$vcf_file" .vcf.gz)
    sample_count_file="${OUTPUT_DIR}/${base_name}_sample_counts.txt"
    norm_output_file="${OUTPUT_DIR}/${base_name}_normalized_sample_counts.txt"
    if [[ -f "$sample_count_file" ]]; then
        awk -F'\t' -v OFS='\t' '
        BEGIN {
            for (key in alt_allele_count) {
                split(key, arr, "_")
                chrom = arr[1]
                pos = arr[2]
                alt_count[chrom][pos] = alt_allele_count[key]
            }
        }
        NR==1 {print $0, "NormalizedCount"}
        NR>1 {
            sample=$1
            count=$2
            key = sample
            alt_count = (key in alt_count ? alt_count[key] : 0)
            if (alt_count > 0) {
                print $0, (count / alt_count)
            } else {
                print $0, 0
            }
        }' "${sample_count_file}" > "${norm_output_file}"
    else
        echo "Sample count file for ${base_name} does not exist."
    fi
done

Does this seem like a robust way to analyze SnpEff results? I have not worked with this type of data before so any advice for dealing with these results would be greatly appreciated :)

snpsift snpeff vcf • 379 views
ADD COMMENT
0
Entering edit mode

what is the scientific question you're trying to answer here ?

ADD REPLY
0
Entering edit mode

I am trying to see if certain subspecies have more high-impact mutations than other subspecies due to long-term small population size or inbreeding.

ADD REPLY

Login before adding your answer.

Traffic: 1693 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6