Question

Advice for analyzing SnpEff results

0

Entering edit mode

16 months ago

ekirsch • 0

Hello!

I am hoping to get some recommendations for how to analyze the large output files I got after running SnpEff to compare genetic load across populations. After running SnpEff, I used SnpSift to create three different vcf files called high_impact_variants.vcf.gz, low_impact_variants.vcf.gz, and moderate_impact_variants.vcf.gz. I then used the following script to create txt files containing the number of mutations of each impact class for each sample ID.

# Define paths
INPUT_DIR="/project/sedmands_1143/ekirsch/savannahsparrow/savannahsparrow/41-Passerculus/41-Passerculus/data"
OUTPUT_DIR="/project/sedmands_1143/ekirsch/savannahsparrow/savannahsparrow/41-Passerculus/41-Passerculus/output"
# Define VCF files
vcf_files=("low_impact_variants.vcf.gz" "moderate_impact_variants.vcf.gz" "high_impact_variants.vcf.gz")
# Extract sample IDs and counts of each type of mutation
for vcf_file in "${vcf_files[@]}"; do
    base_name=$(basename "$vcf_file" .vcf.gz)
    output_file="${OUTPUT_DIR}/${base_name}_sample_counts.txt"
    echo -e "Sample\tCount" > "$output_file"

    # Extract sample IDs and genotype information
    bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/ANN[\t%SAMPLE:%GT]\n' "$INPUT_DIR/$vcf_file" | \
    awk -F'\t' '
    {
        for (i=6; i<=NF; i++) {
            split($i, a, ":");
            sample=a[1];
            gt=a[2];
            if (gt != "./.") {
                count[sample]++;
            }
        }
    }
    END {
        for (sample in count) {
            print sample "\t" count[sample];
        }
    }' >> "$output_file"
done

Now I am trying to normalize the results by the number of alternative alleles using this script:

OUTPUT_DIR="/project/sedmands_1143/ekirsch/savannahsparrow/savannahsparrow/41-Passerculus/41-Passerculus/output"
BASE_PATH="/project/sedmands_1143/ekirsch/savannahsparrow/savannahsparrow/41-Passerculus/41-Passerculus"
# Define VCF files
vcf_files=("low_impact_variants.vcf.gz" "moderate_impact_variants.vcf.gz" "high_impact_variants.vcf.gz")
# Read allele counts into a single associative array
declare -A alt_allele_count
for subspecies in "alaudinus" "brooksi" "beldingi" "nevadensis" "rostratus"; do
    allele_count_file="${BASE_PATH}/${subspecies}_allele_counts.txt"
    if [[ -f "$allele_count_file" ]]; then
        while IFS=$'\t' read -r chrom pos total_alleles ref_alleles alt_alleles; do
            key="${chrom}_${pos}"
            alt_allele_count["$key"]=$alt_alleles
        done < "$allele_count_file"
    else
        echo "Allele count file for $subspecies does not exist."
    fi
done
# Normalize counts using precomputed allele counts
for vcf_file in "${vcf_files[@]}"; do
    base_name=$(basename "$vcf_file" .vcf.gz)
    sample_count_file="${OUTPUT_DIR}/${base_name}_sample_counts.txt"
    norm_output_file="${OUTPUT_DIR}/${base_name}_normalized_sample_counts.txt"
    if [[ -f "$sample_count_file" ]]; then
        awk -F'\t' -v OFS='\t' '
        BEGIN {
            for (key in alt_allele_count) {
                split(key, arr, "_")
                chrom = arr[1]
                pos = arr[2]
                alt_count[chrom][pos] = alt_allele_count[key]
            }
        }
        NR==1 {print $0, "NormalizedCount"}
        NR>1 {
            sample=$1
            count=$2
            key = sample
            alt_count = (key in alt_count ? alt_count[key] : 0)
            if (alt_count > 0) {
                print $0, (count / alt_count)
            } else {
                print $0, 0
            }
        }' "${sample_count_file}" > "${norm_output_file}"
    else
        echo "Sample count file for ${base_name} does not exist."
    fi
done

Does this seem like a robust way to analyze SnpEff results? I have not worked with this type of data before so any advice for dealing with these results would be greatly appreciated :)

snpsift snpeff vcf • 871 views

ADD COMMENT • link updated 16 months ago by Ram 45k • written 16 months ago by ekirsch • 0

0

Entering edit mode

what is the scientific question you're trying to answer here ?

ADD REPLY • link 16 months ago by Pierre Lindenbaum 166k

0

Entering edit mode

I am trying to see if certain subspecies have more high-impact mutations than other subspecies due to long-term small population size or inbreeding.

ADD REPLY • link 16 months ago by ekirsch • 0