Hello!
I am hoping to get some recommendations for how to analyze the large output files I got after running SnpEff to compare genetic load across populations. After running SnpEff, I used SnpSift to create three different vcf files called high_impact_variants.vcf.gz
, low_impact_variants.vcf.gz
, and moderate_impact_variants.vcf.gz
. I then used the following script to create txt files containing the number of mutations of each impact class for each sample ID.
# Define paths
INPUT_DIR="/project/sedmands_1143/ekirsch/savannahsparrow/savannahsparrow/41-Passerculus/41-Passerculus/data"
OUTPUT_DIR="/project/sedmands_1143/ekirsch/savannahsparrow/savannahsparrow/41-Passerculus/41-Passerculus/output"
# Define VCF files
vcf_files=("low_impact_variants.vcf.gz" "moderate_impact_variants.vcf.gz" "high_impact_variants.vcf.gz")
# Extract sample IDs and counts of each type of mutation
for vcf_file in "${vcf_files[@]}"; do
base_name=$(basename "$vcf_file" .vcf.gz)
output_file="${OUTPUT_DIR}/${base_name}_sample_counts.txt"
echo -e "Sample\tCount" > "$output_file"
# Extract sample IDs and genotype information
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/ANN[\t%SAMPLE:%GT]\n' "$INPUT_DIR/$vcf_file" | \
awk -F'\t' '
{
for (i=6; i<=NF; i++) {
split($i, a, ":");
sample=a[1];
gt=a[2];
if (gt != "./.") {
count[sample]++;
}
}
}
END {
for (sample in count) {
print sample "\t" count[sample];
}
}' >> "$output_file"
done
Now I am trying to normalize the results by the number of alternative alleles using this script:
OUTPUT_DIR="/project/sedmands_1143/ekirsch/savannahsparrow/savannahsparrow/41-Passerculus/41-Passerculus/output"
BASE_PATH="/project/sedmands_1143/ekirsch/savannahsparrow/savannahsparrow/41-Passerculus/41-Passerculus"
# Define VCF files
vcf_files=("low_impact_variants.vcf.gz" "moderate_impact_variants.vcf.gz" "high_impact_variants.vcf.gz")
# Read allele counts into a single associative array
declare -A alt_allele_count
for subspecies in "alaudinus" "brooksi" "beldingi" "nevadensis" "rostratus"; do
allele_count_file="${BASE_PATH}/${subspecies}_allele_counts.txt"
if [[ -f "$allele_count_file" ]]; then
while IFS=$'\t' read -r chrom pos total_alleles ref_alleles alt_alleles; do
key="${chrom}_${pos}"
alt_allele_count["$key"]=$alt_alleles
done < "$allele_count_file"
else
echo "Allele count file for $subspecies does not exist."
fi
done
# Normalize counts using precomputed allele counts
for vcf_file in "${vcf_files[@]}"; do
base_name=$(basename "$vcf_file" .vcf.gz)
sample_count_file="${OUTPUT_DIR}/${base_name}_sample_counts.txt"
norm_output_file="${OUTPUT_DIR}/${base_name}_normalized_sample_counts.txt"
if [[ -f "$sample_count_file" ]]; then
awk -F'\t' -v OFS='\t' '
BEGIN {
for (key in alt_allele_count) {
split(key, arr, "_")
chrom = arr[1]
pos = arr[2]
alt_count[chrom][pos] = alt_allele_count[key]
}
}
NR==1 {print $0, "NormalizedCount"}
NR>1 {
sample=$1
count=$2
key = sample
alt_count = (key in alt_count ? alt_count[key] : 0)
if (alt_count > 0) {
print $0, (count / alt_count)
} else {
print $0, 0
}
}' "${sample_count_file}" > "${norm_output_file}"
else
echo "Sample count file for ${base_name} does not exist."
fi
done
Does this seem like a robust way to analyze SnpEff results? I have not worked with this type of data before so any advice for dealing with these results would be greatly appreciated :)
what is the scientific question you're trying to answer here ?
I am trying to see if certain subspecies have more high-impact mutations than other subspecies due to long-term small population size or inbreeding.