Guidance on Reference-Based Assembly and Variant Analysis of Exome Sequencing Data
7 weeks ago
Rohan ▴ 20

Hello,

I am working with paired-end whole genome sequencing (WGS) data focused on the exome. My goal is to assemble these reads based on the human reference genome (hg38), identify variants, and subsequently obtain an assembled sequence that includes these variations. The assembled sequence will be used to predict protein sequences, and then I'll perform a BLASTp search to identify and extract genes of interest. Specifically, I want to check for variations within these genes.

Here's the pipeline I've been using:

  1. Alignment: bwa-mem2 mem hg38.fa "$paired_end_1" "$paired_end_2" > "$aligned_reads"

  2. Convert SAM to BAM: samtools view -Sb "$aligned_reads" > "$bam_file"

  3. Sort BAM file: samtools sort "$bam_file" -o "$sorted_bam_file"

  4. Index BAM file: samtools index "$sorted_bam_file"

  5. Variant Calling: bcftools mpileup -f hg38.fa "$sorted_bam_file" | bcftools call -mv -Oz -o "$vcf_file"

  6. Index VCF file: tabix -p vcf "$vcf_file"

  7. Generate Consensus Sequence: bcftools consensus -f hg38.fa "$vcf_file" > "$consensus_sequence"

  8. Predict Open Reading Frames (ORFs): getorf -sequence "$consensus_sequence" -outseq "$orfs_file"

Questions:

  1. Quality of Assembled Genome: Are there any additional steps or alternative tools I should consider to ensure the highest quality assembly, especially focusing on capturing exonic variants accurately?
  2. Variant Filtering: What are the best practices for filtering variants in this context to minimize false positives and ensure that the identified variants are reliable?
  3. Gene Extraction and Variation Analysis: After generating the consensus sequence, is using getorf the best approach for ORF prediction, or are there more accurate tools available? Once I have the predicted protein sequences, what are the recommended workflows for performing BLASTp and extracting specific genes of interest for further variation analysis?
  4. Post-Processing and Analysis: What additional post-processing steps should I consider for a thorough analysis, particularly focusing on detecting and interpreting variations within specific genes?
  5. Tools and Software Versions: Are there specific versions of the tools mentioned above, or additional software, that you recommend for this kind of analysis? Are there any known issues or bugs with certain versions that I should be aware of?

Thank you in advance for your insights and recommendations!

samtools bcftools genome wgs bwa-mem • 341 views

Well, you can do steps 2, 3 and 4 in one command:

samtools sort aligned_reads.sam -o sorted_reads.bam --write-index

You can also pipe the output of step 1 directly into this combined command, which avoids writing the intermediate SAM file to disk.
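A rough sketch of what that pipe could look like, using the same hg38.fa reference as in the question. The function name `align_and_sort`, the `threads` argument, and the `DRY_RUN` switch are illustrative additions for this example, not part of the original pipeline; adjust them to your setup.

```shell
# Sketch only: stream bwa-mem2 output straight into samtools sort so that
# no intermediate SAM or unsorted BAM is written.
# Assumes hg38.fa has already been indexed with `bwa-mem2 index`.
# `align_and_sort`, `threads` and DRY_RUN are illustrative names.

# Make a failing aligner fail the whole pipe, where the shell supports it.
(set -o pipefail) 2>/dev/null && set -o pipefail

align_and_sort() {
    ref="$1"; r1="$2"; r2="$3"; out_bam="$4"; threads="${5:-4}"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it, to check it first.
        printf 'bwa-mem2 mem -t %s %s %s %s | samtools sort -@ %s -o %s --write-index -\n' \
            "$threads" "$ref" "$r1" "$r2" "$threads" "$out_bam"
    else
        # The trailing "-" tells samtools sort to read from standard input.
        bwa-mem2 mem -t "$threads" "$ref" "$r1" "$r2" |
            samtools sort -@ "$threads" -o "$out_bam" --write-index -
    fi
}

# Example call with the variables from the question:
#   align_and_sort hg38.fa "$paired_end_1" "$paired_end_2" "$sorted_bam_file" 8
```

As far as I know, `--write-index` produces a CSI index next to the BAM by default rather than a `.bai`, so check that your downstream tools accept it.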

For samtools and bcftools, use the latest version (1.20 at the time of writing).
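If you want the script to check this before it starts, something along these lines might work. It is a minimal sketch: the helper names `version_ge`, `installed_version` and `check_tool` are made up for this example, the comparison relies on `sort -V` (GNU coreutils), and it assumes the first line of `--version` looks like `samtools 1.20`.

```shell
# version_ge A B: succeed if dotted version A is >= B.
# Uses `sort -V` so that 1.9 is correctly treated as older than 1.20.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# installed_version TOOL: print the second field of the first line of
# `TOOL --version` (assumed format: "<tool> <version>").
installed_version() {
    "$1" --version | head -n 1 | awk '{print $2}'
}

# check_tool TOOL MIN: report an error if TOOL is missing or older than MIN.
check_tool() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$1 not found on PATH" >&2
        return 1
    fi
    have=$(installed_version "$1")
    if ! version_ge "$have" "$2"; then
        echo "$1 $have is older than $2" >&2
        return 1
    fi
}

# Example (1.20 is the version quoted above; pick your own minimum):
#   check_tool samtools 1.20 && check_tool bcftools 1.20
```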
