I’m new to bioinformatics and working with a human whole genome sequencing dataset in FASTQ format. My goal is to identify strain-specific integration sites for HPV and EBV in the genome, if any exist at all. I’d greatly appreciate beginner-friendly guidance on how to approach this, including any specific command-line examples that I can follow to learn.
I’m particularly interested in which tools and resources are commonly used for this type of analysis, and how to interpret the results to distinguish true integration events from noise or sequencing artifacts. If you have recommendations for tools or workflows, especially ones that allow strain-specific detection, that would be fantastic. Also, any tips on where to find reliable reference genomes for HPV and EBV would be helpful.
Thank you for your time and advice!
You can search for HPV and EBV reference genomes at NCBI Datasets: https://www.ncbi.nlm.nih.gov/datasets/genome/
Choose the RefSeq version of a genome when one is available. If there is none, use the GenBank version.
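If you would rather script the download than click through the website, here is a minimal sketch that fetches FASTA records through NCBI's E-utilities `efetch` endpoint. The accessions in `ACCESSIONS` are my assumption of the RefSeq records for HPV16, HPV18, and EBV type 1 and type 2. Please confirm each one on its NCBI page before relying on it, and add any other types you care about.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities endpoint for fetching sequence records.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Assumed example accessions (not from the original post); verify before use.
ACCESSIONS = {
    "HPV16": "NC_001526.4",
    "HPV18": "NC_001357.1",
    "EBV_type1": "NC_007605.1",
    "EBV_type2": "NC_009334.1",
}


def efetch_url(accession):
    """Build the efetch URL that returns one nucleotide record as FASTA text."""
    query = urlencode(
        {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def download_fasta(accession, out_path):
    """Download one record and write it to out_path (needs network access)."""
    with urlopen(efetch_url(accession)) as response, open(out_path, "wb") as out:
        out.write(response.read())


# Example usage (uncomment to actually download):
# for name, accession in ACCESSIONS.items():
#     download_fasta(accession, f"{name}.fasta")
```

Keeping one FASTA per viral type makes it easier to tell later which type a read matched, which is what you need for strain-specific calls.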
Similar analyses have been published before, which should help you get started:
https://onlinelibrary.wiley.com/doi/pdf/10.1002/ctm2.971
https://pmc.ncbi.nlm.nih.gov/articles/PMC9376973/
https://academic.oup.com/bioinformatics/article/37/20/3405/6278295
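To give a feel for the general idea behind this kind of analysis, here is a rough sketch. It is not the method of any of the papers above. It assumes you have aligned your reads to a combined human plus virus reference with an aligner that writes `SA` tags for split reads (BWA-MEM does) and that you have the alignments as SAM text. The contig names in `VIRAL_CONTIGS`, the MAPQ cutoff, and the bin size are all placeholder assumptions. The script looks for reads mapped to a human contig whose mate or supplementary alignment lands on a viral contig, then groups them by position so you can see how many distinct reads support each candidate site.

```python
# Assumed viral contig names in the combined reference (hypothetical; adjust).
VIRAL_CONTIGS = {"NC_001526.4", "NC_007605.1"}


def candidate_junctions(sam_lines, min_mapq=20):
    """Yield (kind, read_name, human_contig, human_pos, viral_contig) tuples.

    kind is "pair" when the mate maps to a viral contig, and "split" when the
    SA tag reports a supplementary alignment on a viral contig.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        pos, mapq, rnext = int(fields[3]), int(fields[4]), fields[6]
        # Skip unmapped (0x4), secondary (0x100), and duplicate (0x400) records.
        if flag & 0x4 or flag & 0x100 or flag & 0x400 or mapq < min_mapq:
            continue
        if rname in VIRAL_CONTIGS:
            continue  # report each event from the human side only
        # Discordant pair: this read is on human, its mapped mate is on virus.
        if not flag & 0x8 and rnext in VIRAL_CONTIGS:
            yield ("pair", qname, rname, pos, rnext)
        # Split read: optional fields look like TAG:TYPE:VALUE.
        tags = {t[:2]: t[5:] for t in fields[11:]}
        for entry in tags.get("SA", "").rstrip(";").split(";"):
            contig = entry.split(",")[0]
            if contig in VIRAL_CONTIGS:
                yield ("split", qname, rname, pos, contig)


def support_by_bin(junctions, bin_size=500):
    """Group candidates into fixed-width bins and collect distinct read names."""
    support = {}
    for _kind, qname, human_contig, human_pos, viral_contig in junctions:
        key = (human_contig, human_pos // bin_size * bin_size, viral_contig)
        support.setdefault(key, set()).add(qname)
    return support
```

You could feed it the output of `samtools view your.bam` line by line. As a rule of thumb, a site supported by a single read, by low-MAPQ reads, or only by reads in repetitive human sequence is more likely noise or a library chimera than a real integration. Sites backed by several distinct split reads plus discordant pairs are better candidates, and are worth inspecting in a genome browser and confirming with one of the published tools.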