Hello,
I have 12 WGS samples that are part of a big pedigree, I need to pinpoint genomic regions ~1 Mbp that are identical between some or many or all of them. I've never done analyses like these before and it's difficult to Google because I don't know the words people use for these things. I believe "linkage analysis" is close, but seems like that's more about judging whether two SNVs are in disequilibrium or not. But isn't that what phasing is?
Ahh. Please point a finger in the right direction! I usually manage to figure things out once I get started, but I'm not even sure where to start here. :-]
Thanks in advance, Joel
I think, that synteny is probably the term you are looking for...it refers to the conserved order of aligned genomic blocks between species.
if e.g. the 12 WGS samples are same species, synteny would probably overcomplicate things... it may make sense to look for "low nucleotide diversity" https://www.york.ac.uk/res/dasmahapatra/teaching/MBiol_sequence_analysis/workshop4_2019.html#nucleotide_diversity
this tool
pixy
might be a good tool to analyze it as an alternative to that tutorial too https://pixy.readthedocs.io/en/latest/plotting.htmlHey that's perhaps exactly what's needed hmm... so I would identify the identical 1Mbp regions by clear drops in the pi statistic?
Assuming your data is assembled use the programs mentioned in this thread: What is the best tool to do a dot-plot alignment of two chromosomes?
Thanks for your reply Genomax! My data is not assembled I'm afraid, it's currently in fastq, bam, and vcf format. I heard a term "IBD", Identity By Descent, do you think this is what I'm looking for? It seems promising.
I assume your reads aren't 1Mb in size so they would need to be assembled/aligned first
Absolutely, but don't you think there's another way? I doubt assembling every WGS sample is the most efficient approach!
There is a number of software packages listed for IBD in Wiki entry https://en.wikipedia.org/wiki/Identity_by_descent You can see if any will take unassembled reads.
How related are these samples (are they human?) since
They are all human and related within three or four generations. Thank you very much for that list, I will pick out a few and try!
Assemble one genome and map the reads of the other samples against it with e.g. Bowtie2. Use Bedtools multiinter to extract the shared regions which meet your threshold..
Hi Heikki, that's an interesting approach, I will return to it if the IBD programs don't work out. Thank you!
Hello again 5heikki, I'm beginning to suspect that the IBD softwares I've found are meant for much larger samples than 12, so all the statistics can be calculated properly. That means I should try your approach instead. I will do so. Have you done it before? I imagine you have. Did it work well?
Also, all samples are human. Is assembly still necessary? And, how would I choose the sample for assembly? The one with best coverage? Hmm...
And uh, if I choose one sample to assemble, I will essentially pretend that sample was haploid, no? There must be another way... hmm. Perhaps I can just use multiinter on all my bam files?
Hi, I thought you were working with bacterial genomes. Sorry, can't help really..
No worries m8, thanks for trying!
Hey Genomax, I have fastq and bam, I could create assemblies of all samples and dotplot them manually... but surely that's insane? Also, while assembling I would collapse the data to haploidity(?). Hmm...
what kind of reads do you have... if it's short-reads... welp good luck with the phasing.
why don't you use GATK to call variants and then group them up in R or python based on megabase windows manually?
Hey, standard 150bp Illumina reads.
I am calling variants in all samples with GATK as we speak, but what do you mean with "group up on megabase windows"?
I would do this in R myself with GRanges. Essentially, convert the coordinates to like chr3:3842934_G>A and then do a sliding window (width = 1 MB) across the genome (if thats rough computationally you could try tiling) and make like a table with each variant in that region and then find the individuals that exhibit like 95% or same SNPs for each window. You might have to think about what filters/variant allele freq. cutoff you want to use and whether 95% is too much/little to work with.
Hey, sorry for late reply, I need to enable notifications somehow...
This sounds to me like a straightforward comparison of SNPs - I could do it in e.g. Python too, right? Or is GRanges doing some magic behind the scenes? Also, I suspect this analysis is included in Pixy: https://pixy.readthedocs.io/en/latest/plotting.html
One of the calculated statistics is variation within a population - surely it is based on comparing SNPs, like you suggest?
Thank you for your time and help!
Yeah you could do it with any language. Using Bioconductor packages, it would take me 2-3 hours probably in R with GRanges/VariantAnnotation to do the whole thing, if you're not familiar with R then I would not suggest that route.
Understood. Thank you Ben! :)