Hello everyone
I would like to know other people's opinions about this, because I mainly do RNA-Seq analysis, and I would appreciate some help :)
I have sequenced a human genome with mate-pair reads, and the insert size is supposedly large enough to find big indels easily (specifically, I am looking for large insertions and/or recombinations between different chromosomes). I am wondering what the best pipeline for this would be, and how to find these structural variants. There are many tools and options, and I am not sure which would be the best or the most usual way to proceed.
Thank you all for your help and patience. I would love to hear your opinion!
What's your estimated coverage? What is the read length?
The read length is 125 bp. The average alignment insert size is 7.6 kb.
I haven't got an estimated coverage right now. Let's suppose it is enough; the sequencing report says it has about 200k million reads :)
P.S.: Thank you very much
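If it helps, coverage is easy to estimate from the report yourself. A minimal sketch, where the read count and genome size are placeholder assumptions (only the 125 bp read length comes from this thread); plug in your own read count:

```python
# Back-of-the-envelope coverage estimate for human WGS.
read_length = 125            # bp, from this thread
n_reads = 800_000_000        # hypothetical total read count; use your report's number
genome_size = 3_100_000_000  # approximate human genome size in bp

# Coverage = total sequenced bases / genome size
coverage = n_reads * read_length / genome_size
print(f"Estimated coverage: {coverage:.1f}x")
```

With these placeholder numbers that comes out to roughly 32x, which would be plenty for read-pair-based SV calling.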
Just to get this straight, I see you wrote "My intention is to assembly these reads with the reference genome". Do you mean mapping to the genome or de novo assembly of the reads? A de novo assembly is an approach worth trying, but I'm not sure you'll get better results than reference mapping in this case. Dedicated software for finding structural variants in WGS data does exist.
Even if you perform a de novo assembly, you will still have to compare it to a reference genome to detect your variants...
I have just edited it; sorry for the confusion. I'll try to be clearer:
-I am looking for a duplication of a region of a chromosome in another chromosome.
-I guess the best way to proceed would be de novo assembly. Like you, I am not sure it will be of use with these reads, but I had no option other than this sequencing method.
-To limit the computational resources needed, would it be a good approach to do a strict mapping against a reference genome, recover the unaligned reads, assemble them de novo, and then map the resulting contigs back against the reference? I guess that with this I would at least determine the boundaries of the duplicated region and which chromosome it is found in (the middle of the region would map to the original chromosome). What would you do?
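The "map strictly, then assemble the leftovers" idea above could be sketched roughly as below. All tool choices (bwa, samtools, SPAdes), thread counts, and file names are assumptions, not a prescription; the commands are only printed, not executed:

```python
# Hypothetical pipeline: map -> extract unmapped pairs -> assemble -> remap contigs.
ref = "GRCh38.fa"                        # assumed reference FASTA, already indexed
r1, r2 = "reads_1.fq.gz", "reads_2.fq.gz"

steps = [
    # 1. Map the read pairs and coordinate-sort the output.
    f"bwa mem -t 8 {ref} {r1} {r2} | samtools sort -o mapped.bam -",
    # 2. Keep pairs where neither mate mapped (-f 12 = read unmapped AND mate unmapped).
    #    Note this discards half-mapped pairs, which are themselves useful SV evidence.
    "samtools view -b -f 12 mapped.bam -o unmapped.bam",
    "samtools fastq -1 un_1.fq -2 un_2.fq unmapped.bam",
    # 3. Assemble the unmapped reads de novo (SPAdes as one option).
    "spades.py -1 un_1.fq -2 un_2.fq -o unmapped_asm",
    # 4. Map the resulting contigs back to locate the breakpoint flanks.
    f"bwa mem {ref} unmapped_asm/contigs.fasta > contigs_vs_ref.sam",
]
for cmd in steps:
    print(cmd)
```

One caveat with this shortcut: reads spanning a duplication breakpoint may still map (softclipped) to the original copy, so the unmapped fraction alone may miss part of the signal.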
I think de novo assembly is great, and this would be the way to get the most out of a genome. But I would use long-read sequencing technology for this (Oxford Nanopore/PacBio) ;-)
Right now, I think the easiest and quickest option would be to go with a tool like BreakDancer (https://omictools.com/breakdancer-tool) (just one of the first I found after a Google search). There will probably be some documentation and support for tools like that.
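For what it's worth, a typical BreakDancer run is short; the sketch below just prints the usual two-step invocation. Option names are from memory and may differ by version, so do check the BreakDancer documentation, and the input BAM is assumed to be coordinate-sorted and indexed:

```python
# Hypothetical two-step BreakDancer run on an aligned BAM.
cmds = [
    # Build the per-library insert-size config from the BAM.
    "bam2cfg.pl -g -h aligned.bam > breakdancer.cfg",
    # Call structural variants; inter-chromosomal events are reported as CTX.
    "breakdancer-max breakdancer.cfg > sv_calls.txt",
]
print("\n".join(cmds))
```

Since it works from discordant read pairs, your 7.6 kb insert size is exactly the kind of signal it uses to flag large insertions and translocations.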
Yes, I would have used it, too :)
Thank you very much for your help, Wouter!!
I'm curious how it will work for you!
Thank you for everything. I was a bit rusty in this area :) I'll tell you how it went once I finish!
One last thing, if you don't mind giving your opinion. I have decided how to proceed in order to minimize the computational resources needed for this :)
I will map the reads against the reference genome, but with the two chromosomes involved in this syndrome removed from it. I will then assemble the remaining unmapped reads de novo, and check the results against the two reference chromosomes of interest. My only concern is that, if everything goes well, there will be two possible scaffolds for each chromosome. Let's say we have chromosomes O and P with WT sequences 123 and 456, respectively; in this case we have an insertion of a duplicated region, giving 123 and 4526. I am afraid the assembler will also give 126 and 4523 as possible contigs. In that case... I do not know how to explain why I remove the "false" contigs, but I hope I will be able to :)
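For the reference-subsetting step, one common way (assuming samtools; the chromosome names below are the placeholder "O"/"P" from the post) is to list every sequence name from the `.fai` index except the two you want to drop, and pass the rest to `samtools faidx`:

```python
# Build a command that writes the reference minus two chromosomes.
# Sequence names would normally come from the first column of ref.fa.fai;
# this list is a hypothetical stand-in.
fai_names = ["chr1", "chr2", "chrO", "chrP", "chrX"]
exclude = {"chrO", "chrP"}  # the two chromosomes involved in the syndrome

keep = [name for name in fai_names if name not in exclude]
cmd = "samtools faidx ref.fa " + " ".join(keep) + " > ref_subset.fa"
print(cmd)
```

Remember to re-index the subset FASTA before mapping against it.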