I want to investigate genomic rearrangements, for example inversions, deletions, insertions and duplications, in the leprosy genome. I have access to raw sequences from the sequencer and to reads that were already mapped to the Mycobacterium leprae genome (NC_002677.1). All methods and programs that I have found, like LUMPY and Breseq, are based on split-read and read-pair approaches. But since I am in the ancient DNA field, there are several problems with the raw data:
- We have no defined insert size, since the DNA is already highly fragmented.
- Read lengths differ, ranging from 25 bp to 101 bp.
Our normal post-processing step is to clip and merge: forward and reverse reads with an overlap of at least 10 bp are merged to get a longer sequence. After this step there is a mixed file of merged and unmerged paired-end reads.
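Conceptually, the merge step works roughly like this (a toy Python sketch with made-up names; real merging tools score the overlap with an error model rather than requiring an exact match):

```python
def merge_overlap(fwd_seq, fwd_qual, rev_seq, rev_qual, min_overlap=10):
    """Toy clip-and-merge: fwd_seq is the forward read, rev_seq is the
    reverse read already reverse-complemented, quals are Phred scores.
    Scans for the longest exact overlap of at least min_overlap bases and
    resolves each overlapping position by keeping the higher-quality base.
    Returns (seq, quals), or None if the pair cannot be merged."""
    for olen in range(min(len(fwd_seq), len(rev_seq)), min_overlap - 1, -1):
        if fwd_seq[-olen:] == rev_seq[:olen]:
            seq = list(fwd_seq[:-olen])
            qual = list(fwd_qual[:-olen])
            for i in range(olen):
                f, r = len(fwd_seq) - olen + i, i
                if fwd_qual[f] >= rev_qual[r]:
                    seq.append(fwd_seq[f]); qual.append(fwd_qual[f])
                else:
                    seq.append(rev_seq[r]); qual.append(rev_qual[r])
            seq.extend(rev_seq[olen:])
            qual.extend(rev_qual[olen:])
            return "".join(seq), qual
    return None  # overlap too short or absent: keep as a normal pair
```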
I recently asked a colleague of mine and she suggested looking for rearrangements in IGV: do a de novo assembly, map the reads back to the assembly, and look at how the coverage differs between the mapping against the contigs and the mapping against the reference. But first, this is a very time-consuming step, and second, it seems very subjective.
Is there any other approach that works with the data I have access to?
Best and thanks in advance :)
There are some preprocessing steps that I'm not quite clear on. To clarify, you are doing the following:
- 2x101bp Illumina paired-end sequencing
- adapter trimming which can reduce your reads to as short as 25bp due to highly fragmented source material
- merge overlapping read pairs into a longer synthetic read
- what level of contamination do you have, and how are you dealing with it?
Are you error correcting the overlapping bases?
You are correct in asserting that there isn't much of a signal in your data for SV tools to work with. Since your genome is so small, de novo assembly is not computationally prohibitive, so that's probably worth trying, but you're going to lose most of your reads since they're so short. You'll also assemble a whole lot of junk if you have a high level of contamination. If you're interested in ancient novel sequence that has since been lost, this is probably your only option.
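If you do go down the assembly route, it's at least easy to quantify how many reads you'd lose to a length cutoff; a quick Biopython sketch (file names and the 50 bp cutoff are placeholders, not recommendations):

```python
from Bio import SeqIO

# Keep only merged/trimmed reads long enough to be useful for assembly.
MIN_LEN = 50  # placeholder cutoff

with open("merged_reads.fastq") as fin, open("assembly_input.fastq", "w") as fout:
    kept = SeqIO.write(
        (rec for rec in SeqIO.parse(fin, "fastq") if len(rec) >= MIN_LEN),
        fout, "fastq",
    )
print(f"kept {kept} reads >= {MIN_LEN} bp")
```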
If you use a traditional SV tool, then you're just going to have to deal with the low signal-to-noise ratio of your data. Pindel works pretty well with short reads, since it was developed in the era of 2x36bp sequencing, but it can only detect certain events and has a high false discovery rate. Another option is to use GRIDSS (disclaimer: my software) and see how much signal it can pull out of the noise. GRIDSS augments the traditional read-pair and split-read signals with a breakend assembly step. Essentially, it attempts an assembly at every genomic locus where there might be an SV event, then does split-read analysis on the breakend assembly contigs to identify the other loci involved. This approach allows some of the reads that are ignored in traditional split-read analysis (such as those with short soft clips) to contribute to the variant call.
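GRIDSS itself does far more than this, but to illustrate the kind of soft-clip signal that breakend assembly builds on, here's a pysam sketch that tallies where soft clips pile up (the BAM file name is a placeholder):

```python
import collections
import pysam

# Count reads whose alignment is soft-clipped at either end, keyed by the
# genomic coordinate of the clip. Pile-ups of clips at one position are
# candidate breakends. "mapped.bam" is a placeholder for a sorted,
# indexed BAM.
clip_counts = collections.Counter()
with pysam.AlignmentFile("mapped.bam") as bam:
    for read in bam:
        if read.is_unmapped or read.is_secondary or read.cigartuples is None:
            continue
        # CIGAR op 4 == soft clip; check both ends of the alignment
        if read.cigartuples[0][0] == 4:
            clip_counts[(read.reference_name, read.reference_start)] += 1
        if read.cigartuples[-1][0] == 4:
            clip_counts[(read.reference_name, read.reference_end)] += 1

for (chrom, pos), n in clip_counts.most_common(20):
    print(chrom, pos, n)
```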
With such short fragments, there's not much of a signal to start with, but it's still better than manually inspecting 3Mb of sequence in IGV.
Yep, that's the preprocessing we do. It is basically the preprocessing step described in the EAGER paper (EAGER: efficient ancient genome reconstruction). Reads are only merged if the overlapping bases show quality scores above a certain threshold; if the overlap is not long enough or the bases are of bad quality, there is no merging. In terms of contamination, we try to deal with it at the bioinformatic level, but it is not that easy. I have tried several different tools myself, but due to the nature of ancient DNA most of them won't work properly. Most of the time it's just careful alignment and inspecting the alignments afterwards.
I'll try out GRIDSS now. Thank you for your suggestion :)
Thank you, I downloaded it and tried to get it working but encountered the following error:
ERROR 2016-10-26 12:11:38 IdsvSamFileMetrics No pair-end insert size metrics found in ./G507_UDG.combined_rmdup.sorted.idsv.working/G507_UDG.combined_rmdup.sorted.idsv.metrics.insertsize.txt.
ERROR 2016-10-26 12:11:38 InsertSizeDistribution Unable to extract insert size distribution from ./G507_UDG.combined_rmdup.sorted.idsv.working/G507_UDG.combined_rmdup.sorted.idsv.metrics.insertsize.txt
ERROR 2016-10-26 12:11:38 Idsv gridss currently supports only FR read pair orientation. If usage with other read pair orientations is required, please raise an enchancement request at https://github.com/PapenfussLab/gridss/issues
[Wed Oct 26 12:11:38 CEST 2016] au.edu.wehi.idsv.Idsv done. Elapsed time: 2.45 minutes.
So the program won't run without a defined insert size?
Single-end sequencing should be fine (there's just not much of a signal), but GRIDSS doesn't support mate-pair libraries, as the read-pair logic (currently) assumes standard FR (forward-reverse) Illumina paired-end read orientation. It's possible that the orientation is being inferred as RF due to untrimmed overlapping reads, or it's a spurious error and I need to add test coverage for single-end libraries. If you could raise an issue at the link given in the error message and attach the G507_UDG.combined_rmdup.sorted.idsv.metrics.insertsize.txt file, I can look into why it has decided that your library isn't in the expected orientation.
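In the meantime, you can check which pair orientations your BAM actually contains with something like this pysam sketch (file name is a placeholder):

```python
import collections
import pysam

# Tally read-pair orientations: FR is standard Illumina paired-end, RF
# shows up for mate-pair libraries (and can be inferred spuriously from
# untrimmed overlapping pairs). "mapped.bam" is a placeholder.
orient = collections.Counter()
with pysam.AlignmentFile("mapped.bam") as bam:
    for read in bam:
        if (not read.is_paired or read.is_unmapped or read.mate_is_unmapped
                or read.is_secondary or read.is_supplementary
                or not read.is_read1
                or read.reference_id != read.next_reference_id):
            continue
        if read.is_reverse == read.mate_is_reverse:
            orient["FF/RR"] += 1
        elif read.reference_start <= read.next_reference_start:
            # This read is leftmost: forward-first means FR
            orient["RF" if read.is_reverse else "FR"] += 1
        else:
            # Mate is leftmost: judge orientation from the mate's strand
            orient["RF" if read.mate_is_reverse else "FR"] += 1
print(orient)
```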
I have to say, most approaches to SV detection depend on paired-end data, so with no insert size there is already a loss of usable information.
Also, since roughly 30 bases (it starts at around 24-26 bp) is the minimum you need to get correct mapping information, you would lose some data. Even with split reads, if your read is 50 bp and the split is 30+20, the mapping ambiguity for the 20 bp segment is high; it is almost unmappable. You do get information from longer reads: one developer (the GRIDSS author) suggested, after various levels of testing, that you need at least 70 bp to actually trust breakpoints.
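If you want to see how often this bites in your own data, you can parse the SA tags and check the length of each split segment; a pysam sketch (file name is a placeholder, and a real check would also look at mapping quality):

```python
import re
import pysam

# For every split read (primary alignment carrying an SA tag), compare the
# aligned length of the primary segment with each supplementary segment.
# Splits where either side is under ~30 bp are likely to map ambiguously.
# "mapped.bam" is a placeholder.
CIG = re.compile(r"(\d+)([MIDNSHP=X])")

def aligned_len(cigar):
    # Query bases actually aligned: M/=/X matches plus I insertions.
    return sum(int(n) for n, op in CIG.findall(cigar) if op in "MI=X")

with pysam.AlignmentFile("mapped.bam") as bam:
    for read in bam:
        if (read.is_unmapped or read.is_secondary or read.is_supplementary
                or not read.has_tag("SA")):
            continue
        primary = read.query_alignment_length
        for sa in read.get_tag("SA").rstrip(";").split(";"):
            supp = aligned_len(sa.split(",")[3])  # 4th SA field is the CIGAR
            if min(primary, supp) < 30:
                print(f"{read.query_name}: {primary}+{supp} bp split, "
                      "short side likely ambiguous")
```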
I think you might be using the leeHom post-processing; a 10 bp minimum for merging overlapping PE data is quite low. FLASH uses 25 bp, and they report an error rate close to 0.25%.
So whatever you might want to do with the existing SV-detection pipelines will be limiting. Also, you have an error rate which might prove difficult at such small read lengths. As your colleague suggested, the best way is to go for at least local assemblies where you expect breakpoints, or a complete de novo assembly. It might not be that subjective, and it might be time-consuming, but we are not that close to push-button bioinformatics anyway :)
We've had pretty decent luck with structural variation in bacteria using split-read alignments from bwa mem, then generating BigWig files from alignments filtered to contain only split-read alignments as a coverage metric. As Rohit mentions, you would probably want to use only reads >= 70nt long.
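The filtering step looks roughly like this (a pysam sketch; file names are placeholders, and the coverage-to-BigWig conversion happens afterwards, e.g. with deeptools' bamCoverage):

```python
import pysam

# Keep only split-read alignments (records carrying an SA tag) from reads
# at least 70 nt long, then index the result; coverage of this BAM is what
# gets turned into the BigWig. File names are placeholders.
with pysam.AlignmentFile("mapped.bam") as bam, \
        pysam.AlignmentFile("splits.bam", "wb", template=bam) as out:
    for read in bam:
        # query_length can understate hard-clipped supplementary records,
        # so the cutoff is approximate for those.
        if (not read.is_unmapped and read.has_tag("SA")
                and read.query_length >= 70):
            out.write(read)
pysam.index("splits.bam")  # then e.g. deeptools bamCoverage -> BigWig
```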
In our case these were also paired-end reads, but frankly the pairing was only used to corroborate the possible sites; read depth with bacterial sequences is normally high enough that the splits piled up to a degree where we could just look for spikes of reads using the coverage information in IGV. From this we were able to identify the cause of a pleiotropic mutation (a transposon insertion in a gene).
Is GRIDSS publicly available? Because I can't find it with Google.
https://github.com/PapenfussLab/gridss