Question

Short reads to identify structural variants

1

17 days ago

priya.bmg ▴ 70

Hello,

I have identified structural variants (SVs) from long-read whole-genome data in a family. I want to check whether these variants are also present in additional members of the same family by analyzing their short-read whole-genome data. For one individual I have both short-read and long-read data, so I first ran short-read SV callers (Manta, DELLY and other tools) on that individual. Very few of the short-read SV calls overlapped the long-read calls.
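
Part of a low overlap can come from how calls are compared: the same event is often reported with shifted breakpoints or a slightly different length, especially in repeats, so exact-coordinate matching undercounts. A minimal sketch of a tolerance-based comparison, assuming each call has already been reduced to chromosome, position, type and length. The `SV` tuple and the `max_dist` and `min_size_ratio` thresholds are illustrative assumptions, not settings from any of the tools above:

```python
from typing import List, NamedTuple


class SV(NamedTuple):
    chrom: str
    pos: int      # breakpoint / start position
    svtype: str   # e.g. "INS" or "DEL"
    svlen: int    # absolute length in bp


def same_event(a: SV, b: SV, max_dist: int = 500,
               min_size_ratio: float = 0.7) -> bool:
    """True if two calls plausibly describe the same SV (assumed thresholds)."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    small, large = sorted((a.svlen, b.svlen))
    return large > 0 and small / large >= min_size_ratio


def matched_calls(long_calls: List[SV], short_calls: List[SV], **kw) -> List[SV]:
    """Long-read calls that have at least one compatible short-read call."""
    return [lc for lc in long_calls
            if any(same_event(lc, sc, **kw) for sc in short_calls)]


# Made-up example calls, not real data.
long_calls = [SV("chr1", 10_000, "INS", 120), SV("chr2", 5_000, "DEL", 300)]
short_calls = [SV("chr1", 10_040, "INS", 100), SV("chr2", 9_000, "DEL", 300)]
print(matched_calls(long_calls, short_calls))
```

Here only the chr1 insertion is matched: its breakpoint is 40 bp away and the sizes agree within the ratio, while the chr2 deletion is 4 kb away.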

I also tried aligning the short reads using the long reads as the reference. In tandem-repeat regions, the aligner misaligns the short reads. Is there a better way to check whether the SVs identified from long-read data are present in the short-read data, for example by using a pangenome as the reference?

Thank you

Priya

SV long-read short-read • 434 views

ADD COMMENT • link 7 days ago by priya.bmg ▴ 70

0

When you say the aligner "misaligns" the data, what do you mean?

The concept of misalignment is often "misused" (pun intended ;-)

If the short-read alignment is mathematically correct, then it is not an issue of "misalignment" but of the placement not being uniquely determined. What I am saying here is that it is a problem of the data rather than the method.

A short read from a repeat region cannot be mapped to its original location because we have no information about which repeat copy it originated from. Our best bet would be some sort of guess.
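
A toy illustration of this ambiguity, using a made-up sequence rather than real data: a read drawn entirely from inside a tandem repeat matches equally well at several offsets, while a read that reaches into unique flanking sequence has a single placement. The function name and sequences are assumptions for the example only:

```python
from typing import List


def exact_placements(read: str, reference: str) -> List[int]:
    """Return every 0-based offset where `read` matches `reference` exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


unit = "CAGGT"
# Unique left flank + six copies of the repeat unit + unique right flank.
reference = "ACGTTGCA" + unit * 6 + "TTGACCAT"

repeat_read = unit * 2              # lies entirely inside the repeat
anchored_read = "TGCA" + unit * 2   # overlaps the unique left flank

print("repeat-only read:", exact_placements(repeat_read, reference))
print("anchored read:   ", exact_placements(anchored_read, reference))
```

The repeat-only read has five equally good placements, so any single reported position is a guess; the anchored read has exactly one.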

ADD REPLY • link 16 days ago by Istvan Albert 102k

0

Thank you for this nice insight. Do you have any suggestions on methods to confirm whether the structural variants identified in long reads are also present in short-read data, at least for structural variants in the 50-500 bp size range?

That said, I attempted a few things which seem incorrect in hindsight. I took insertion breakpoints identified from long-read data (in BED format) and used the hg38 reference to extract the corresponding regions, including the inserted sequences along with flanking sequences, as a FASTA file. I then aligned the short-read FASTQ files to this extracted FASTA using BWA-MEM to see whether any reads would align across the insertion.

Short reads aligned well within the inserted sequence but showed poor alignment in the flanking regions. I now realize that since the inserted sequences do not exist in the reference, extracting them this way (based on reference coordinates) may not accurately represent the real genomic context.
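
One way to represent the real genomic context is to build two small contigs per insertion: a reference allele (left flank joined directly to right flank, as in hg38) and an alternate allele (left flank, then the inserted sequence reported by the long-read caller, then right flank). A minimal sketch with a toy sequence; the function names, the flank length and the 0-based breakpoint convention are assumptions for illustration, and in practice the flanks would come from hg38 and the inserted sequence from the long-read call set:

```python
from typing import Tuple


def build_alleles(chrom_seq: str, breakpoint: int, inserted: str,
                  flank: int = 10) -> Tuple[str, str]:
    """Return (ref_allele, alt_allele) around an insertion.

    `breakpoint` is the 0-based index of the first reference base
    after the insertion point (an assumed convention).
    """
    left = chrom_seq[max(0, breakpoint - flank):breakpoint]
    right = chrom_seq[breakpoint:breakpoint + flank]
    return left + right, left + inserted + right


def to_fasta(name: str, seq: str) -> str:
    """Format one sequence as a single-line FASTA record."""
    return f">{name}\n{seq}\n"


# Toy data: stands in for a chromosome and a long-read insertion call.
toy_chrom = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
ref_allele, alt_allele = build_alleles(
    toy_chrom, breakpoint=16, inserted="TTTTGGGG", flank=8)

print(to_fasta("ins1_ref", ref_allele) + to_fasta("ins1_alt", alt_allele), end="")
```

Short reads that span either junction of the alternate contig would then support the insertion, and reads spanning the reference junction would support its absence. In tandem repeats the flanks probably need to extend past the repeat so that reads can anchor in unique sequence.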

ADD REPLY • link 8 days ago by priya.bmg ▴ 70

0

"Short reads aligned well within the inserted sequence but showed poor alignment in the flanking regions."

If you have short reads that cover that insertion, you should not see this. Have you checked whether the alignments you do see for those reads have soft-clipped parts? IGV may not show those by default.
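
Soft clips can also be checked directly from the CIGAR column of the SAM/BAM records rather than by eye. A minimal sketch, assuming the CIGAR strings have already been extracted (for example from `samtools view` output); the helper name `soft_clips` is hypothetical:

```python
import re
from typing import Tuple

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(cigar: str) -> Tuple[int, int]:
    """Return (leading, trailing) soft-clipped lengths for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) may sit outside soft clips; drop them first.
    while ops and ops[0][1] == "H":
        ops.pop(0)
    while ops and ops[-1][1] == "H":
        ops.pop()
    lead = ops[0][0] if ops and ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return lead, trail


# Made-up CIGAR strings for illustration.
for cigar in ["150M", "40S110M", "95M55S", "30S90M30S"]:
    print(cigar, soft_clips(cigar))
```

Reads with a long clip on one side pile up at a breakpoint; reads clipped on both sides, as in the last example, align only in their middle portion.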

ADD REPLY • link 8 days ago by GenoMax 150k

0

These insertions are in tandem-repeat regions, which could have led to the poor alignment in the flanking regions. The CIGAR strings also showed reads being soft-clipped at both breakpoint regions.

ADD REPLY • link 7 days ago by priya.bmg ▴ 70

0

"by analyzing the short-reads whole genome data"

Is that because you don't have long-read data for those samples?

"I have the short-read as well as the long-read data from the same individual."

Is this the initial sample in which you identified the variants that you would now like to look for in other samples?

ADD REPLY • link 8 days ago by GenoMax 150k

0

Yes, I am working with families. We did whole-genome long-read sequencing for three individuals in each family. To confirm whether the structural variants identified by long reads are also present in other individuals in each family, I plan to use the available short-read data. For one sample, we have both short-read and long-read data, but the structural variants identified by long reads could not be seen in the short-read data.

ADD REPLY • link 7 days ago by priya.bmg ▴ 70