I am trying to estimate the amount of double-stranded Alu elements in order to compare two conditions. My very coarse approach was to align the reads to repeat element sequences and then count how many reads map sense or antisense to each element. Summed over all repeat elements, there is a bias towards sense-mapping reads (sense/antisense ratio ~1.3). In the Alu elements, however, the ratio is close to 1, with a shift in one of the conditions, which matches the experimental hypothesis.
The issue is that, given the repetitive nature of these elements, I am having second thoughts about whether this approach is valid at all.
Briefly, my approach to calculate strand bias in repeat elements:
- rRNA-depleted RNA-seq, stranded library
- reads were mapped to transposable element sequences (derived from RepeatMasker, one contig = one element, example below) with STAR, keeping one random alignment for any read that maps to up to 100 locations
- Alignments in each repeat element sequence were then counted with the Bioconductor package Rsamtools, with the following settings:
  - repeat elements were considered to be those whose name doesn't match "^5S|^7S|_n$|rRNA|^tRNA|^U[0-9]|^RNA"
  - only proper read pairs were counted
  - alignments on the forward strand: isFirstMateRead = TRUE, isMinusStrand = TRUE
  - alignments on the reverse strand: isFirstMateRead = TRUE, isMinusStrand = FALSE
- A ratio of sense/antisense reads was then calculated for each repeat element sequence.
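The counting logic in the list above can be sketched in plain Python. This is an illustration only, not the Rsamtools code that was actually run. It assumes the alignments are available as SAM text lines. The function names `count_strands` and `strand_ratio` are hypothetical, and the forward/reverse assignment simply follows the flag settings listed above.

```python
import re
from collections import defaultdict

# Name filter quoted in the post: contigs matching it are NOT treated as repeat elements.
EXCLUDE = re.compile(r"^5S|^7S|_n$|rRNA|^tRNA|^U[0-9]|^RNA")

# Standard SAM flag bits.
PROPER_PAIR = 0x2
UNMAPPED = 0x4
REVERSE = 0x10
FIRST_MATE = 0x40


def count_strands(sam_lines):
    """Return {contig: [forward, reverse]} counted from first mates of proper pairs."""
    counts = defaultdict(lambda: [0, 0])
    for line in sam_lines:
        # Skip blank lines and SAM header lines.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag, contig = int(fields[1]), fields[2]
        # Keep only mapped first mates that belong to a proper pair.
        if flag & UNMAPPED or not flag & PROPER_PAIR or not flag & FIRST_MATE:
            continue
        # Drop contigs excluded by the name filter.
        if EXCLUDE.search(contig):
            continue
        # As in the post: first mate on the minus strand counts as "forward",
        # first mate on the plus strand counts as "reverse".
        index = 0 if flag & REVERSE else 1
        counts[contig][index] += 1
    return dict(counts)


def strand_ratio(forward, reverse):
    """Forward/reverse ratio for one element; None when the denominator is zero."""
    if reverse == 0:
        return None
    return forward / reverse
```

Calling `count_strands` on the lines of a SAM file gives one pair of counts per contig, and `strand_ratio` can then be applied to each pair. Returning `None` for a zero denominator is a choice made for this sketch; elements with very few reads would give unstable ratios either way.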
Does this make sense at all? Is there a better way of doing it?
grep "AluJb" -A 100 all_repeats.hg38.fa | head -20
>AluJb
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNTCCATAAGAATGGAAAGAAAACATGGCCAGGTGCAGTGGC
TCACACCTGTAATCCCACCACTTCAGGAGGCTGAGGCAACATGGCAAAACCTTCTCTTCA
AAAAATTTTTTAAAAGTTAGCTGGATGTTGTGGAGGCAAGAGGATCACTTGAGGATCACT
TGAGTCCATGAGGTCAAGGCTGCAGTGAGTCATGTTTGCACCACTGCACTCTAGCCTAGG
TGACAGAGCTAGTCACTATCAAAAAAAAAAAAAAAAGAATGGAGAGAATGCTACATGAGA
GAAAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNATAGATTTTTTTAAAAAGAAAACTGGCCAGGTACT
GTGGCTTATGTCTGTAATATCAGCATGTTGGGAGGCCAAGGCAGGATTACTTGAGCCCAG
AAATTCCAGACCAGCCTGAGAATTTGGCAAAACTCTGTCTCTACAAAAAATACAAAAATT
AGCCAAGTTTGGTGGCATGTGCCTGTAGTACCAGCTACTTGGGAGGCTGAGGTGGAAGAA
TAGCTTGAGTCTGGGAGGTCAAGGCTGCAATGAGCTGTGATTGCACCACTGCACTCAAGC
CTGGGTGGTAGAGTAAGACCCTGTCTCAAAAAAAAAAAAAAAAAAAGAAAAATCACTAAG
CAAAATAAGACATGTGAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
If no one adds a valid answer in the coming days I will accept my own answer.