So I downloaded the mm10 repeatmasker track as a BED file from UCSC table browser. I would like to make a pseudogenome fasta using bedtools to align my Chip-seq samples using BWA. This is all to do Chip-seq peak calling, seeing which repetitive sequences were enriched
First few lines of my BED file:
chr1 67108752 67108881 RLTR17B_Mm 239 +
chr1 134217651 134217732 BC1_Mm 230 -
chr1 8386825 8389555 Lx2 8310 -
chr1 16776988 16779051 L1_Mus1 32159 +
- I used the 'name' column in the BED file for the FASTA headers in the output FASTA file
- I forced strandedness (as there were some repeat elements on the antisense strand),
First few lines of my output fasta:
RLTR17B_Mm gtgtatggtgtgagtctatgtggtgtgtagtatgtatgtggtatata...(condensed for space)
BC1_Mm tatctcagtgatcgggttcttgcgtagcgcgcaaagccccaaatttagt...(condensed for space)
Lx2 aaagcagtatatagcaaaccagtagccaacatcaaactaaatggagggaaa...(condensed for space)
L1_Mus1 gtaggatcaatatagtgaaaatggctattttgctgaaagcaatctaca...(condensed for space)
B4 ctcagggagtaaagcaagcttcttgcaagtgttgggaatggagtt...(condensed for space)
First few lines of my mentor's mm10_rmsk.fasta:
>(CAAAAA)n
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGCTATCTAAATGAAaaaaacaaaaacaaaaacaaaaacaaacaatcaaaaatcaaaacaacagaacaaaacaaaacaaaaaAAATGGAAGCACTGANNNNNNNNNNATTGCATATAATCCAaaaaacaaaaac
>IAPLTR3-int
NNNNNNNNNNNNNcatctctccagcccctcttatccctt
My questions are:
- Will my fasta output suffice for BWA alignment?
- Is my mentor treating split/spliced BED12 entries as distinct BED intervals when computing coverage? I noticed that he used a BED12 interval file to make the fasta and was curious how I could get the interval file (of mm10 repeat sequences) from UCSC browser.
A couple of comments that do not answer you question directly.
The sequences (fasta) for the repeats can be downloaded directly from UCSC which would save you an extra step. Tools -> Table Browser -> mm10 -> Variation and repeats -> RepeatMasker, then "sequence" as output format.
Beware of dragons, that is multimapper quantification. See this and this where the matter of mapping and quantification of repeat elements is discussed (paywall). There are more papers on this out there, but this just came out.