Question

Fasta output from bedtools GetFastaBed

0

Entering edit mode

9.3 years ago

umn_bist ▴ 390

So I downloaded the mm10 repeatmasker track as a BED file from UCSC table browser. I would like to make a pseudogenome fasta using bedtools to align my Chip-seq samples using BWA. This is all to do Chip-seq peak calling, seeing which repetitive sequences were enriched

First few lines of my BED file:

chr1    67108752    67108881    RLTR17B_Mm  239 +
chr1    134217651   134217732   BC1_Mm  230 -
chr1    8386825 8389555 Lx2 8310    -
chr1    16776988    16779051    L1_Mus1 32159   +

I used the 'name' column in the BED file for the FASTA headers in the output FASTA file
I forced strandedness (as there were some repeat elements on the antisense strand),

First few lines of my output fasta:

RLTR17B_Mm  gtgtatggtgtgagtctatgtggtgtgtagtatgtatgtggtatata...(condensed for space)
BC1_Mm  tatctcagtgatcgggttcttgcgtagcgcgcaaagccccaaatttagt...(condensed for space)
Lx2 aaagcagtatatagcaaaccagtagccaacatcaaactaaatggagggaaa...(condensed for space)
L1_Mus1 gtaggatcaatatagtgaaaatggctattttgctgaaagcaatctaca...(condensed for space)
B4  ctcagggagtaaagcaagcttcttgcaagtgttgggaatggagtt...(condensed for space)

First few lines of my mentor's mm10_rmsk.fasta:

>(CAAAAA)n
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGCTATCTAAATGAAaaaaacaaaaacaaaaacaaaaacaaacaatcaaaaatcaaaacaacagaacaaaacaaaacaaaaaAAATGGAAGCACTGANNNNNNNNNNATTGCATATAATCCAaaaaacaaaaac
>IAPLTR3-int
NNNNNNNNNNNNNcatctctccagcccctcttatccctt

My questions are:

Will my fasta output suffice for BWA alignment?
Is my mentor treating split/spliced BED12 entries as distinct BED intervals when computing coverage? I noticed that he used a BED12 interval file to make the fasta and was curious how I could get the interval file (of mm10 repeat sequences) from UCSC browser.

ChIP-Seq BWA • 2.7k views

ADD COMMENT • link updated 9.3 years ago by Charles Plessy ★ 2.9k • written 9.3 years ago by umn_bist ▴ 390

1

Entering edit mode

A couple of comments that do not answer you question directly.

The sequences (fasta) for the repeats can be downloaded directly from UCSC which would save you an extra step. Tools -> Table Browser -> mm10 -> Variation and repeats -> RepeatMasker, then "sequence" as output format.

This is all to do Chip-seq peak calling, seeing which repetitive sequences were enriched

Beware of dragons, that is multimapper quantification. See this and this where the matter of mapping and quantification of repeat elements is discussed (paywall). There are more papers on this out there, but this just came out.

ADD REPLY • link 9.3 years ago by A. Domingues ★ 2.7k

score 1 · Answer 1 · 2016-04-07

1

Entering edit mode

9.3 years ago

Charles Plessy ★ 2.9k

Beware: when indexing reference sequences, BWA replaces Ns by random bases. Therefore, if your reference file contains as many Ns as in the two lines that you showed as example, you may be aligning on mostly random sequences, rather than on the repeat sequences that you intended.

ADD COMMENT • link 9.3 years ago by Charles Plessy ★ 2.9k