Question

Convert bed format RNA-seq data to bam format.

10.4 years ago

biomedicineman1 ▴ 30

I'm learning RNA-seq data analysis using publicly available data from the Epigenome Roadmap Project.

A few lines of the data look like this:

chr1    24291630    24291704    HS10_98:8:1101:12145:76943#A07825-1.L    0    +
chr1    24291790    24291865    HS10_98:8:1101:12145:76943#A07825-1.R    0    -
chr1    26227179    26227254    HS10_98:8:1101:13224:12145#A07825-1.L    0    +
chr1    26227324    26227400    HS10_98:8:1101:13224:12145#A07825-1.R    0    -

Questions:

What do the "L" and "R" here refer to? The left/right parts of a single read spanning an exon-intron junction? Or the left and right reads of a paired-end read?
How can I convert such bed format into bam? The Epigenome Roadmap doesn't provide sra for this dataset.

thx!

RNA-Seq • 4.9k views

updated 3.1 years ago by Ram 44k • written 10.4 years ago by biomedicineman1 ▴ 30

Ram · Answer 1 · 2014-06-19

What do the "L" and "R" here refer to? The left/right parts of a single read spanning an exon-intron junction? Or the left and right reads of a paired-end read?

I'd infer that those are indeed the "left" and "right" reads of a paired-end read: the BED name field looks like a flowcell coordinate, and each coordinate is shared by exactly two reads in your example, one on the + strand and one on the - strand.
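
One way to check that inference on the whole file is to group records by name with the trailing ".L"/".R" removed and see whether every group holds exactly one of each. A minimal sketch, assuming the suffix is always the text after the last dot; the function name `group_mates` is made up for illustration:

```python
from collections import defaultdict

def group_mates(bed_lines):
    """Group BED records by read name with the trailing .L/.R suffix removed."""
    groups = defaultdict(dict)
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        chrom, start, end, name, _score, strand = fields[:6]
        # Assumption: the mate label is whatever follows the last '.' in the name.
        base, _, side = name.rpartition(".")
        groups[base][side] = (chrom, int(start), int(end), strand)
    return dict(groups)

bed_lines = [
    "chr1\t24291630\t24291704\tHS10_98:8:1101:12145:76943#A07825-1.L\t0\t+",
    "chr1\t24291790\t24291865\tHS10_98:8:1101:12145:76943#A07825-1.R\t0\t-",
    "chr1\t26227179\t26227254\tHS10_98:8:1101:13224:12145#A07825-1.L\t0\t+",
    "chr1\t26227324\t26227400\tHS10_98:8:1101:13224:12145#A07825-1.R\t0\t-",
]

pairs = group_mates(bed_lines)
for base, mates in pairs.items():
    # Print each fragment with its mate labels and their strands.
    print(base, {side: rec[3] for side, rec in sorted(mates.items())})
```

If every group comes out with one L on one strand and one R on the other, a few hundred bases apart, that is what you would expect from paired-end mates rather than split single reads.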
How can I convert such bed format into fastq? The Epigenome Roadmap doesn't provide sra for this dataset.

If you truly wanted FASTQ and not FASTA, and the only source you have for the data is this BED file, then you would have to fake the quality scores. But you could construct the rest of the FASTQ like this:
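- For the first line of each FASTQ read, use the fourth column of the BED file, prefixed with '@'.
- For the second line of each FASTQ read, you would need to extract the portion of the reference genome given by the first three columns of the BED file. BED starts are 0-based and ends are exclusive, so for the first line of your BED you would want bases 24,291,631 through 24,291,704 of chromosome 1 (1-based, inclusive), which is 74 bases. For reads on the '-' strand, reverse-complement the extracted sequence. Keep in mind that this gives you the reference sequence, not the bases that were actually sequenced.
- For the third line of each FASTQ read, just put a '+' (optionally followed by the read name again).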
- For the fourth line, you would need to create fake quality scores, the number of which would correspond to the number of bases you extracted from the reference genome for that read.
The sequence extraction step might be easier with the bedtools getfasta tool (its -s option reverse-complements features on the '-' strand).
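
The four steps above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the reference is already loaded into a dict of chromosome name to sequence (here a toy `genome`, not real hg19), the BED has at least six columns, and the constant quality character "I" is an arbitrary placeholder. The function name `bed_to_fastq_record` is hypothetical.

```python
# Translation table for complementing bases (case preserved, N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def bed_to_fastq_record(bed_line, genome, fake_qual="I"):
    """Build one FASTQ record from a 6-column BED line and a reference dict."""
    chrom, start, end, name, _score, strand = bed_line.split()[:6]
    # BED is 0-based with an exclusive end, which matches Python slicing.
    seq = genome[chrom][int(start):int(end)]
    if strand == "-":
        # Minus-strand reads are the reverse complement of the reference.
        seq = seq.translate(COMPLEMENT)[::-1]
    # Quality scores are fake: one placeholder character per base.
    return f"@{name}\n{seq}\n+\n{fake_qual * len(seq)}\n"

# Toy reference, purely for illustration.
genome = {"chrT": "ACGTACGTAAGG"}

left = bed_to_fastq_record("chrT\t2\t8\tread1.L\t0\t+", genome)
right = bed_to_fastq_record("chrT\t4\t10\tread1.R\t0\t-", genome)
print(left + right, end="")
```

For a real genome you would load the chromosome sequences from the reference FASTA first, and you would probably write the .L and .R records to two separate files so that downstream tools can treat them as pairs.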

EDIT: The subject asks for conversion to BAM format, but the question body asks for conversion to FASTQ.

To convert to BAM, there's a tool suite called bedtools, which has a tool, bedtobam, that should do the job for you if you supply a genome file (a tab-delimited list of chromosome names and sizes) with -g.
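
As a rough sketch of how that could be scripted: the genome file can be derived from the first two columns of a FASTA index (.fai), and the conversion is then a single `bedtools bedtobam -i ... -g ...` call writing BAM to stdout. The helper names and file paths below are hypothetical, and `run_bedtobam` assumes bedtools is installed and on your PATH.

```python
import subprocess

def fai_to_genome(fai_text):
    """Turn FASTA index (.fai) text into bedtools genome-file text (name<TAB>size)."""
    rows = []
    for line in fai_text.splitlines():
        if line.strip():
            # The first two .fai columns are sequence name and length.
            name, length = line.split("\t")[:2]
            rows.append(f"{name}\t{length}")
    return "\n".join(rows) + "\n"

def bedtobam_command(bed_path, genome_path):
    """Argument list for bedtools bedtobam; the BAM is written to stdout."""
    return ["bedtools", "bedtobam", "-i", bed_path, "-g", genome_path]

def run_bedtobam(bed_path, genome_path, bam_path):
    """Run the conversion and capture stdout into bam_path (needs bedtools installed)."""
    with open(bam_path, "wb") as out:
        subprocess.run(bedtobam_command(bed_path, genome_path), stdout=out, check=True)

# Example with made-up paths; nothing is executed here.
print(" ".join(bedtobam_command("reads.bed", "hg19.genome")))
```

Calling `run_bedtobam("reads.bed", "hg19.genome", "reads.bam")` would then produce the BAM. As far as I know, the resulting records carry the coordinates, names and strands from the BED but no read sequences or qualities, so check whether that is enough for your downstream tools.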

HTH!