Convert bed format RNA-seq data to bam format.
10.4 years ago

I'm learning RNA-seq data analysis using publicly available data from the Epigenome Roadmap Project.

A few lines of the data look like this:

chr1    24291630    24291704    HS10_98:8:1101:12145:76943#A07825-1.L    0    +
chr1    24291790    24291865    HS10_98:8:1101:12145:76943#A07825-1.R    0    -
chr1    26227179    26227254    HS10_98:8:1101:13224:12145#A07825-1.L    0    +
chr1    26227324    26227400    HS10_98:8:1101:13224:12145#A07825-1.R    0    -

Questions:

  1. What do the "L" and "R" here refer to? Left/right part of the single reads spanning exon-intron junction? Or Left reads and Right reads of the paired-end reads?
  2. How can I convert such bed format into bam? The Epigenome Roadmap doesn't provide sra for this dataset.

thx!

10.4 years ago
Dan D 7.4k
  1. What do the "L" and "R" here refer to? Left/right part of the single reads spanning exon-intron junction? Or Left reads and Right reads of the paired-end reads?

    I'm going to infer that indeed those are the "left" and "right" paired-end reads, given that the BED name entry seems to indicate a flowcell coordinate, and said coordinate is shared within sets of two reads in your example.

  2. How can I convert such bed format into fastq? The Epigenome Roadmap doesn't provide sra for this dataset.

    If you truly wanted FASTQ and not FASTA, and the only source you have for the data is this BED file, then you would have to fake the quality scores. But you could construct the rest of the FASTQ like this:

    • For the first line of each FASTQ read, use the fourth column of the BED file, prefixed with '@'.
    • For the second line of each FASTQ read, you would need to extract the portion of the reference genome given by the first three columns of the BED file. BED coordinates are 0-based with an exclusive end, so for the first line of your BED you would want the sequence from base 24,291,631 through base 24,291,704 on chromosome 1 (1-based, inclusive). For reads on the '-' strand (sixth column), take the reverse complement of that sequence.
    • For the third line of each FASTQ read, just put a '+' (optionally followed by the same read name).
    • For the fourth line, you would need to create fake quality scores, the number of which would correspond to the number of bases you extracted from the reference genome for that read.

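    This might be made easier with the bedtools getfasta tool, which pulls the sequence for each BED interval out of a reference FASTA file; its -s option reverse-complements intervals on the minus strand.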
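If you'd rather script it, the steps above can be sketched in plain Python. This is only a minimal sketch under some assumptions: the function names are made up for illustration, the reference is assumed to be already loaded as a dict of chromosome name to sequence string, and every base gets the same fake quality character ('I', i.e. Phred 40 in the Phred+33 encoding). Keep in mind the output is reference sequence, not the original reads, so any mismatches the real reads carried are gone.

```python
# Hypothetical helpers for illustration; not part of bedtools or any library.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def bed_line_to_fastq(line, reference, fake_qual="I"):
    """Build one 4-line FASTQ record from a 6-column BED line.

    `reference` is ASSUMED to be a dict {chromosome: sequence string}.
    Quality scores are faked: one `fake_qual` character per base.
    """
    chrom, start, end, name, _score, strand = line.split()[:6]
    # BED is 0-based, half-open, which matches Python slicing directly.
    seq = reference[chrom][int(start):int(end)]
    if strand == "-":
        seq = reverse_complement(seq)
    return f"@{name}\n{seq}\n+\n{fake_qual * len(seq)}\n"

# Toy reference and BED line, just to show the shape of the output.
toy_reference = {"chr1": "ACGTACGTAC"}
record = bed_line_to_fastq("chr1\t1\t5\tread1.L\t0\t-", toy_reference)
print(record, end="")
```

For a real genome you would want to load the reference with an indexed FASTA reader rather than holding plain strings in memory.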

EDIT: The subject asks for conversion to BAM format, but the question body asks for conversion to FASTQ.

To convert to BAM, there's a tool suite called bedtools, which has a tool, bedtobam, that should do the job for you if you supply a genome file (a two-column list of chromosome names and sizes) via its -g option.

HTH!


Thanks. I also think it's paired-end. But the problem is, I guess BedToBam is for single reads; how can I convert the "L" and "R" entries here into BAM format for paired-end reads? thx


Hi,

I'm running into the same problem. How did you end up dealing with this: 'Thanks. I also think it's paired-end. But problem is, I guess BedToBam is for single reads; how can I convert "L" and "R" here into bam format for paired-end reads? thx'

Thanks...


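In the BAM format, read1 vs read2 is specified by bits in the FLAG field: 0x40 marks the first read of a pair and 0x80 the second, with 0x1 set on both to say the read is paired. Check out the SAM specification, section 1.4.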

The way I would tackle this would be to use a BAM library like pysam. Libraries like these make it easy to write data directly to BAM, though it would also be possible to write to SAM without using a focused library. In that case you'd just want to make sure you're adhering to the spec so that downstream processes don't choke on the file you've created.
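To make the flag handling concrete, here is a rough sketch that writes the alignment lines of a SAM file by hand for one L/R pair, without pysam. Several things here are assumptions on my part, not facts about this dataset: that ".L" is read 1 and ".R" is read 2, that both mates are on the same chromosome and count as a proper pair, that each read is one ungapped match (so the CIGAR is just the interval length plus "M"), and that MAPQ is unknown (255). Sequence and qualities are left as "*". All function names are hypothetical.

```python
# SAM FLAG bits (from the SAM specification).
PAIRED, PROPER_PAIR = 0x1, 0x2
REVERSE, MATE_REVERSE = 0x10, 0x20
READ1, READ2 = 0x40, 0x80

def parse_bed6(line):
    """Parse one 6-column BED line into a small dict."""
    chrom, start, end, name, _score, strand = line.split()[:6]
    return {"chrom": chrom, "start": int(start), "end": int(end),
            "name": name, "strand": strand}

def sam_flag(read, mate, is_first):
    """Combine the pairing and strand bits for one mate."""
    flag = PAIRED | PROPER_PAIR | (READ1 if is_first else READ2)
    if read["strand"] == "-":
        flag |= REVERSE
    if mate["strand"] == "-":
        flag |= MATE_REVERSE
    return flag

def pair_to_sam(left_line, right_line):
    """Return two SAM alignment lines for an L/R pair of BED lines.

    ASSUMES .L is read 1, .R is read 2, same chromosome, ungapped reads.
    """
    left, right = parse_bed6(left_line), parse_bed6(right_line)
    # Mates must share one QNAME, so drop the ".L"/".R" suffix.
    qname = left["name"].rsplit(".", 1)[0]
    span = max(left["end"], right["end"]) - min(left["start"], right["start"])
    left_is_leftmost = left["start"] <= right["start"]
    lines = []
    for read, mate, is_first in ((left, right, True), (right, left, False)):
        # TLEN is positive for the leftmost mate, negative for the other.
        tlen = span if (is_first == left_is_leftmost) else -span
        fields = [
            qname,
            sam_flag(read, mate, is_first),
            read["chrom"],
            read["start"] + 1,              # SAM POS is 1-based
            255,                            # MAPQ unavailable
            f"{read['end'] - read['start']}M",
            "=",                            # mate on the same chromosome
            mate["start"] + 1,
            tlen,
            "*",                            # SEQ not stored
            "*",                            # QUAL not stored
        ]
        lines.append("\t".join(map(str, fields)))
    return lines

sam_lines = pair_to_sam(
    "chr1\t24291630\t24291704\tHS10_98:8:1101:12145:76943#A07825-1.L\t0\t+",
    "chr1\t24291790\t24291865\tHS10_98:8:1101:12145:76943#A07825-1.R\t0\t-",
)
print("\n".join(sam_lines))
```

You would still need a header with an @SQ line per chromosome in front of these records before the file can be converted to BAM, and the two mates of a pair have to be matched up by name first if they aren't adjacent in the BED file.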
