Question

how to use a bed file with htseq-count

7

Entering edit mode

11.2 years ago

Ann ★ 2.4k

Hi all,

For paired-end RNA-seq data I've been using featureCounts (subRead suite) because it produces a nice output, treats paired reads as a single fragment for counting, and accepts a very simple annotation file format that is easy to make.

However, when I tried to use it just now to count reads aligned to a draft genome that has more than 10,000+ scaffolds (blueberry) it crashed with a segmentation fault.

Rather than dig through the code to find out what's wrong, I'm taking a second look at htseq-count.

Does anyone know if htseq-count can accept bed files? The documentation doesn't say anything about BED files - just GFF. I found a post on BioStars somewhere that said it could take BED, but the documentation doesn't mention it.

I really, really don't want to have to convert my annotations file into GFF because GFF is so open to interpretation.

All I want is the number of fragments that overlap a set of spans on my genome. For my application, I know the general region of every gene but the fine details of where exons and introns start and end are still up in the air. Which means: I just need to know the number of fragments that map onto simple intervals that are defined by contig name, start, and end position.

Please help?

-Ann

RNA-Seq htseq-count • 11k views

ADD COMMENT • link updated 3.8 years ago by Ram 45k • written 11.2 years ago by Ann ★ 2.4k

1

Entering edit mode

Sorry that there were two annotation-related bugs in the old featureCounts-1.4.4 version that can crash the program. We have fixed the bugs and made a new release (1.4.5) yesterday:

http://sourceforge.net/projects/subread/files/subread-1.4.5/

Can you please try the new version? I hope it can finish the task without errors.

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 11.2 years ago by dbpzdbpz ▴ 220

0

Entering edit mode

OK I'll try it - thanks!!

Update: It's running - no more segfault.

Also I wanted to mention I like the output format. I usually keep different samples in separate BAM files, so for me it's very convenient to use featureCounts because then I can run a simple command like:

featureCounts -F SAF -p -a SAF_for_featureCount.tsv -o counts.tsv *.bam

and get a table of counts w/ columns for each bam file/sample. My next step after that (usually) is to read the table into R, rename the columns (easy), and go from there.

I haven't tried the alignment capabilities of the subread package as yet but I look forward to trying it out.

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 11.2 years ago by Ann ★ 2.4k

Ram · Answer 1 · 2014-06-11

I didn't try to use a BED file, but a GFF file with this structure seems to have worked OK:

I wrote out each locus as a single line of feature type exon and added a gene_id in the extra feature field. I didn't bother with transcript_id, and that doesn't seem to have mattered.

One "gotcha" was that I wrote the data using R, which wrote some values (example 100000) in scientific notation. This led to a number formatting error when I tried to use htseq-count. I then modified my R code after googling how to stop R from writing data in scientific notation.

I added this to my code:

options("scipen"=100, "digits"=4)

The GFF lines that worked looked like this:

scaffold00001    UNCC    exon    488375    489180    .    -    0    gene_id "101_g";
scaffold00096    UNCC    exon    126172    126444    .    +    0    gene_id "10103_g";

Ram · Answer 2 · 2014-06-11

2

Entering edit mode

11.2 years ago

Alex Reynolds 36k

BEDOPS bedmap can count the number of fragments that map onto intervals, using the --count operand.

Most simply, if you have two sorted BED files called intervals.bed and fragments.bed:

$ bedmap --echo --count intervals.bed fragments.bed > answer.bed

The file answer.bed contains each interval and the number of fragments that overlap each interval by one or more bases.

If you need more stringent or fractional overlap criteria, options are available for making adjustments.

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 11.2 years ago by Alex Reynolds 36k

0

Entering edit mode

Hi,

I've got a BAM file containing alignments of paired end reads that I made using tophat. So my "fragments" file is really an enormous BAM file containing RNA-seq read alignments. Based on the bedmap documentation, it doesn't seem that the program would be able to use my BAM files as inputs unless I somehow can convert them to fragments. But since this is RNA-seq data, I have no way to do that because the fragments came from RNA. So unless I'm missing something I don't think this will work.

But thank you! I wasn't familiar with bedmap and this looks like it will be very useful for a lot of applications and problems.

-Ann

ADD REPLY • link 11.2 years ago by Ann ★ 2.4k

0

Entering edit mode

BEDOPS includes bam2bed and gff2bed scripts to convert data sets in BAM and GFF3 to sorted BED, ready for use with bedmap and other tools. If you work with RNA-seq data, you can use the --split option with bam2bed to process reads with N-CIGAR operations, splitting them into separate BED elements. Some filtering might be needed to handle contigs from GFF3 data. See the conversion script section for more detail on these scripts.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 11.2 years ago by Alex Reynolds 36k

0

Entering edit mode

By the way, I noticed you mentioned matrix2png. It's a very useful tool!

ADD REPLY • link 11.2 years ago by Ann ★ 2.4k