is 0-based start and 1-based end of bed format not compatible with htseq-count?
2
0
Entering edit mode
5.3 years ago
Zee_S ▴ 60

Dear biostars community,

Do you have any ideas how to "trick" htseq-count to treat a bed file of genomic coordinates as a gtf file of gene annotations so that i can get counts per genomic interval instead of counts per actual gene?

I have tried to edit the bed file to add upto 9 columns (like a standard gtf) but its possible that my feature label (in column 3) and gene_id label (in column9) are incorrect. i just added these labels arbitrarily. with this "fake" gtf file, i use a bam file with alignments for PE RNA-seq reads mapped with STAR.

this is what my "fake" gtf file looks like:

chr1    .       exon    0       10000   .       -       0       gene_id "1";
chr1    .       exon    10000   20000   .       -       0       gene_id "2";
chr1    .       exon    20000   30000   .       -       0       gene_id "3";

htseq count outputs this error and aborts:

Error occured when processing GFF file (line 1 of file *.gtf):
  start too small
  [Exception type: IndexError, raised in _HTSeq.pyx:376]

htseq-count crashes and output this error:

I assume that 0-based start is the problem here. how to edit these coordinates so that the intervals remain the same in terms of the region it spans? should it be 1 based? (i thought until now that the0-based start and 1-based end in bed format was compatible with htseq).

thank you so much for your help!

htseq-count index-error 0-based start bed to gtf • 4.7k views
ADD COMMENT
1
Entering edit mode

It would be a lot easier with featureCounts:

## convert to the SAF format:
awk 'OFS="\t" {print $1"_"$2"_"$3, $1, $2, $3, "."}' in.bed > out.saf

featureCounts -a out.saf -F SAF -o out.countmatrix (list.of.bam.files...)
ADD REPLY
0
Entering edit mode

thank you for the suggestion. since I'm already very familiar with htseq and made the gtf file that it likes, i prefer to continue to use htseq. i would like to know how to get around this error that the program raises- specifically why 0-based start is not compatible? and how to fix it?

ADD REPLY
1
Entering edit mode

GTF is 1-based. Convert 0- to 1-based coordinates. Should do the trick.

ADD REPLY
0
Entering edit mode

Thank you for your help !

ADD REPLY
1
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY
1
Entering edit mode
5.3 years ago
colin.kern ★ 1.1k

I assume that 0-based start is the problem here. how to edit these coordinates so that the intervals remain the same in terms of the region it spans? should it be 1 based? (i thought until now that the0-based start and 1-based end in bed format was compatible with htseq).

The BED format is 0-based, by definition, but you aren't giving it a BED file, you're giving it a GFF file, which by definition is 1-based. You shouldn't produce 0-based GFFs or 1-based BEDs because programs that take these files as input will expect them to conform to the definition of the format. For example, because BAM files are 0-based, when you run htseq-count with a BAM file and a GFF file, it will properly shift everything by 1 bp. If you give it a hand-crafted 0-based GFF file (assuming it doesn't produce the error you're getting) then it would still do this shifting but it would be incorrect and reads that are right on the edge could be counted as outside using a 0-based GFF and inside using a 1-based GFF, or vice versa. This 1 bp difference may not have a huge impact on the gene counts created, but you never know.

To answer you question, all you need to do is add 1 to both start coordinate in you GFF file. You can do this using the awk command, very similar to ATpoint's answer:

awk 'OFS="\t" {print $1, $2, $3, $4+1, $5, $6, $7, $8, $9}' in.gff > out.gff

Edit: Note that the GFF specification does not define whether the end coordinate is inclusive or exclusive. Ensembl's GTF annotations use an inclusive end, and I believe htseq-count expects this, so I haven't added 1 to the end coordinate which acts to convert the exclusive end used by the BED format into an inclusive end.

ADD COMMENT
0
Entering edit mode

Hi Colin, thank you So much for the very thorough explanation. It's very clear.

I just had one question- These cooridnates were made using bedtools windowmaker. and then i copied the start- end coordinates to a text editor and added the other columns (to make it look like a gtf). so in reality these are 0 based coordinates (from the bedtools output). which means that it is actually open-ended. does this mean i also have to add +1 to the end coordinates in order to make it end-inclusive?

since htseq is 0-based, open ended, if I don't add +1 to the end coordinate column here, wouldn't the last base of every window get omitted when htseq assigns the reads from my bam file? Or perhaps I understood wrong.

Thanks again for the very concise explanation.

ADD REPLY
1
Entering edit mode

Why do you say that htseq is 0-based, open ended? Since it takes a GFF as input, I would expect it to follow the GFF format which is 1-based, and because their default arguments are set to work for Ensembl GTF annotations, I would assume they treat the end as inclusive because that's what Ensembl does. I don't know with 100% certainty that's what htseq does, though.

Because moving from BED to GFF goes from 0-based to 1-based, that means all coordinates should get +1 added to them. To convert an end coordinate from exclusive, which is how it is in the BED format, to inclusive, which is what I believe htseq uses, then you would subtract 1 from the end. So because you are adding 1 due to the shift from 0-based to 1-based, and subtracting 1 to convert exclusive to inclusive, the net result is to not change the end coordinate.

ADD REPLY
0
Entering edit mode

got it! thank you so much! your explanations are very helpful for me to understand the logic of many analyses that i do with this program.

ADD REPLY
0
Entering edit mode
5.3 years ago
michael.ante ★ 3.9k

Hi Zee_S,

Depending on your species, you might lose information converting BED to GTF. You don't have the connection between genes and transcripts on BED-level (in case of multiple transcripts per gene). Having an eukaryote species, converting the BED to GTF will result in overlapping features with different id. This would produce a high "ambiguous" count in your htseq-count results (also in featureCounts).

If you have a one-to-one relationship of genes and transcripts, you can use the UCSC admin tools to convert the BED file into a genePred file, which you can convert into a BED file. This is quite useful having a bed12 as a starting point.

Cheers,

Michael

ADD COMMENT

Login before adding your answer.

Traffic: 1732 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6