Question

length of combined transcripts

2

Entering edit mode

9.7 years ago

bsp017 ▴ 50

Hi all,

Is there a tool to calculate the combined length of transcripts which map to a given interval. For example as a function of bedtools or htseq? I can calculate the number of reads mapping to a particular gene using these tools but not the length of the gene segment which has been covered. What I want to know is, which genes in a gff or bed file have alignments from my sorted bam file covering more than 100 bp of their entire length.

Thanks,
James

RNA-Seq • 2.2k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 9.7 years ago by bsp017 ▴ 50

Ram · Accepted Answer · 2015-11-21

BEDOPS bedmap --bases and bedmap --bases-uniq offer total and distinct counts of bases of map elements (e.g., transcripts or reads) which overlap ("map" to) a reference interval (e.g., a set of genes):

$ bedmap --echo --bases genes.bed reads.bed \
    > total_read_base_count_over_genes.bed
$ bedmap --echo --bases-uniq genes.bed reads.bed \
    > distinct_read_base_count_over_genes.bed

It's not clear which option you want, but the --bases-uniq option would effectively merge the overlapping reads into a contiguous region, and then give the length of the overlapping part of the contiguous region. The --bases option simply gives the total sum of the length of the overlapping part of each overlapping read.

Regardless of which option you pick, if you add --delim '\t', you can easily pipe to awk to quickly filter mapped results, where genes have alignments covering 100 bp or more:

$ bedmap --echo --bases --delim '\t' genes.bed reads.bed \
    | awk '$NF >= 100' - \
    > genes_with_all_reads_a_total_of_100_bases_or_more_overlap.bed

If you need to convert BAM-formatted reads to BED, you could use BEDOPS bam2bed and pipe BED-formatted reads to bedmap:

$ bam2bed < reads.bam \
    | bedmap --echo --bases --delim '\t' genes.bed - \
    | awk '$NF >= 100' - \
    > genes_with_all_reads_a_total_of_100_bases_or_more_overlap.bed