Mapping RNA seq reads from bed file to genes
2
0
Entering edit mode
3.7 years ago
jrose ▴ 30

I have a bed file with columns: chrom, read_start, read_end, id, strand

where every row represents a read from an RNA seq data set (e.g. roadmap).

I would like to map every row to a gene (ensemble id/ gene name). I have an annotation file taken from ensembl (ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/) . Is an appropriate strategy to find the rows in the annotation file which intersects the current bed file row, and look for the row in the annotation file where the column type value is "gene"?

RNAseq gene-expression tpm • 2.0k views
ADD COMMENT
2
Entering edit mode
3.7 years ago

To get genes into BED format:

$ wget -qO- ftp://ftp.ensembl.org//pub/release-103/gtf/homo_sapiens/Homo_sapiens.GRCh38.103.gtf.gz | gunzip -c | awk '{ print $0" transcript_id \"\";"}' | gtf2bed - | awk '($8=="gene")' > Homo_sapiens.GRCh38.103.gene.bed

It is necessary to add a blank transcript_id. Unfortunately, Ensembl GTF files do not follow specification, but the awk statement fixes that.

To prep reads:

$ sort-bed reads.bed > reads.sorted.bed

You may want to check that the chromosome field in reads.sorted.bed use the Ensembl chromosome naming scheme, i.e., 1, 2, etc. If your BED file uses the UCSC scheme (chr1, chr2, etc.) then the chr string needs to be removed, e.g. with awk or similar. Chromosome name schemes should match, or mapping will not work.

To map reads to each gene:

$ bedmap --echo --echo-map-id-uniq --skip-unmapped Homo_sapiens.GRCh38.103.gene.bed reads.sorted.bed > reads_mapped_to_genes.bed

To map genes to each read:

$ bedmap --echo --echo-map-id-uniq --skip-unmapped reads.sorted.bed Homo_sapiens.GRCh38.103.gene.bed > genes_mapped_to_reads.bed

Reference:

ADD COMMENT
1
Entering edit mode

Worked great! Much appreciated!

ADD REPLY
0
Entering edit mode

Thanks! Trying this out now!

ADD REPLY
0
Entering edit mode
3.7 years ago
heskett ▴ 110

It seems like you might be trying to reinvent the wheel here since an alignment tool like STAR will do all the counting for you given the raw reads and the gene GTFs. However if youre stuck with a bed file of reads, all you need to do is convert the GTF to bed simply using awk, and then do a bedtools intersect or bedtools map which will do the gene counting for you

ADD COMMENT
0
Entering edit mode

you are correct. I am stuck with the bed files (don't have raw reads due to privacy restrictions).

ADD REPLY

Login before adding your answer.

Traffic: 2832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6