Question

Where can I find miRNA primary transcripts and their genomic coordinates?

10.1 years ago

Endre Bakken Stovner ▴ 970

I tried looking in miRBase for human, but the file header specifically states:

# Note, these sequences do not represent the full primary transcript,
# rather a predicted stem-loop portion that includes the precursor
# miRNA.

Looking in GENCODE did not help either. Grepping for miRNA and keeping transcripts longer than 150 bases yielded only 19 results.

grep miRNA gencode.v21.annotation.gff3 | sort -V -k1,1 -k4,4n -k5,5n - | awk '{if ($5-$4>150 && b != $4 && c != $5) {a += 1; b = $4; c = $5; print $0}} END {print a}'
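The same kind of filter can be written with explicit column tests, so that only transcript lines are counted and each coordinate set is counted once. This is a minimal sketch on a made-up GFF3: the file name `toy.gff3`, its records, and the helper name `count_long_mirna` are invented for illustration, and the `transcript_type=miRNA` attribute is assumed to match the layout of the real file.

```shell
# Toy GFF3 (tab-separated); every record here is invented for illustration only.
printf 'chr1\tENSEMBL\ttranscript\t100\t180\t.\t+\t.\tID=t1;transcript_type=miRNA\n'  > toy.gff3
printf 'chr1\tENSEMBL\ttranscript\t500\t900\t.\t+\t.\tID=t2;transcript_type=miRNA\n' >> toy.gff3
printf 'chr1\tENSEMBL\texon\t500\t900\t.\t+\t.\tID=e2;transcript_type=miRNA\n'       >> toy.gff3
printf 'chr2\tHAVANA\ttranscript\t10\t5000\t.\t-\t.\tID=t3;transcript_type=lincRNA\n' >> toy.gff3

# Hypothetical helper: count transcript features annotated as miRNA whose
# end - start exceeds 150, counting each (chromosome, start, end, strand) once.
count_long_mirna() {
    awk -F'\t' '
        $3 == "transcript" && $9 ~ /transcript_type=miRNA/ && ($5 - $4) > 150 {
            key = $1 FS $4 FS $5 FS $7
            if (!(key in seen)) { seen[key] = 1; n++ }
        }
        END { print n + 0 }
    ' "$1"
}

count_long_mirna toy.gff3
```

On the toy file only the second record passes all three tests, so the function prints 1.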

Where can I find primary transcripts for miRNAs (preferably with genomic coordinates, so that I do not need to run Bowtie myself)?

mirbase mirna • 4.1k views

Updated 2.8 years ago by Ram 44k • written 10.1 years ago by Endre Bakken Stovner ▴ 970

Hi,

Why do you want the primary transcript? I don't know if they really exist. The positions in miRBase are those of the direct precursors of the miRNAs, which is normally enough for most analyses. If you tell us your main goal, maybe we can say more. The full primary transcript is normally understood as the region from the TSS of the transcript that will generate the short precursors to its end (I have never heard of any approximation for this). It seems that primary transcripts are quite long and can generate multiple miRNA precursors. Some papers have described the promoters of these full transcripts, but there is no consensus here.

Updated 2.8 years ago by Ram 44k • written 10.1 years ago by Lorena Pantano ▴ 380

10.1 years ago

Martombo ★ 3.1k

There are a few studies on this. See this post. The second paper in my answer there predicts both transcription start and stop sites for a number of miRNAs.

Of course, pri-miRNAs are very difficult to identify because they are very short-lived. With small RNA-seq you normally see only the mature miRNA, not even the pre-miRNA, so most primary transcripts are completely unknown.

Updated 2.8 years ago by Ram 44k • written 10.1 years ago by Martombo ★ 3.1k

Ram · Accepted Answer · 2014-11-03

I made a script that finds likely miRNA primary transcripts in GENCODE data:

# Script that finds likely miRNA primary transcripts in GENCODE data by checking
# which long non-coding RNAs completely contain a miRNA on the same strand.
#
# The output is the list of names of the long noncoding RNA transcripts that are likely to
# be primary miRNA transcripts.
#
# shuf -n 5 results/likely_mirna_primary_transcripts.bed
# MIR1302-9-001
# MIR663AHG-029
# LINC00461-013
# MIR1539-001
# AC024084.1-001
#
# Get the regular and long non-coding RNA GTF files here: http://www.gencodegenes.org/releases/21.html
# and move them into your "data" folder.
# You also need bedtools installed: https://github.com/arq5x/bedtools2
#
# See https://github.com/endrebak/biolo-gists for other useful bioinformatics code snippets.

mkdir -p data temp results

# INPUT FILES
GENCODE_GTF_REGULAR="data/gencode.v21.annotation.gtf"
GENCODE_GTF_LNCRNA="data/gencode.v21.long_noncoding_RNAs.gtf"

# TEMPORARY FILES; NO POINT IN CHANGING NAMES
GENCODE_BED_REGULAR_MIRNA_ONLY="temp/gencode.v21.annotation.bed"
GENCODE_BED_LNCRNA="temp/gencode.v21.long_noncoding_RNAs.bed"

# OUTPUT FILE (ONE TRANSCRIPT NAME PER LINE, DESPITE THE .bed EXTENSION)
GENCODE_BED_LIKELY_PRIMARY_TRANSCRIPTS="results/likely_mirna_primary_transcripts.bed"


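# miRNA-related transcripts from the full annotation, written as BED6.
# The case-insensitive match on "mir" is loose and keeps any line containing it.
# GTF starts are 1-based and BED starts are 0-based, hence $4 - 1.
grep -i mir "${GENCODE_GTF_REGULAR}" | \
awk 'BEGIN {OFS="\t"} $3 == "transcript" {print $1, $4 - 1, $5, $24, 0, $7}' | \
tr -d '";' > "${GENCODE_BED_REGULAR_MIRNA_ONLY}"

# All long non-coding RNA transcripts, written as BED6 ($24 is expected to hold the transcript name).
awk 'BEGIN {OFS="\t"} $3 == "transcript" {print $1, $4 - 1, $5, $24, 0, $7}' "${GENCODE_GTF_LNCRNA}" | \
tr -d '";' > "${GENCODE_BED_LNCRNA}"

# Keep lncRNAs that fully contain a miRNA on the same strand (-f 1 -s).
# Columns 7-12 come from the -b file; the length test drops pairs of identical size.
bedtools intersect -f 1 -s -wb -a "${GENCODE_BED_REGULAR_MIRNA_ONLY}" -b "${GENCODE_BED_LNCRNA}" | \
awk '$3 - $2 != $9 - $8 {print $10}' | sort -V | uniq > "${GENCODE_BED_LIKELY_PRIMARY_TRANSCRIPTS}"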
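
To check the containment logic without the GENCODE files or bedtools, the final step can be reproduced on a few made-up intervals. This is only a sketch: it swaps `bedtools intersect -f 1 -s` for a naive awk loop (fine for toy data, far too slow for a full annotation), and the file names, transcript names, coordinates, and the helper name `likely_primary` are all invented for illustration.

```shell
# Toy BED6 files (tab-separated); every record here is invented for illustration only.
printf 'chr1\t100\t180\tmirA\t0\t+\n'      > toy_mirna.bed
printf 'chr1\t100\t180\tsameSize\t0\t+\n'  > toy_lncrna.bed
printf 'chr1\t50\t900\tlncX-001\t0\t+\n'  >> toy_lncrna.bed
printf 'chr1\t50\t900\tlncY-001\t0\t-\n'  >> toy_lncrna.bed
printf 'chr2\t50\t900\tlncZ-001\t0\t+\n'  >> toy_lncrna.bed

# Hypothetical helper, a naive stand-in for the bedtools step: print the names of
# lncRNA intervals (second file) that fully contain a miRNA interval (first file)
# on the same chromosome and strand, skipping intervals of identical length.
likely_primary() {
    awk -F'\t' '
        NR == FNR {                      # first file: remember every miRNA interval
            chr[FNR] = $1; s[FNR] = $2 + 0; e[FNR] = $3 + 0; str[FNR] = $6
            n = FNR
            next
        }
        {                                # second file: test each lncRNA interval
            for (i = 1; i <= n; i++)
                if ($1 == chr[i] && $6 == str[i] &&
                    ($2 + 0) <= s[i] && ($3 + 0) >= e[i] &&
                    ($3 - $2) != (e[i] - s[i])) {
                    print $4
                    break
                }
        }
    ' "$1" "$2" | sort -u
}

likely_primary toy_mirna.bed toy_lncrna.bed
```

Only lncX-001 is printed: sameSize has the same length as the miRNA, lncY-001 is on the opposite strand, and lncZ-001 is on another chromosome.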