Where can I find miRNA primary transcripts and their genomic coordinates?
10.1 years ago

I tried looking in miRBase for human, but the file header specifically states:

# Note, these sequences do not represent the full primary transcript,
# rather a predicted stem-loop portion that includes the precursor
# miRNA.

Looking in GENCODE did not help either. Grepping for miRNA and keeping transcripts longer than 150 bp yielded only 19 results.

grep miRNA gencode.v21.annotation.gff3 | sort -V -k1,1 -k4,4n -k5,5n - | awk '{if ($5-$4>150 && b != $4 && c != $5) {a += 1; b = $4; c = $5; print $0}} END {print a}'
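For comparison, here is a minimal sketch of the same length filter that matches on the feature type and the attribute column instead of grepping the whole line. It assumes the GFF3 carries a `transcript_type=miRNA` attribute in column 9; the five records below are made up for illustration and are not taken from the real annotation file.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical GFF3 records standing in for gencode.v21.annotation.gff3.
gff3=$(mktemp)
printf 'chr1\tENSEMBL\tgene\t1000\t1200\t.\t+\t.\tID=g1;gene_type=miRNA\n' > "$gff3"
printf 'chr1\tENSEMBL\ttranscript\t1000\t1200\t.\t+\t.\tID=t1;transcript_type=miRNA\n' >> "$gff3"
printf 'chr1\tENSEMBL\texon\t1000\t1200\t.\t+\t.\tID=e1;transcript_type=miRNA\n' >> "$gff3"
printf 'chr2\tENSEMBL\ttranscript\t500\t580\t.\t-\t.\tID=t2;transcript_type=miRNA\n' >> "$gff3"
printf 'chr3\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tID=t3;transcript_type=lincRNA\n' >> "$gff3"

# Keep transcript features annotated as miRNA whose length (end - start + 1,
# because GFF3 coordinates are 1-based and inclusive) exceeds 150.
long_mirna=$(awk -F '\t' '
  $3 == "transcript" && $9 ~ /(^|;)transcript_type=miRNA(;|$)/ && $5 - $4 + 1 > 150
' "$gff3")

printf '%s\n' "$long_mirna"

# Count the non-empty result lines.
long_count=$(printf '%s\n' "$long_mirna" | grep -c . || true)
rm -f "$gff3"
```

Restricting to one feature type means each transcript is counted once, so no deduplication on start and end is needed.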

Where can I find primary transcripts for miRNAs (preferably with genomic coordinates, so that I do not need to run bowtie myself)?

mirbase mirna • 4.1k views

Hi,

Why do you want the primary transcript? I don't know if they really exist. The positions in miRBase are those of the direct precursors of the miRNAs, which is normally enough for most analyses. If you tell us your main goal, maybe we can say more. The full primary transcript is normally understood as the region from the TSS of the transcript that gives rise to the short precursors to its end (I have never heard of any approximation for this). The primary transcript appears to be quite long and can generate multiple miRNA precursors. Some papers have described the promoters of these full transcripts, but there is no consensus here.

10.1 years ago

I made a script that finds likely miRNA primary transcripts in gencode data:

# Script that finds likely miRNA primary transcripts in GENCODE data by checking
# which long non-coding RNAs have a miRNA lying entirely within them on the same strand.
#
# The output is the list of names of the long noncoding RNA transcripts that are likely to
# be primary miRNA transcripts.
#
# shuf -n 5 results/likely_mirna_primary_transcripts.bed
# MIR1302-9-001
# MIR663AHG-029
# LINC00461-013
# MIR1539-001
# AC024084.1-001
#
# Get the regular and long RNA gtf files here: http://www.gencodegenes.org/releases/21.html
# , then move them into your "data" folder.
# You also need bedtools installed: https://github.com/arq5x/bedtools2
#
# See https://github.com/endrebak/biolo-gists for other useful bioinformatics code snippets.

mkdir -p data && mkdir -p temp && mkdir -p results

# INPUT FILES
GENCODE_GTF_REGULAR="data/gencode.v21.annotation.gtf"
GENCODE_GTF_LNCRNA="data/gencode.v21.long_noncoding_RNAs.gtf"

# TEMPORARY FILES; NO POINT IN CHANGING NAMES
GENCODE_BED_REGULAR_MIRNA_ONLY="temp/gencode.v21.annotation.bed"
GENCODE_BED_LNCRNA="temp/gencode.v21.long_noncoding_RNAs.bed"

#OUTPUT_FILE
GENCODE_BED_LIKELY_PRIMARY_TRANSCRIPTS="results/likely_mirna_primary_transcripts.bed"


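# Keep transcript records whose line mentions "mir" (case-insensitive) and write them as
# BED6: chrom, 0-based start, end, transcript name (field 24), score, strand.
grep -i mir "${GENCODE_GTF_REGULAR}" | \
awk 'BEGIN {OFS="\t"} $3 == "transcript" {print $1, $4-1, $5, $24, 0, $7}' | \
tr -d '";' > "${GENCODE_BED_REGULAR_MIRNA_ONLY}"

# Same conversion for every long non-coding RNA transcript.
awk 'BEGIN {OFS="\t"} $3 == "transcript" {print $1, $4-1, $5, $24, 0, $7}' "${GENCODE_GTF_LNCRNA}" | \
tr -d '";' > "${GENCODE_BED_LNCRNA}"

# -f 1: the whole miRNA record must be covered; -s: same strand; -wb: also report the lncRNA.
# Columns 7-12 are the lncRNA; dropping equal-length pairs removes records that matched themselves.
bedtools intersect -f 1 -s -wb -a "${GENCODE_BED_REGULAR_MIRNA_ONLY}" -b "${GENCODE_BED_LNCRNA}" | \
awk '$3-$2 != $9-$8 {print $10}' | sort -V | uniq > "${GENCODE_BED_LIKELY_PRIMARY_TRANSCRIPTS}"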
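
The script above writes only transcript names, while the question asked for genomic coordinates. A minimal sketch of the same containment test that keeps the host coordinates could look like the following. It uses plain awk on two tiny made-up BED6 tables rather than bedtools, so every record and name in it is hypothetical, and the quadratic loop is only suitable for small inputs.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Hypothetical BED6 records: chrom, start, end, name, score, strand.
printf 'chr1\t100\t180\tMIR-A\t0\t+\nchr1\t500\t570\tMIR-B\t0\t-\nchr2\t50\t120\tMIR-C\t0\t+\n' > "$workdir/mirna.bed"
printf 'chr1\t50\t900\tHOST-1\t0\t+\nchr1\t400\t800\tHOST-2\t0\t+\nchr2\t60\t300\tHOST-3\t0\t+\n' > "$workdir/lncrna.bed"

# First file (lncRNAs) is loaded into arrays; each miRNA in the second file is
# then compared against every lncRNA. A host must be on the same chromosome and
# strand, fully contain the miRNA, and be strictly longer than it.
hosts=$(awk 'BEGIN {FS = OFS = "\t"}
  NR == FNR {
    chrom[FNR] = $1; lstart[FNR] = $2 + 0; lend[FNR] = $3 + 0
    lname[FNR] = $4; strand[FNR] = $6; n = FNR
    next
  }
  {
    for (i = 1; i <= n; i++)
      if (chrom[i] == $1 && strand[i] == $6 &&
          lstart[i] <= $2 + 0 && lend[i] >= $3 + 0 &&
          lend[i] - lstart[i] > $3 - $2)
        print chrom[i], lstart[i], lend[i], lname[i], $4, strand[i]
  }
' "$workdir/lncrna.bed" "$workdir/mirna.bed" | sort -u)

# Columns: host chrom, host start, host end, host name, contained miRNA, strand.
printf '%s\n' "$hosts"
rm -rf "$workdir"
```

In this toy input only HOST-1 qualifies: MIR-B is on the opposite strand and MIR-C starts before HOST-3 does.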
10.1 years ago
Martombo ★ 3.1k

There are a few studies on this. See this post. The second paper in my answer there predicts both transcription start and stop sites for a number of miRNAs.

Of course, pri-miRNAs are very difficult to identify, since they are very short-lived. Normally, with small RNA-seq you only see the mature miRNA, not even the pre-miRNA, so most of them are completely unknown.
