Why do you want the primary transcript? I don't know if they really exist. The positions in miRBase are those of the direct precursors of the miRNAs. Normally this is enough for the majority of analyses. If you tell us your main goal, maybe we can tell you more. Normally the full primary transcript is understood as the region from the TSS of the transcript that will generate the short precursors to its end (I have never heard of any approximation for this). It seems that the primary transcript is quite long and can generate multiple miRNA precursors, and some papers have described the promoters of these full transcripts, but there is no consensus here.
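If the precursor positions are all you need, a minimal sketch of pulling them out of a miRBase-style GFF3 could look like the following. It assumes the hairpin precursors carry the feature type `miRNA_primary_transcript` (which, despite the name, marks the stem-loop and not the full primary transcript) and that column 9 contains a `Name=` attribute. The file content, names and coordinates below are made up for illustration; check the layout of your own file first.

```shell
# Toy GFF3 in the miRBase layout (made-up names and coordinates).
gff=$(mktemp)
printf 'chr1\t.\tmiRNA_primary_transcript\t1001\t1080\t.\t+\t.\tID=MI_TOY1;Name=toy-mir-1\n'  > "$gff"
printf 'chr1\t.\tmiRNA\t1010\t1031\t.\t+\t.\tID=MIMAT_TOY1;Name=toy-miR-1-5p;Derives_from=MI_TOY1\n' >> "$gff"

# Keep only the hairpin precursor rows and print them as BED6.
# GFF3 is 1-based inclusive, BED is 0-based half-open, hence start - 1.
precursors=$(awk -F'\t' 'BEGIN { OFS = "\t" }
    $3 == "miRNA_primary_transcript" {
        name = $9
        sub(/.*Name=/, "", name)   # drop everything up to the Name attribute
        sub(/;.*/, "", name)       # drop any attributes after it
        print $1, $4 - 1, $5, name, 0, $7
    }' "$gff")

echo "$precursors"
rm -f "$gff"
```

The mature miRNA row is skipped because its feature type is `miRNA`, so only the precursor interval is reported.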
I made a script that finds likely miRNA primary transcripts in gencode data:
# Script that finds likely miRNA primary transcripts in GENCODE data by checking
# which long non-coding RNAs completely contain a miRNA on the same strand.
#
# The output is the list of names of the long noncoding RNA transcripts that are likely to
# be primary miRNA transcripts.
#
# shuf -n 5 results/likely_mirna_primary_transcripts.bed
# MIR1302-9-001
# MIR663AHG-029
# LINC00461-013
# MIR1539-001
# AC024084.1-001
#
# Download the regular and long non-coding RNA GTF files from http://www.gencodegenes.org/releases/21.html
# and move them into your "data" folder.
# You also need bedtools installed: https://github.com/arq5x/bedtools2
#
# See https://github.com/endrebak/biolo-gists for other useful bioinformatics code snippets.
mkdir -p data && mkdir -p temp && mkdir -p results
# INPUT FILES
GENCODE_GTF_REGULAR="data/gencode.v21.annotation.gtf"
GENCODE_GTF_LNCRNA="data/gencode.v21.long_noncoding_RNAs.gtf"
# TEMPORARY FILES; NO POINT IN CHANGING NAMES
GENCODE_BED_REGULAR_MIRNA_ONLY="temp/gencode.v21.annotation.bed"
GENCODE_BED_LNCRNA="temp/gencode.v21.long_noncoding_RNAs.bed"
# OUTPUT FILE
GENCODE_BED_LIKELY_PRIMARY_TRANSCRIPTS="results/likely_mirna_primary_transcripts.bed"
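# Convert miRNA transcripts from GTF (1-based start) to BED6 (0-based start, hence $4 - 1).
# Column 24 is expected to hold the transcript_name value in the GENCODE v21 GTF layout.
grep -i mir "${GENCODE_GTF_REGULAR}" | \
awk '{OFS="\t"} {if ($3 == "transcript") {print $1, $4 - 1, $5, $24, 0, $7}}' - | \
tr -d '";' > "${GENCODE_BED_REGULAR_MIRNA_ONLY}"
# Same conversion for the long non-coding RNA transcripts.
awk '{OFS="\t"} {if ($3 == "transcript") {print $1, $4 - 1, $5, $24, 0, $7}}' "${GENCODE_GTF_LNCRNA}" | \
tr -d '";' > "${GENCODE_BED_LNCRNA}"
# Keep lncRNAs that fully contain a miRNA (-f 1) on the same strand (-s) and differ from it in length.
bedtools intersect -f 1 -s -wb -a "${GENCODE_BED_REGULAR_MIRNA_ONLY}" -b "${GENCODE_BED_LNCRNA}" | \
awk '{if ($3-$2 != $9-$8) {print $10}}' - | sort -V | uniq > "${GENCODE_BED_LIKELY_PRIMARY_TRANSCRIPTS}"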
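To see what the intersect step selects without installing bedtools, here is a rough sketch of the same criterion in plain awk on toy BED6 data. It is not a replacement for `bedtools intersect`, only an illustration of the logic: same chromosome, same strand, miRNA fully inside the lncRNA, and the lncRNA longer than the miRNA. All names and coordinates are made up.

```shell
# Toy BED6 intervals (made-up names and coordinates, for illustration only).
mirna_bed=$(mktemp)
lncrna_bed=$(mktemp)
printf 'chr1\t100\t180\ttoy-mir-1\t0\t+\n'  > "$mirna_bed"
printf 'chr1\t500\t570\ttoy-mir-2\t0\t-\n' >> "$mirna_bed"
printf 'chr1\t50\t900\ttoy-host-001\t0\t+\n'  > "$lncrna_bed"
printf 'chr1\t100\t180\ttoy-self-001\t0\t+\n' >> "$lncrna_bed"

# First pass (NR == FNR) stores the miRNA intervals. The second pass prints each
# lncRNA that shares chromosome and strand with a miRNA, contains it completely,
# and is strictly longer than it.
likely_hosts=$(awk -F'\t' '
    NR == FNR {
        chrom[FNR] = $1; mstart[FNR] = $2 + 0; mend[FNR] = $3 + 0
        strand[FNR] = $6; n = FNR
        next
    }
    {
        for (i = 1; i <= n; i++) {
            if ($1 == chrom[i] && $6 == strand[i] &&
                $2 + 0 <= mstart[i] && $3 + 0 >= mend[i] &&
                ($3 - $2) > (mend[i] - mstart[i])) {
                print $4
                break
            }
        }
    }' "$mirna_bed" "$lncrna_bed" | sort -u)

echo "$likely_hosts"
rm -f "$mirna_bed" "$lncrna_bed"
```

Here `toy-self-001` is dropped because it has exactly the same extent as `toy-mir-1` (the transcript is the miRNA annotation itself), and `toy-mir-2` finds no host because the only overlapping lncRNA is on the opposite strand.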
There are a few studies on this. See this post. The second paper in my answer contains predictions of both transcription start and stop sites for a number of miRNAs.
Of course, pri-miRNAs are very difficult to identify, since they are very short-lived. Normally with small RNA-seq you only see the mature miRNA, not even the pre-miRNA, so most pri-miRNAs are completely unknown.