I want to retrieve the location of a feature of a miRNA stored in miRBase in EMBL format. I downloaded miRNA.dat from https://mirbase.org/download/ and parsed it in Python to study the feature locations, but I have trouble interpreting the location format. The following snippet from miRNA.dat shows the first entry.
ID cel-let-7 standard; RNA; CEL; 99 BP.
XX
AC MI0000001;
XX
DE Caenorhabditis elegans let-7 stem-loop
XX
RN [1]
RX PUBMED; 11679671.
RA Lau NC, Lim LP, Weinstein EG, Bartel DP;
RT "An abundant class of tiny RNAs with probable regulatory roles in
RT Caenorhabditis elegans";
RL Science. 294:858-862(2001).
XX
RN [2]
RX PUBMED; 12672692.
RA Lim LP, Lau NC, Weinstein EG, Abdelhakim A, Yekta S, Rhoades MW, Burge CB,
RA Bartel DP;
RT "The microRNAs of Caenorhabditis elegans";
RL Genes Dev. 17:991-1008(2003).
XX
RN [3]
RX PUBMED; 12747828.
RA Ambros V, Lee RC, Lavanway A, Williams PT, Jewell D;
RT "MicroRNAs and other tiny endogenous RNAs in C. elegans";
RL Curr Biol. 13:807-818(2003).
XX
RN [4]
RX PUBMED; 12769849.
RA Grad Y, Aach J, Hayes GD, Reinhart BJ, Church GM, Ruvkun G, Kim J;
RT "Computational and experimental identification of C. elegans microRNAs";
RL Mol Cell. 11:1253-1263(2003).
XX
RN [5]
RX PUBMED; 17174894.
RA Ruby JG, Jan C, Player C, Axtell MJ, Lee W, Nusbaum C, Ge H, Bartel DP;
RT "Large-scale sequencing reveals 21U-RNAs and additional microRNAs and
RT endogenous siRNAs in C. elegans";
RL Cell. 127:1193-1207(2006).
XX
RN [6]
RX PUBMED; 19460142.
RA Kato M, de Lencastre A, Pincus Z, Slack FJ;
RT "Dynamic expression of small non-coding RNAs, including novel microRNAs
RT and piRNAs/21U-RNAs, during Caenorhabditis elegans development";
RL Genome Biol. 10:R54(2009).
XX
RN [7]
RX PUBMED; 20062054.
RA Zisoulis DG, Lovci MT, Wilbert ML, Hutt KR, Liang TY, Pasquinelli AE, Yeo
RA GW;
RT "Comprehensive discovery of endogenous Argonaute binding sites in
RT Caenorhabditis elegans";
RL Nat Struct Mol Biol. 17:173-179(2010).
XX
DR RFAM; RF00027; let-7.
DR WORMBASE; C05G5/12462-12364; .
XX
CC let-7 is found on chromosome X in Caenorhabditis elegans [1] and pairs to
CC sites within the 3' untranslated region (UTR) of target mRNAs, specifying
CC the translational repression of these mRNAs and triggering the transition
CC to late-larval and adult stages [2].
XX
FH Key Location/Qualifiers
FH
FT miRNA 17..38
FT /accession="MIMAT0000001"
FT /product="cel-let-7-5p"
FT /evidence=experimental
FT /experiment="cloned [1-3], Northern [1], PCR [4], 454 [5],
FT Illumina [6], CLIPseq [7]"
FT miRNA 60..81
FT /accession="MIMAT0015091"
FT /product="cel-let-7-3p"
FT /evidence=experimental
FT /experiment="CLIPseq [7]"
XX
SQ Sequence 99 BP; 26 A; 19 C; 24 G; 0 T; 30 other;
uacacugugg auccggugag guaguagguu guauaguuug gaauauuacc accggugaac 60
uaugcaauuu ucuaccuuac cggagacaga acucuucga 99
//
It shows that the first feature has the key "miRNA" at position 17..38, which I interpret as from 17 to 38. This matches what is shown at https://mirbase.org/hairpin/MI0000001
I used the following Python code to retrieve the location information.
from Bio import SeqIO
import pandas as pd

records_data = []
with open('data/miRNA.dat', 'r') as file:
    for record in SeqIO.parse(file, 'embl'):
        record_dict = {
            'Name': record.name,
            'Accession': record.id,
            'Sequence': str(record.seq),  # Convert sequence to string
            'miRNA_1_Location': None,
        }
        # Try to retrieve the 1st feature
        try:
            # Index first, so IndexError is raised before the "exists" message
            feature = record.features[0]
            print("[INFO] 1st feature exists!")
            if feature.type == 'miRNA':
                record_dict['miRNA_1_Product'] = feature.qualifiers.get('product', [''])[0]
                record_dict['miRNA_1_Location'] = str(feature.location)
                record_dict['miRNA_1_Evidence'] = feature.qualifiers.get('evidence', [''])[0]
        except IndexError:
            print("[INFO] 1st feature does not exist!")
        records_data.append(record_dict)

df = pd.DataFrame(records_data)
Running the above code gives the miRNA_1_Location of the first entry (as a string):
[16:38](+)
My question is how to interpret it. The original location stored in miRNA.dat is
17..38
Why does the start position differ by one while the end position is the same? And what does the + sign mean? Thanks.
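A minimal sketch of how the two notations might be related, assuming the string above is a 0-based, half-open interval (like a Python slice) and that (+) denotes the forward strand. The helper biopython_to_embl is hypothetical, written only for this illustration; it uses plain string parsing and is not a Biopython function.

```python
import re


def biopython_to_embl(location: str) -> tuple[str, str]:
    """Convert a string such as '[16:38](+)' into ('17..38', '+').

    Hypothetical helper: assumes a simple, single-span location string.
    """
    match = re.fullmatch(r"\[(\d+):(\d+)\]\(([+-])\)", location)
    if match is None:
        raise ValueError(f"unrecognised location string: {location!r}")
    start = int(match.group(1))  # 0-based, inclusive
    end = int(match.group(2))    # 0-based, exclusive
    strand = match.group(3)
    # 1-based inclusive notation: shift the start up by one, keep the end.
    return f"{start + 1}..{end}", strand


embl_range, strand = biopython_to_embl("[16:38](+)")
print(embl_range, strand)
```

Under these assumptions, only the start shifts by one: an exclusive 0-based end and an inclusive 1-based end happen to be the same number.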
I am clear now. Thanks for your explanation.
I know that many programming languages, including Python, count from 0, perhaps because of the notion of an offset: the 2nd element of an array sits one position after the 1st, so the 1st element is at offset 0 from the beginning (where the array pointer points, maybe) and the 2nd element is at offset 1.
But this is the first time I have learned that "python are in the 0-based, half-open system". Thanks again.
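To make the 0-based, half-open convention concrete, here is a small sketch using plain Python slicing on the sequence of the entry shown above. The variable names are illustrative only; the coordinates 17..38 are those of the first feature.

```python
# Stem-loop sequence of cel-let-7 (MI0000001), copied from the SQ block above.
seq = (
    "uacacugugg" "auccggugag" "guaguagguu" "guauaguuug" "gaauauuacc"
    "accggugaac" "uaugcaauuu" "ucuaccuuac" "cggagacaga" "acucuucga"
)

embl_start, embl_end = 17, 38  # 1-based, both ends included

# 0-based, half-open: shift the start down by one, keep the end as is.
py_start, py_end = embl_start - 1, embl_end
mature = seq[py_start:py_end]

print(len(seq))           # length of the stem-loop
print(mature)             # bases 17..38 of the stem-loop
print(py_end - py_start)  # length of the slice
```

Because the end is exclusive, the length of the slice is simply end minus start (38 - 16 = 22), which equals 38 - 17 + 1 in the inclusive 1-based notation.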