Reading location of a feature of a miRNA entry in miRBase in EMBL format
25 days ago
Coleman • 0

I want to retrieve the location of a feature of a miRNA stored in miRBase in EMBL format. I downloaded miRNA.dat from https://mirbase.org/download/ and loaded it in Python to study the feature locations, but I have some problems with the location format. The following snippet from miRNA.dat shows the first entry.

ID   cel-let-7         standard; RNA; CEL; 99 BP.
XX
AC   MI0000001;
XX
DE   Caenorhabditis elegans let-7 stem-loop
XX
RN   [1]
RX   PUBMED; 11679671.
RA   Lau NC, Lim LP, Weinstein EG, Bartel DP;
RT   "An abundant class of tiny RNAs with probable regulatory roles in
RT   Caenorhabditis elegans";
RL   Science. 294:858-862(2001).
XX
RN   [2]
RX   PUBMED; 12672692.
RA   Lim LP, Lau NC, Weinstein EG, Abdelhakim A, Yekta S, Rhoades MW, Burge CB,
RA   Bartel DP;
RT   "The microRNAs of Caenorhabditis elegans";
RL   Genes Dev. 17:991-1008(2003).
XX
RN   [3]
RX   PUBMED; 12747828.
RA   Ambros V, Lee RC, Lavanway A, Williams PT, Jewell D;
RT   "MicroRNAs and other tiny endogenous RNAs in C. elegans";
RL   Curr Biol. 13:807-818(2003).
XX
RN   [4]
RX   PUBMED; 12769849.
RA   Grad Y, Aach J, Hayes GD, Reinhart BJ, Church GM, Ruvkun G, Kim J;
RT   "Computational and experimental identification of C. elegans microRNAs";
RL   Mol Cell. 11:1253-1263(2003).
XX
RN   [5]
RX   PUBMED; 17174894.
RA   Ruby JG, Jan C, Player C, Axtell MJ, Lee W, Nusbaum C, Ge H, Bartel DP;
RT   "Large-scale sequencing reveals 21U-RNAs and additional microRNAs and
RT   endogenous siRNAs in C. elegans";
RL   Cell. 127:1193-1207(2006).
XX
RN   [6]
RX   PUBMED; 19460142.
RA   Kato M, de Lencastre A, Pincus Z, Slack FJ;
RT   "Dynamic expression of small non-coding RNAs, including novel microRNAs
RT   and piRNAs/21U-RNAs, during Caenorhabditis elegans development";
RL   Genome Biol. 10:R54(2009).
XX
RN   [7]
RX   PUBMED; 20062054.
RA   Zisoulis DG, Lovci MT, Wilbert ML, Hutt KR, Liang TY, Pasquinelli AE, Yeo
RA   GW;
RT   "Comprehensive discovery of endogenous Argonaute binding sites in
RT   Caenorhabditis elegans";
RL   Nat Struct Mol Biol. 17:173-179(2010).
XX
DR   RFAM; RF00027; let-7.
DR   WORMBASE; C05G5/12462-12364; .
XX
CC   let-7 is found on chromosome X in Caenorhabditis elegans [1] and pairs to
CC   sites within the 3' untranslated region (UTR) of target mRNAs, specifying
CC   the translational repression of these mRNAs and triggering the transition
CC   to late-larval and adult stages [2].
XX
FH   Key             Location/Qualifiers
FH
FT   miRNA           17..38
FT                   /accession="MIMAT0000001"
FT                   /product="cel-let-7-5p"
FT                   /evidence=experimental
FT                   /experiment="cloned [1-3], Northern [1], PCR [4], 454 [5],
FT                   Illumina [6], CLIPseq [7]"
FT   miRNA           60..81
FT                   /accession="MIMAT0015091"
FT                   /product="cel-let-7-3p"
FT                   /evidence=experimental
FT                   /experiment="CLIPseq [7]"
XX
SQ   Sequence 99 BP; 26 A; 19 C; 24 G; 0 T; 30 other;
     uacacugugg auccggugag guaguagguu guauaguuug gaauauuacc accggugaac        60
     uaugcaauuu ucuaccuuac cggagacaga acucuucga                               99
//

It shows that the first feature has the key "miRNA" at location 17..38 (which I interpret as from 17 to 38). This matches what is shown at https://mirbase.org/hairpin/MI0000001

I used the following Python code to retrieve the location information.

from Bio import SeqIO
import pandas as pd

records_data = []
with open('data/miRNA.dat') as file:
    for record in SeqIO.parse(file, 'embl'):
        record_dict = {
            'Name': record.name,
            'Accession': record.id,
            'Sequence': str(record.seq),  # Convert sequence to string
            'miRNA_1_Location': None,
        }
        # Try to retrieve the 1st feature
        try:
            feature = record.features[0]  # Raises IndexError if the record has no features
        except IndexError:
            print("[INFO] 1st feature does not exist!")
        else:
            print("[INFO] 1st feature exists!")
            if feature.type == 'miRNA':
                record_dict['miRNA_1_Product'] = feature.qualifiers.get('product', [''])[0]
                record_dict['miRNA_1_Location'] = str(feature.location)
                record_dict['miRNA_1_Evidence'] = feature.qualifiers.get('evidence', [''])[0]
        records_data.append(record_dict)
df = pd.DataFrame(records_data)

By running the above code, I can obtain the miRNA_1_Location of the first entry (in a string format) as

[16:38](+)

My question is, how to interpret it? The original location stored in the miRNA.dat is

17..38

Why does the starting position differ by one, while the ending position is the same? And what is the meaning of that + sign? Thanks

format miRBase EMBL • 345 views
24 days ago

Biopython, like Python slicing in general, uses the 0-based, half-open coordinate system. This means that the first base in the sequence is counted as 0 (0-based), and that the end coordinate is the base after the last base of the feature (half-open).

Like so:


uacacugugg
   ----
   |   |
0123456789

position = [3:7]

Other places you'll see this coordinate system include BED files and most programming languages.
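As a quick illustration, here is the [3:7] example from the diagram above written as an ordinary Python string slice. The variable names are just for this sketch.

```python
# The ten-base example sequence from the diagram above.
seq = "uacacugugg"

# 0-based, half-open: index 3 is included, index 7 is not.
sub = seq[3:7]

print(sub)       # acug
print(len(sub))  # 4, which is simply end - start
```

One convenient property of this system is that the length of a feature is always end minus start, with no "+ 1" needed.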

Alternatively, you could think of it as which bonds between bases you'd need to break to extract the sequence (counting posts rather than fences, if you know that analogy).

u-a-c-a-c-u-g-u-g-g-
      -------
     |       |
 1 2 3 4 5 6 7 8 9 0

EMBL on the other hand, uses the 1-based closed system. Here bases are counted from one, and the last base is included. Like so:

uacacugugg
   ----
   |  |
1234567890

Or, counting bases, rather than bonds:

u-a-c-a-c-u-g-u-g-g-
      -------
      |     |
1 2 3 4 5 6 7 8 9 0

Other places you'll see this system include GFF files, and R.
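To connect the two systems: going from EMBL's 1-based closed coordinates to a Python slice means subtracting one from the start and leaving the end alone, which is why 17..38 turns into [16:38]. Below is a minimal sketch using the cel-let-7 hairpin sequence from the question; `embl_to_slice` is a hypothetical helper name, not a Biopython function.

```python
def embl_to_slice(start_1based, end_1based):
    """Convert a 1-based closed range (EMBL/GFF style) to a 0-based half-open one."""
    return start_1based - 1, end_1based


# cel-let-7 stem-loop (MI0000001), copied from the SQ block in the question.
hairpin = (
    "uacacugugg" "auccggugag" "guaguagguu" "guauaguuug" "gaauauuacc"
    "accggugaac" "uaugcaauuu" "ucuaccuuac" "cggagacaga" "acucuucga"
)

start, end = embl_to_slice(17, 38)
mature_5p = hairpin[start:end]

print(start, end)      # 16 38
print(mature_5p)       # ugagguaguagguuguauaguu
print(len(mature_5p))  # 22, the same as 38 - 17 + 1 in the 1-based closed system
```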

The (+) indicates which strand of the sequence the feature is on. In this case that doesn't mean much, since the sequence is a single-stranded RNA, but Biopython doesn't know that.
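If you have already stored the location as a string such as [16:38](+), you can turn it back into EMBL-style coordinates with a little parsing. This is only a sketch: `location_string_to_embl` is a made-up helper, and it assumes the simple "[start:end](strand)" form shown in the question, so fuzzy or compound (joined) locations are not handled.

```python
import re

# Matches simple location strings like "[16:38](+)", "[16:38](-)" or "[16:38]".
LOCATION_PATTERN = re.compile(r"\[(\d+):(\d+)\](?:\(([+-])\))?")


def location_string_to_embl(text):
    """Return (EMBL-style 'start..end' string, strand symbol or None)."""
    match = LOCATION_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"Unsupported location string: {text!r}")
    start0, end0 = int(match.group(1)), int(match.group(2))
    strand = match.group(3)
    # 0-based half-open -> 1-based closed: add one to the start, keep the end.
    return f"{start0 + 1}..{end0}", strand


embl_range, strand = location_string_to_embl("[16:38](+)")
print(embl_range)  # 17..38
print(strand)      # +
```

In practice it is usually cleaner to keep the numeric start and end from the location object rather than its string form, and only convert to 1-based when writing output for people or for 1-based formats.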


I am clear now. Thanks for your explanation.

I know many programming languages, including Python, count from 0, perhaps because of the notion of an offset: the 2nd element of an array is one position after the 1st. Hence the 1st element is at offset 0 from the beginning (where the array pointer points, perhaps), and the 2nd element is at offset 1.

But this is the first time I have come across the "0-based, half-open system". Thanks again.
