Question

Is there a command to annotate genes from a txt file?

0

2.5 years ago

mt_pereira • 0

Hello!

I am following the scanpy pipeline for scRNA-seq quality control analysis: https://github.com/mousepixels/sanbomics_scripts/blob/main/Scanpy_intro_pp_clustering_markers.ipynb

For annotating the group of mitochondrial genes, in the pipeline they use the command:

adata.var['mt'] = adata.var_names.str.startswith('MT-')

where adata.var_names contains the gene names. However, in my dataset the mitochondrial genes do not share a starting pattern such as 'MT-'. I have them all in a txt file that looks like this:

mt-nd3
mt-nd4
mt-nd4l
mt-nd5
mt-nd6
NC_002333.1
NC_002333.10
NC_002333.11
NC_002333.12

My idea was to load the txt in the notebook:

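with open('michondrialgenesDR11.txt', 'r') as x:
    # one gene name per line; strip newlines and skip blank lines
    mitogenes = [line.strip() for line in x if line.strip()]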

and then assign these specific genes to adata.var['mt']. However, since they do not share a common prefix, I am not sure how to flag them all in that column.

Can anyone help? Thanks a lot.

EDIT - code for when the mitochondrial genes all start with 'MT-', annotating this group of genes as 'mt':

adata = sc.read_10x_mtx(
    'tutorial_sample/outs/filtered_feature_bc_matrix/',
    var_names='gene_symbols',
    cache=True)
adata.var['mt'] = adata.var_names.str.startswith('MT-')

The problem is for when not all the mitochondrial genes have the same beginning.

In the pipeline, adata.var_names holds the gene names of a count matrix that has been loaded into an AnnData object, and the mitochondrial ones are selected with .str.startswith('MT-'). In my case, adata.var['mt'] has to mark all the genes that are listed in the txt file mentioned above.

Thank you!

scRNA python • 1.6k views


1

Please use the 10101 edit button to format relevant text as code in future. I have done it for you this time.

2.5 years ago by GenoMax 152k

1

Take a look at https://stackoverflow.com/questions/20461847/str-startswith-with-a-list-of-strings-to-test-for. startswith accepts a tuple of strings, so you can test for several prefixes at once. You could try:

adata.var['mt'] = adata.var_names.str.startswith(('MT-', 'NC'))
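Two things worth keeping in mind with the prefix approach: startswith is case-sensitive, so 'MT-' will not match names like mt-nd3, and a short prefix such as 'NC' may also catch non-mitochondrial entries. Since the names are already in a file, an exact membership test may be simpler. Below is a minimal sketch, assuming the file holds one gene name per line. Here var_names is only a stand-in for adata.var_names, and both the gene names and the file contents are made up for illustration.

```python
import pandas as pd

# Stand-in for adata.var_names (a pandas Index of gene names); values are hypothetical.
var_names = pd.Index(["mt-nd3", "mt-nd4l", "NC_002333.1", "actb1", "MT-CO1"])

# Hypothetical file contents; in practice this text would come from the txt file.
file_text = "mt-nd3\nmt-nd4l\n\nNC_002333.1\n"
mitogenes = [line.strip() for line in file_text.splitlines() if line.strip()]

# Option 1: exact match against the list (case-sensitive, no prefix guessing).
is_mt_exact = var_names.isin(mitogenes)

# Option 2: case-insensitive prefix match using Python's built-in str.startswith,
# which accepts a tuple of prefixes.
prefixes = ("mt-", "nc_002333")
is_mt_prefix = [name.lower().startswith(prefixes) for name in var_names]
```

With a real AnnData object, the result of option 1 could presumably be assigned directly, e.g. adata.var['mt'] = adata.var_names.isin(mitogenes). Note that option 2 also flags MT-CO1 in this toy data because of the lowercasing, while option 1 only flags names that appear verbatim in the list.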

2.5 years ago by GenoMax 152k

0

cross-posted: https://stackoverflow.com/questions/75172257/

2.5 years ago by Pierre Lindenbaum 166k

score 1 · Answer 1 · 2023-01-19

The most generic way would be to get the GTF file that this dataset's annotation is based on and then filter for genes located on the mitochondrial chromosome. The mt- (or uppercase MT-) prefix is convenient but not systematic: gene names can be replaced by gene IDs (ENS...) from Ensembl, and then you cannot derive anything from prefixes. Also note that in your example the names use lowercase mt-, but you are using an uppercase query string. startswith is case-sensitive, so that query will not match them.
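To illustrate the GTF route, here is a rough sketch rather than a definitive implementation. It assumes the mitochondrial contig is called "MT" (RefSeq-based annotations use an accession such as NC_002333.x instead), that gene records carry gene_id and optionally gene_name attributes, and that the count matrix uses those same labels. The function name mito_genes_from_gtf and the example records are hypothetical.

```python
import re

def mito_genes_from_gtf(lines, mito_contig="MT"):
    """Collect gene names (or gene IDs when no name is given) on one contig."""
    genes = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        # GTF has 9 tab-separated columns: contig is column 1, feature type column 3.
        if len(fields) < 9 or fields[0] != mito_contig or fields[2] != "gene":
            continue
        # Column 9 holds attributes written as: key "value"; key "value";
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        label = attrs.get("gene_name") or attrs.get("gene_id")
        if label:
            genes.add(label)
    return genes

# Made-up GTF records for illustration only.
example_gtf = [
    "#!genome-build example\n",
    'MT\tensembl\tgene\t1\t100\t.\t+\t.\tgene_id "ENSG_A"; gene_name "mt-nd3";\n',
    'MT\tensembl\tgene\t200\t300\t.\t+\t.\tgene_id "ENSG_B";\n',
    '1\tensembl\tgene\t1\t100\t.\t+\t.\tgene_id "ENSG_C"; gene_name "actb1";\n',
    'MT\tensembl\texon\t1\t100\t.\t+\t.\tgene_id "ENSG_A"; gene_name "mt-nd3";\n',
]
mito = mito_genes_from_gtf(example_gtf)
```

For a real file, the lines could come from open(path) (or gzip.open(path, "rt") for a compressed GTF), and the resulting set could then be used as adata.var['mt'] = adata.var_names.isin(mito). Falling back to gene_id when gene_name is missing mirrors how unnamed genes often end up labelled by ID in the count matrix, but that is worth checking against your own data.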