Question

Downloading Microsyntenic Fasta Sequences with Varying Chromosome Formats


22 months ago

Nicolas • 0

I have been analyzing microsyntenic regions between different species using the OMA Python API (https://github.com/DessimozLab/pyomadb). Now I would like to download the FASTA sequences of these regions with a script, but the chromosome identifiers are formatted differently across species, which complicates the extraction.

Dataframe

For example, for species like Bos taurus, I can find and fetch chromosome 13 from RefSeq without any issues. However, for other species, such as Ailuropoda melanoleuca, the chromosome is represented as an "unplaced genomic scaffold" with the accession number GL192479.1, and the previous approach doesn't work.

I am relatively new to working with this type of data, so I may be overlooking something. If you have any other suggestions or programs to accomplish this task, I would greatly appreciate your input.

Thanks!

microsyntenic-region fasta oma chromosomes • 826 views


score 1 · Answer 1 · 2023-08-07


22 months ago

GenoMax 151k

Have you tried removing the word "scaffold" and using just the accession?

You can use EntrezDirect like this, for example:

$ efetch -db nuccore -id GL192479.1 -seq_start 1792869 -seq_stop 1792900 -format fasta
>GL192479.1:1792869-1792900 Ailuropoda melanoleuca unplaced genomic scaffold scaffold26, whole genome shotgun sequence
TATCCAGCTCACATAGAAGACATTGACTACGA
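
The same call can be scripted from Python. Below is a minimal sketch that assembles the efetch arguments shown above and, optionally, runs them with subprocess. The function names build_efetch_command and fetch_region are hypothetical, and the sketch assumes EntrezDirect is installed and on the PATH.

```python
import subprocess


def build_efetch_command(accession, start, stop):
    """Return the efetch argument list for one region of a nuccore record."""
    return [
        "efetch",
        "-db", "nuccore",
        "-id", accession,
        "-seq_start", str(start),
        "-seq_stop", str(stop),
        "-format", "fasta",
    ]


def fetch_region(accession, start, stop):
    """Run efetch and return the FASTA text (assumes efetch is on the PATH)."""
    result = subprocess.run(
        build_efetch_command(accession, start, stop),
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError if efetch fails
    )
    return result.stdout
```

Calling fetch_region("GL192479.1", 1792869, 1792900) should return the same FASTA record as the command above, as long as the accession exists in nuccore.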



The thing is that in some cases the value in the chromosome column is just "15", not an accession number, and in others there is an accession without the word "scaffold". So I guess I will need an if statement with a regex to distinguish the cases and treat each one differently.
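
That kind of branching could look roughly like the sketch below. It is only an illustration: the function name classify_chromosome and both patterns are assumptions, the accession pattern is a loose approximation of GenBank/RefSeq-style identifiers (letters, optional underscore, digits, optional version), and the real values in the dataframe may need other cases.

```python
import re

# Loose pattern for GenBank/RefSeq-style accessions such as GL192479.1.
# This is an approximation, not an official accession grammar.
ACCESSION_RE = re.compile(r"\b([A-Z]{1,6}_?\d{5,}(?:\.\d+)?)\b")

# Plain chromosome names such as "15", "X" or "chr15".
CHROM_RE = re.compile(r"^(?:chr)?(\d+|[A-Za-z]{1,2})$")


def classify_chromosome(value):
    """Return (kind, identifier) for one value of the chromosome column."""
    text = str(value).strip()

    # An accession anywhere in the string wins, so surrounding words
    # like "scaffold" are ignored.
    accession = ACCESSION_RE.search(text)
    if accession:
        return ("accession", accession.group(1))

    # Otherwise the whole value must look like a bare chromosome name.
    chromosome = CHROM_RE.match(text)
    if chromosome:
        return ("chromosome", chromosome.group(1))

    # Anything else is left for manual inspection.
    return ("unknown", text)
```

Values classified as "accession" can be passed to efetch directly, while "chromosome" values still need to be mapped to the accession of that chromosome in the relevant assembly.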

Thanks!

ADD REPLY • link 22 months ago by Nicolas • 0

score 1 · Answer 2 · 2023-08-07

Hi Nicolas,

On the OMA Browser you can also download all the CDS and protein sequences, either as a single FASTA file or via the API. You can load the sequences for a specific protein with c.proteins[<id>]. If you also need the intergenic sequences, I think GenoMax's approach might be a solution, but because the data in OMA originates from many different sources, it might not always be possible to use EntrezDirect.

Best wishes, Adrian