Question

Select set of genes present in all species tested and unique within each genome/transcriptome


9.2 years ago

mforthman ▴ 50

I have a set of transcriptomes (nucleotide sequences) from five non-model species and one genome from a sixth species. For the genome, I have an annotated CDS .fna file with nucleotide and amino acid sequences, none of which are available from any of the major genome/proteome databases. I want to select a set of genes that are putatively orthologous, single-copy (unique within each genome), and present in all six species. In the end, I need nucleotide sequences that meet these criteria.

I have very few bioinformatics skills, and any kind explanation and advice would be much appreciated since I am trying to learn. I thought of a within-species BLAST (alternatively BLAT or Bowtie) to identify unique gene regions for each transcriptome/genome. However, I would expect a very lengthy output from these runs that would include everything (single-copy genes and multiple-copy genes). How could I get an output of only the single-copy genes?
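The filtering step described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the self-vs-self BLAST was run with tabular output (`-outfmt 6`, whose default columns start with query ID, subject ID, percent identity, and alignment length). The function name `single_copy_ids`, the helper `_row`, the gene IDs, and the two thresholds are hypothetical choices for this example and would need tuning for real data.

```python
import csv


def single_copy_ids(blast_tab_lines, min_identity=70.0, min_aln_len=100):
    """Return query IDs with no qualifying hit to any other sequence.

    blast_tab_lines: iterable of lines in BLAST tabular format (-outfmt 6).
    The identity and length thresholds are assumptions, not recommendations.
    """
    all_queries = set()
    has_paralog = set()
    for row in csv.reader(blast_tab_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid = row[0], row[1]
        pident, length = float(row[2]), int(row[3])
        all_queries.add(qseqid)
        # A sufficiently long, similar hit to a *different* sequence marks
        # both partners as multi-copy candidates.
        if qseqid != sseqid and pident >= min_identity and length >= min_aln_len:
            has_paralog.add(qseqid)
            has_paralog.add(sseqid)
    return all_queries - has_paralog


def _row(query, subject, pident, length):
    """Build one 12-column tabular line; the trailing columns are dummy values."""
    return "\t".join([query, subject, str(pident), str(length),
                      "0", "0", "1", str(length), "1", str(length), "1e-50", "500"])


# Made-up self-BLAST results for four hypothetical genes.
demo_lines = [
    _row("geneA", "geneA", 100.0, 900),  # self hit only
    _row("geneB", "geneB", 100.0, 750),
    _row("geneB", "geneC", 92.5, 700),   # geneB and geneC look like paralogs
    _row("geneC", "geneC", 100.0, 720),
    _row("geneC", "geneB", 92.5, 700),
    _row("geneD", "geneD", 100.0, 600),
    _row("geneD", "geneA", 85.0, 40),    # too short to count at the default threshold
]

print(sorted(single_copy_ids(demo_lines)))  # expected: ['geneA', 'geneD']
```

With real data the lines would come from the BLAST output file (for example `open("self_blast.tsv")`, a hypothetical filename). Note that in a transcriptome assembly, isoforms of the same gene would look like duplicates under this filter.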

Once I've managed to do that, I would then need to do an across-species BLAST to determine which genes are present in all species. To begin this process, would I need to use makeblastdb on the set of single-copy genes and then query this database with a .fasta file of all single-copy genes (e.g., with blastn)? As with the previous step, I expect the output to contain a lot of data, both what I want and what I don't want, so how would I deal with this? Would reciprocal best hit with BLAST be a better option to produce what I want?
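For reference, the reciprocal-best-hit idea mentioned above can be sketched as follows. This is a minimal illustration, assuming two BLAST runs (species A against B, and B against A) saved in tabular format (`-outfmt 6`), where the twelfth default column is the bit score. The names `best_hits`, `reciprocal_best_hits`, `_row`, and all sequence IDs are hypothetical.

```python
def best_hits(blast_tab_lines):
    """Map each query ID to its highest-bit-score subject ID."""
    best = {}
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b_lines, b_vs_a_lines):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    a_best = best_hits(a_vs_b_lines)
    b_best = best_hits(b_vs_a_lines)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


def _row(query, subject, bitscore):
    """Build one 12-column tabular line; all columns but IDs and score are dummies."""
    return "\t".join([query, subject, "95.0", "500", "0", "0",
                      "1", "500", "1", "500", "1e-100", str(bitscore)])


# Made-up results for two hypothetical species.
a_vs_b = [_row("a1", "b1", 500), _row("a1", "b2", 300),
          _row("a2", "b2", 400), _row("a3", "b1", 200)]
b_vs_a = [_row("b1", "a1", 500), _row("b2", "a2", 400),
          _row("b2", "a1", 300), _row("b3", "a3", 100)]

print(reciprocal_best_hits(a_vs_b, b_vs_a))  # expected: [('a1', 'b1'), ('a2', 'b2')]
```

For six species, one possible approach (an assumption, not a tested protocol) would be to run this between the genome species and each of the five transcriptomes, then keep only the reference genes that have a reciprocal partner in all five.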

Sorry if this seems way too basic. I admit I do not have a strong or even moderate bioinformatics background, but I am trying to become more experienced.

transcriptome genome exon capture • 2.2k views


score 0 · Answer 1 · 2016-06-03

You probably need something like inParanoid (http://inparanoid.sbc.su.se/cgi-bin/index.cgi). This is a stand-alone tool, which you can obtain by sending the developers an email (there is no download link for it).

You'll need bioinformatics skills to use this tool, so you will either have to learn them yourself or find a collaborator.

score 0 · Answer 2 · 2016-06-05

I've looked at using inParanoid, but it sounds as if it can only find orthologs between two or three transcriptomes/genomes at a time, rather than across all six species I have. Furthermore, it requires protein sequences as input, which I have for only one species; for the rest I would need to translate the transcripts in all six reading frames, pull out the translations that are "more likely" to be correct, and gather the translated amino acid sequences into a separate fasta file for each species.
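To make the six-frame translation step concrete, here is a minimal pure-Python sketch. It assumes the standard genetic code and uses "longest stretch without a stop codon" as a crude stand-in for the "more likely" translation; that heuristic, and the names `translate`, `six_frame_translations`, and `longest_orf_protein`, are assumptions for illustration only.

```python
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate from the first base; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Return translations of three forward and three reverse frames."""
    rc = reverse_complement(seq)
    return [translate(strand[offset:]) for strand in (seq, rc) for offset in range(3)]


def longest_orf_protein(seq):
    """Return the longest stop-free peptide found in any of the six frames."""
    segments = [segment
                for protein in six_frame_translations(seq)
                for segment in protein.split("*")]
    return max(segments, key=len)


demo_seq = "ATGGCCAAATTTGGGTAA"  # made-up 18-nt sequence
print(translate(demo_seq))            # expected: MAKFG*
print(longest_orf_protein(demo_seq))
```

On a sequence this short the longest stop-free stretch can come from the reverse strand rather than the intended frame, which shows why length alone is only a rough guide and why real transcripts would need a more careful choice.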

Maybe HaMStRAD would be better?