Question

Orthogroups.csv file for orthofinder

2

6.2 years ago

mxlsherry1992 ▴ 80

Dear all,

I am trying to interpret the OrthoFinder output file Orthogroups.csv. I have three input protein FASTA files, and the resulting Orthogroups.csv looks like the example below. The first two species (Clarias, Pan) have no reference genome, so their IDs look like "Trinity_DN_...". Since the IDs have the same format for both species, how can I tell whether a gene comes from the first species (Clarias) or the second species (Pan)?

![enter image description here][1]

RNA-Seq Assembly alignment sequencing • 2.7k views

ADD COMMENT • link updated 6.1 years ago by david_emms ▴ 160 • written 6.2 years ago by mxlsherry1992 ▴ 80

1

If the IDs used in each set are not unique, you will likely run into trouble (I'm already surprised that BLAST did not complain about this). Before running OrthoFinder, it's a good idea to prefix the IDs from each set with a 'code' that indicates the species they come from.
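A minimal Python sketch of that prefixing step, in case it helps. The function name `prefix_fasta_ids`, the underscore separator, and the Trinity-style records are all hypothetical, chosen only for illustration:

```python
def prefix_fasta_ids(lines, code):
    """Return FASTA lines with `code_` prepended to every header ID."""
    out = []
    for line in lines:
        if line.startswith(">"):
            # Header line: insert the species code right after '>'.
            out.append(f">{code}_{line[1:]}")
        else:
            # Sequence line: leave untouched.
            out.append(line)
    return out


# Hypothetical records whose IDs collide between two Trinity assemblies.
clarias = [">TRINITY_DN100_c0_g1_i1", "MKV"]
pan = [">TRINITY_DN100_c0_g1_i1", "MRL"]

print(prefix_fasta_ids(clarias, "Clarias")[0])
print(prefix_fasta_ids(pan, "Pan")[0])
```

After prefixing, the two headers no longer collide, and the species is readable from the ID itself.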

ADD REPLY • link 6.2 years ago by lieven.sterck 15k

1

I think OrthoFinder does the conversion for you before running BLAST. For example, in the WorkingDirectory I got:

$ head SpeciesIDs.txt SequenceIDs.txt
==> SpeciesIDs.txt <==
0: Athaliana.fasta
1: Bdistachyon.fasta
2: Hvulgare.fasta
3: Osativa.fasta
4: Pglaucum.fasta
5: Sbicolor.fasta
6: Sitalica.fasta
7: Zmays.fasta

==> SequenceIDs.txt <==
0_0: AT1G50920.1 | Symbols:  | Nucleolar GTP-binding protein | chr1:18870555-18872570 FORWARD LENGTH=671
0_1: AT1G36960.1 | Symbols:  | unknown protein; BEST Arabidopsis thaliana protein match is: unknown protein (TAIR:AT1G48095.1); Has 54 Blast hits to 54 proteins in 2 species: Archae - 0; Bacteria - 0; Metazoa - 0; Fungi - 0; Plants - 54; Viruses - 0; Other Eukaryotes - 0 (source: NCBI BLink). | chr1:14014796-14015508 FORWARD LENGTH=181
0_2: AT1G44020.1 | Symbols:  | Cysteine/Histidine-rich C1 domain family protein | chr1:16716692-16718656 REVERSE LENGTH=577
0_3: AT1G15970.1 | Symbols:  | DNA glycosylase superfamily protein | chr1:5486544-5488494 REVERSE LENGTH=352
0_4: AT1G73440.1 | Symbols:  | calmodulin-related | chr1:27611418-27612182 FORWARD LENGTH=254
0_5: AT1G75120.1 | Symbols: RRA1 | Nucleotide-diphospho-sugar transferase family protein | chr1:28197022-28198656 REVERSE LENGTH=402
0_6: AT1G17600.1 | Symbols:  | Disease resistance protein (TIR-NBS-LRR class) family | chr1:6053026-6056572 REVERSE LENGTH=1049
0_7: AT1G51380.1 | Symbols:  | DEA(D/H)-box RNA helicase family protein | chr1:19047960-19049967 FORWARD LENGTH=392
0_8: AT1G77370.1 | Symbols:  | Glutaredoxin family protein | chr1:29073916-29074642 FORWARD LENGTH=130
0_9: AT1G44090.1 | Symbols: ATGA20OX5, GA20OX5 | gibberellin 20-oxidase 5 | chr1:16760677-16762486 REVERSE LENGTH=385

$ grep '^>' Species0.fa | head
>0_0
>0_1
>0_2
>0_3
>0_4
>0_5
>0_6
>0_7
>0_8
>0_9
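To go back from an internal ID such as `0_0` to the species and the original header, something like the following Python sketch should work. It assumes only the `key: value` line layout visible in the listing above; the helper names `parse_id_file` and `describe` are made up for this example:

```python
def parse_id_file(lines):
    """Parse 'key: value' lines as seen in SpeciesIDs.txt / SequenceIDs.txt."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        # Split on the first ': ' only; the description may contain more colons.
        key, _, value = line.partition(": ")
        mapping[key] = value
    return mapping


def describe(internal_id, species_ids, sequence_ids):
    """Return (species file, original header) for an internal ID like '0_0'."""
    species_index = internal_id.split("_", 1)[0]
    return species_ids[species_index], sequence_ids[internal_id]


# Lines copied from the listing above.
species_ids = parse_id_file(["0: Athaliana.fasta", "1: Bdistachyon.fasta"])
sequence_ids = parse_id_file(
    ["0_0: AT1G50920.1 | Symbols:  | Nucleolar GTP-binding protein"]
)

print(describe("0_0", species_ids, sequence_ids))
```

The number before the underscore is the species index, so the species can be recovered even when the original IDs look alike.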

ADD REPLY • link 6.2 years ago by AK ★ 2.2k

score 2 · Answer 1 · 2019-05-19

Hi mxlsherry1992,

In newer versions of OrthoFinder (2.3.1, for example), several output files are tab-delimited, and their file endings were changed to .tsv accordingly.

In the output file Orthogroups.tsv, the members of each family that come from different input sequence files are separated by a tab:

               Athaliana                 Hvulgare            Osativa           Pglaucum            Sbicolor  Sitalica            Zmays
    OG0010401  AT1G09410.1, AT1G56690.1  HORVU4Hr1G052340.1  LOC_Os03g20190.1  Pgl_GLEAN_10026176            Seita.9G424600.1.p  Zm00001d028935_P001

With the newer version, the members of your first two species (Clarias, Pan) are separated by a tab and appear in the second and third columns of "Orthogroups.tsv", so you can identify them by selecting the relevant column, regardless of the naming.
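As a rough illustration of selecting a column by species, here is a Python sketch that uses the standard `csv` module with a tab delimiter. The function name `genes_for_species` is hypothetical, and the sample text is a shortened version of the row above (first header cell left blank, as in the excerpt):

```python
import csv
import io

# Shortened sample: orthogroup ID first, then one column per species.
sample = (
    "\tAthaliana\tHvulgare\tSbicolor\n"
    "OG0010401\tAT1G09410.1, AT1G56690.1\tHORVU4Hr1G052340.1\t\n"
)


def genes_for_species(handle, species):
    """Map each orthogroup ID to the list of genes in one species column."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    col = header.index(species)
    result = {}
    for row in reader:
        if not row:
            continue
        # Short rows are treated as having an empty cell for this species.
        cell = row[col] if col < len(row) else ""
        result[row[0]] = [g.strip() for g in cell.split(",") if g.strip()]
    return result


print(genes_for_species(io.StringIO(sample), "Athaliana"))
print(genes_for_species(io.StringIO(sample), "Sbicolor"))
```

For a real run you would pass an open file handle for Orthogroups.tsv instead of the in-memory sample.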

score 1 · Answer 2 · 2019-05-29

1

6.1 years ago

david_emms ▴ 160

Hi

Just following on from what SMK said, the Orthogroups.csv file was also a tab-delimited file. Genes from different species are separated by a tab, and genes within the same species are separated by a comma. If you open it in a spreadsheet program (e.g. Excel, LibreOffice Calc) and choose 'tab' as the delimiter, it will display correctly.
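The same layout can also be read without a spreadsheet. The Python sketch below splits on tabs first and commas second, and records which species column each gene ID appears in. The function name `gene_to_species` and the Trinity-style IDs are invented for illustration:

```python
def gene_to_species(lines):
    """Map each gene ID to the species column(s) it appears in."""
    lines = iter(lines)
    # Header row: first cell is the orthogroup column, the rest are species.
    header = next(lines).rstrip("\r\n").split("\t")
    owner = {}
    for line in lines:
        cells = line.rstrip("\r\n").split("\t")
        # Tabs separate species; commas separate genes within one species.
        for species, cell in zip(header[1:], cells[1:]):
            for gene in cell.split(","):
                gene = gene.strip()
                if gene:
                    owner.setdefault(gene, []).append(species)
    return owner


# Hypothetical table with two Trinity assemblies.
table = [
    "\tClarias\tPan\n",
    "OG0000001\tTRINITY_DN1_c0_g1_i1, TRINITY_DN7_c0_g2_i1\tTRINITY_DN3_c0_g1_i1\n",
]

owner = gene_to_species(table)
print(owner["TRINITY_DN3_c0_g1_i1"])
```

If the same ID occurs in both assemblies, its list will contain both species, so the column position remains the reliable source of truth.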

All the best,
David

ADD COMMENT • link 6.1 years ago by david_emms ▴ 160

0

Thank you! I got it!!!!

ADD REPLY • link 6.1 years ago by mxlsherry1992 ▴ 80