Extending a discussion: https://groups.google.com/forum/#!msg/humann-users/0rbswpcxL1M/4mZNbNd8DAAJ
Based on the conversation above, this is what appears to be going on (please correct where necessary):
Reads would be mapped to genes in the following files in the [name]_temp/ directory:
1a. [name]_diamond_aligned.tsv
NS500647:186:HV3F5BGX2:1:11101:17415:10831:N:0:CTCAGA gi|400294433|ref|NZ_ALJK01000240.1|:c8692-7928|1655|g__Actinomyces.s__Actinomyces_naeslundii|UniRef90_J2ZLR9|UniRef50_R5IQW2|765 99.18032786885246 122.0 0
NS500647:186:HV3F5BGX2:1:11101:20301:11561:N:0:CTCAGA gi|288801553|ref|NZ_GG740010.1|:c46051-44267|28132|g__Prevotella.s__Prevotella_melaninogenica|UniRef90_D9RTD5|UniRef50_R5FHH6|1785 93.27731092436974 119.0 0
NS500647:186:HV3F5BGX2:1:11101:21205:85401:N:0:CTCAGA gi|512460964|ref|NZ_KE150253.1|:c257004-255034|45242|g__Capnocytophaga.s__Capnocytophaga_granulosa|UniRef90_J5Y4E1|UniRef50_F8EB68|1971 95.86206896551724 145.0
1b. [name]_bowtie2_aligned.tsv
NS500647:186:HV3F5BGX2:1:11101:17082:23151:N:0:CTCAGA|146 UniRef90_D1BPC2|753 75.0 40 10 0 6 125 212 251 1.5e-11 70.9
NS500647:186:HV3F5BGX2:1:11101:17082:23151:N:0:CTCAGA|146 UniRef90_K0Y074|828 73.2 41 11 0 3 125 236 276 2.5e-11 70.1
NS500647:186:HV3F5BGX2:1:11101:17082:23151:N:0:CTCAGA|146 UniRef90_E4L934|762 75.0 40 10 0 3 122 214 253 2.5e-11 70.1
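To make the first step concrete, here is a minimal Python sketch of how the read, gene, and (where present) taxon could be pulled out of records like the ones above. The function name `parse_hit` and both regular expressions are my own assumptions, inferred only from the example lines in this post; this is not how HUMAnN2 itself parses these files.

```python
import re

# Patterns inferred from the example records in this post (an assumption,
# not taken from the HUMAnN2 source code).
UNIREF90 = re.compile(r"UniRef90_[A-Za-z0-9]+")
TAXON = re.compile(r"g__[^|]+\.s__[^|]+")


def parse_hit(line):
    """Return (read_id, uniref90_id, taxon) for one alignment record.

    uniref90_id or taxon is None when the reference field does not contain it.
    """
    fields = line.split()
    read_id, reference = fields[0], fields[1]
    gene = UNIREF90.search(reference)
    taxon = TAXON.search(reference)
    return (
        read_id,
        gene.group(0) if gene else None,
        taxon.group(0) if taxon else None,
    )


# One record of each kind, copied from the excerpts above.
with_taxonomy = (
    "NS500647:186:HV3F5BGX2:1:11101:17415:10831:N:0:CTCAGA "
    "gi|400294433|ref|NZ_ALJK01000240.1|:c8692-7928|1655|"
    "g__Actinomyces.s__Actinomyces_naeslundii|UniRef90_J2ZLR9|UniRef50_R5IQW2|765 "
    "99.18032786885246 122.0 0"
)
without_taxonomy = (
    "NS500647:186:HV3F5BGX2:1:11101:17082:23151:N:0:CTCAGA|146 "
    "UniRef90_D1BPC2|753 75.0 40 10 0 6 125 212 251 1.5e-11 70.9"
)

for record in (with_taxonomy, without_taxonomy):
    print(parse_hit(record))
```

The sketch splits on any whitespace because the excerpts here were pasted that way; the real files may be strictly tab-delimited.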
These genes/proteins/protein clusters (from [name]_bowtie2_aligned.tsv, without taxonomy information, and from [name]_diamond_aligned.tsv, with taxonomy information) would be associated with reactions via the following file:
2a. humann2/data/pathways/metacyc_reactions_level4ec_only.uniref.bz2
GLUCOSE-1-PHOSPHATE-PHOSPHODISMUTASE-RXN 2.7.1.41 UniRef50_G6EMD2 UniRef50_Q48UI5 UniRef50_T0UKK6 UniRef90_F4BRY1 UniRef90_G4L3Q0 UniRef90_G6EMD2 UniRef90_K0J4V1 UniRef90_R9TSN7 UniRef90_T0T682 UniRef90_T0UKK6
The reactions from that file would then be associated with pathway identifiers via this file:
humann2/data/pathways/metacyc_pathways
PWY-2681 RXN-4303 RXN-4304 RXN-4310 RXN-4305 RXN-4306 RXN-4312 RXN-4308 RXN-4314 RXN-4307 RXN-4313 RXN-4317
PWY1G-126 1.8.1.15-RXN RXN1G-6
METHGLYUT-PWY 1.1.1.283-RXN LACTALDDEHYDROG-RXN L-LACTDEHYDROGFMN-RXN RXN0-4281 RXN-8632 GLYOXIII-RXN GLYOXI-RXN GLYOXII-RXN DLACTDEHYDROGFAD-RXN
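The gene -> reaction -> pathway chain described above can be sketched as two lookup tables. This is only my reading of the two file layouts from the excerpts (reaction ID, EC number, then UniRef IDs; pathway ID, then reaction IDs); the function names are hypothetical and none of this is taken from the HUMAnN2 code. For the real compressed reactions file, the lines could presumably be read with Python's `bz2.open(path, "rt")`.

```python
from collections import defaultdict


def load_gene_to_reactions(lines):
    """Map each UniRef ID to the set of reactions that list it."""
    gene_to_rxns = defaultdict(set)
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        reaction = fields[0]
        # Keep only UniRef IDs, which skips the EC number column.
        for field in fields[1:]:
            if field.startswith("UniRef"):
                gene_to_rxns[field].add(reaction)
    return gene_to_rxns


def load_reaction_to_pathways(lines):
    """Map each reaction ID to the set of pathways that list it."""
    rxn_to_pwys = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        pathway = fields[0]
        for reaction in fields[1:]:
            rxn_to_pwys[reaction].add(pathway)
    return rxn_to_pwys


def pathways_for_gene(gene, gene_to_rxns, rxn_to_pwys):
    """Follow gene -> reactions -> pathways; empty set if nothing matches."""
    pathways = set()
    for reaction in gene_to_rxns.get(gene, ()):
        pathways |= rxn_to_pwys.get(reaction, set())
    return pathways


# Excerpts copied from this post (the real files are much larger).
reaction_lines = [
    "GLUCOSE-1-PHOSPHATE-PHOSPHODISMUTASE-RXN 2.7.1.41 UniRef50_G6EMD2 "
    "UniRef50_Q48UI5 UniRef50_T0UKK6 UniRef90_F4BRY1 UniRef90_G4L3Q0 "
    "UniRef90_G6EMD2 UniRef90_K0J4V1 UniRef90_R9TSN7 UniRef90_T0T682 UniRef90_T0UKK6",
]
pathway_lines = [
    "PWY1G-126 1.8.1.15-RXN RXN1G-6",
    "METHGLYUT-PWY 1.1.1.283-RXN LACTALDDEHYDROG-RXN L-LACTDEHYDROGFMN-RXN "
    "RXN0-4281 RXN-8632 GLYOXIII-RXN GLYOXI-RXN GLYOXII-RXN DLACTDEHYDROGFAD-RXN",
]

gene_to_rxns = load_gene_to_reactions(reaction_lines)
rxn_to_pwys = load_reaction_to_pathways(pathway_lines)
print(gene_to_rxns["UniRef90_G6EMD2"])
print(rxn_to_pwys["GLYOXI-RXN"])
print(pathways_for_gene("UniRef90_G6EMD2", gene_to_rxns, rxn_to_pwys))
```

With only these excerpts, the example gene reaches a reaction but no pathway, because its reaction does not appear in the pathway lines quoted here.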
Is the above pipeline correct or am I missing details?
My specific questions about edge cases:
(i) PWY-5030: L-histidine degradation III|g__Streptococcus.s__Streptococcus_sanguinis
Would this entry in the HUMAnN2 pathway abundance profile be built from all of the genes, across all of the pathway's reactions, found in [name]_diamond_aligned.tsv, since there is taxonomy associated with the identifier?
(ii) PWY-2942: L-lysine biosynthesis III
Would this one be built from all of the genes in [name]_bowtie2_aligned.tsv, since there is no taxonomy information in the identifier? And is the taxonomy missing because the reads were identified only to an orthologous group?
(iii) UNINTEGRATED|g__Streptococcus.s__Streptococcus_sanguinis
For this one, I'm not sure where the read -> organism mapping file is located. Where should I look?
(iv) Many of the UniRef50_XYZ/UniRef90_XYZ identifiers are not in the humann2/data/pathways/metacyc_reactions_level4ec_only.uniref.bz2 file. How should these be handled? Is there another file where I should be looking for this information?
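For reference, this is roughly how I am checking which identifiers are missing: a small Python sketch that splits the UniRef IDs seen in the alignments into those listed in the reactions file and those that are not. The function names are my own, and the sketch says nothing about how HUMAnN2 actually treats the unlisted IDs; it only reproduces the observation.

```python
def annotated_genes(reaction_lines):
    """Collect every UniRef ID that appears on any reaction line."""
    genes = set()
    for line in reaction_lines:
        genes.update(f for f in line.split() if f.startswith("UniRef"))
    return genes


def split_by_reaction_coverage(observed, annotated):
    """Return (listed, unlisted) subsets of the observed UniRef IDs."""
    observed = set(observed)
    return observed & annotated, observed - annotated


# Reaction line and gene IDs copied from the excerpts earlier in this post.
reaction_lines = [
    "GLUCOSE-1-PHOSPHATE-PHOSPHODISMUTASE-RXN 2.7.1.41 UniRef50_G6EMD2 "
    "UniRef50_Q48UI5 UniRef50_T0UKK6 UniRef90_F4BRY1 UniRef90_G4L3Q0 "
    "UniRef90_G6EMD2 UniRef90_K0J4V1 UniRef90_R9TSN7 UniRef90_T0T682 UniRef90_T0UKK6",
]
observed = ["UniRef90_D1BPC2", "UniRef90_K0Y074", "UniRef90_G6EMD2"]

listed, unlisted = split_by_reaction_coverage(observed, annotated_genes(reaction_lines))
print("listed in reactions file:", sorted(listed))
print("not listed:", sorted(unlisted))
```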