Question

Clustering RBH hits from multiple outputs

0

Entering edit mode

8.5 years ago

mforthman ▴ 50

EDITED POST: This post has been significantly re-edited from original post, but question remains the same between versions.

I have performed RBH for each of five transcriptomes against a reference genome, producing five separate output files. I figured, for ease, it was best to combine these output files into one using cat command. The output looks like so, e.g.,:

Acur_1001484    1751    OFAS000562-RA-EXON01    467 1   1039    1240    1   467 264 4.00E-62    237 87.75   179 204 1   264 467 1   1240    1039    1.00E-62    237 87.75   179 204
Apil_comp17655_c0_seq1  2446    OFAS000562-RA-EXON01    467 1   288 453 1   289 454 2.00E-66    252 93.98   156 166 1   289 454 1   288 453 1.00E-66    252 93.98   156 166
Atri_comp16713_c0_seq1  2069    OFAS000562-RA-EXON01    467 1   1576    1754    1   467 289 5.00E-67    254 92.18   165 179 1   289 467 1   1754    1576    4.00E-67    254 92.18   165 179
Btri_comp12490_c0_seq1  2547    OFAS000562-RA-EXON01    467 1   2069    2267    1   467 268 2.00E-66    252 89.5    179 200 1   268 467 1   2267    2069    1.00E-66    252 89.5    179 200
Atri_comp10477_c0_seq1  2608    OFAS000604-RA-EXON01    1887    1   4   1159    1   298 1537    0   701 78.5    997 1270    1   298 1537    1   4   1159    0   701 78.54   999 1272
Ctom_1008131    415 OFAS000604-RA-EXON01    1887    1   101 399 1   1260    959 1.00E-81    300 85.06   262 308 1   959 1260    1   399 101 4.00E-81    300 85.06   262 308
Acur_1001075    997 OFAS000604-RA-EXON01    1887    1   469 954 1   756 1258    1.00E-125   448 83.69   431 515 1   247 650 1   16  417 1.00E-121   435 86.54   360 416
Btri_comp18753_c0_seq1  2786    OFAS000604-RA-EXON01    1887    1   420 1181    1   761 1593    1.00E-108   392 77.03   654 849 1   297 650 1   4   336 1.00E-87    324 84.49   305 361
Apil_comp16297_c0_seq1  1235    OFAS000628-RA-EXON01    453 1   182 553 1   372 1   4.00E-66    250 78.93   296 375 1   1   372 1   553 182 4.00E-66    250 78.93   296 375
Atri_comp15731_c0_seq1  657 OFAS000628-RA-EXON01    453 1   131 517 1   387 1   9.00E-55    211 76.73   300 391 1   1   387 1   517 131 2.00E-54    211 76.73   300 391
Apil_comp14961_c1_seq1  1230    OFAS000635-RA-EXON02    212 1   254 452 1   1   198 2.00E-54    211 85.93   171 199 1   1   198 1   254 452 8.00E-55    211 85.93   171 199

For a given reference gene (e.g., OFAS000562-RA-EXON01), I would now like to retrieve the query IDs and the corresponding reference gene ID and copy this data into a new text file named as the reference gene ID (e.g., OFAS000562-RA-EXON01_IDs.txt). Below are four examples of what I'm wanting in terms of output from this approach:

File 1:

OFAS000562-RA-EXON01
Acur_1001484
Apil_comp17655_c0_seq1
Atri_comp16713_c0_seq1
Btri_comp12490_c0_seq1

File 2:

OFAS000604-RA-EXON01
Acur_1001075
Btri_comp18753_c0_seq1
Atri_comp10477_c0_seq1
Ctom_1008131

File 3:

OFAS000628-RA-EXON01
Apil_comp16297_c0_seq1
Atri_comp15731_c0_seq1

File 4:

OFAS000635-RA-EXON02
Apil_comp14961_c1_seq1

I don't have strong bioinformatic/scripting skills, so I appreciate any feedback on how to do this.

Reciprocal best hits blast clustering • 1.9k views

ADD COMMENT • link 8.4 years ago by mforthman ▴ 50

score 0 · Answer 1 · 2016-06-21

0

Entering edit mode

8.4 years ago

tomc ▴ 90

with your first filter as an example

awk 'BEGIN{print "OFAS000562-RA-EXON01"}$3=="OFAS000562-RA-EXON01"{print $1}' < file0 > file1

more generically make your query a variable

gene=OFAS000562-RA-EXON01
awk -vGENE=${gene}" 'BEGIN{print GENE}$3==GENE{print $1}' < file0 >  ${gene}.rbh

You may have a list you want to query

for gene in list; do
    awk -vGENE=${gene}" 'BEGIN{print GENE}$3==GENE{print $1}' < file0 >  ${gene}.rbh
done

ADD COMMENT • link 8.4 years ago by tomc ▴ 90

0

Entering edit mode

If I understand, I would need a list of the reference genes from my RBH output (that is OFAS.....), which I would then need to replace 'list' with the file name?

ADD REPLY • link 8.4 years ago by mforthman ▴ 50

0

Entering edit mode

I tried the first suggestion, and it produces a file with the reference gene ID, but not any of the query gene IDs that correspond to the reference gene ID

I've also tried the last suggestion (list) and I get the following errors:

line 2: unexpected EOF while looking for matching `"' line 4: syntax error: unexpected end of file

My file didn't even have a line 4

ADD REPLY • link 8.4 years ago by mforthman ▴ 50