Clustering RBH hits from multiple outputs
1
0
Entering edit mode
8.5 years ago
mforthman ▴ 50

EDITED POST: This post has been significantly re-edited from original post, but question remains the same between versions.

I have performed RBH for each of five transcriptomes against a reference genome, producing five separate output files. I figured, for ease, it was best to combine these output files into one using cat command. The output looks like so, e.g.,:

Acur_1001484    1751    OFAS000562-RA-EXON01    467 1   1039    1240    1   467 264 4.00E-62    237 87.75   179 204 1   264 467 1   1240    1039    1.00E-62    237 87.75   179 204
Apil_comp17655_c0_seq1  2446    OFAS000562-RA-EXON01    467 1   288 453 1   289 454 2.00E-66    252 93.98   156 166 1   289 454 1   288 453 1.00E-66    252 93.98   156 166
Atri_comp16713_c0_seq1  2069    OFAS000562-RA-EXON01    467 1   1576    1754    1   467 289 5.00E-67    254 92.18   165 179 1   289 467 1   1754    1576    4.00E-67    254 92.18   165 179
Btri_comp12490_c0_seq1  2547    OFAS000562-RA-EXON01    467 1   2069    2267    1   467 268 2.00E-66    252 89.5    179 200 1   268 467 1   2267    2069    1.00E-66    252 89.5    179 200
Atri_comp10477_c0_seq1  2608    OFAS000604-RA-EXON01    1887    1   4   1159    1   298 1537    0   701 78.5    997 1270    1   298 1537    1   4   1159    0   701 78.54   999 1272
Ctom_1008131    415 OFAS000604-RA-EXON01    1887    1   101 399 1   1260    959 1.00E-81    300 85.06   262 308 1   959 1260    1   399 101 4.00E-81    300 85.06   262 308
Acur_1001075    997 OFAS000604-RA-EXON01    1887    1   469 954 1   756 1258    1.00E-125   448 83.69   431 515 1   247 650 1   16  417 1.00E-121   435 86.54   360 416
Btri_comp18753_c0_seq1  2786    OFAS000604-RA-EXON01    1887    1   420 1181    1   761 1593    1.00E-108   392 77.03   654 849 1   297 650 1   4   336 1.00E-87    324 84.49   305 361
Apil_comp16297_c0_seq1  1235    OFAS000628-RA-EXON01    453 1   182 553 1   372 1   4.00E-66    250 78.93   296 375 1   1   372 1   553 182 4.00E-66    250 78.93   296 375
Atri_comp15731_c0_seq1  657 OFAS000628-RA-EXON01    453 1   131 517 1   387 1   9.00E-55    211 76.73   300 391 1   1   387 1   517 131 2.00E-54    211 76.73   300 391
Apil_comp14961_c1_seq1  1230    OFAS000635-RA-EXON02    212 1   254 452 1   1   198 2.00E-54    211 85.93   171 199 1   1   198 1   254 452 8.00E-55    211 85.93   171 199

For a given reference gene (e.g., OFAS000562-RA-EXON01), I would now like to retrieve the query IDs and the corresponding reference gene ID and copy this data into a new text file named as the reference gene ID (e.g., OFAS000562-RA-EXON01_IDs.txt). Below are four examples of what I'm wanting in terms of output from this approach:

File 1:

OFAS000562-RA-EXON01
Acur_1001484
Apil_comp17655_c0_seq1
Atri_comp16713_c0_seq1
Btri_comp12490_c0_seq1

File 2:

OFAS000604-RA-EXON01
Acur_1001075
Btri_comp18753_c0_seq1
Atri_comp10477_c0_seq1
Ctom_1008131

File 3:

OFAS000628-RA-EXON01
Apil_comp16297_c0_seq1
Atri_comp15731_c0_seq1

File 4:

OFAS000635-RA-EXON02
Apil_comp14961_c1_seq1

I don't have strong bioinformatic/scripting skills, so I appreciate any feedback on how to do this.

Reciprocal best hits blast clustering • 2.0k views
ADD COMMENT
0
Entering edit mode
8.5 years ago
tomc ▴ 90

with your first filter as an example

awk 'BEGIN{print "OFAS000562-RA-EXON01"}$3=="OFAS000562-RA-EXON01"{print $1}' < file0 > file1

more generically make your query a variable

gene=OFAS000562-RA-EXON01
awk -vGENE=${gene}" 'BEGIN{print GENE}$3==GENE{print $1}' < file0 >  ${gene}.rbh

You may have a list you want to query

for gene in list; do
    awk -vGENE=${gene}" 'BEGIN{print GENE}$3==GENE{print $1}' < file0 >  ${gene}.rbh
done
ADD COMMENT
0
Entering edit mode

If I understand, I would need a list of the reference genes from my RBH output (that is OFAS.....), which I would then need to replace 'list' with the file name?

ADD REPLY
0
Entering edit mode

I tried the first suggestion, and it produces a file with the reference gene ID, but not any of the query gene IDs that correspond to the reference gene ID

I've also tried the last suggestion (list) and I get the following errors:

line 2: unexpected EOF while looking for matching `"' line 4: syntax error: unexpected end of file

My file didn't even have a line 4

ADD REPLY

Login before adding your answer.

Traffic: 1866 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6