Hi,
I have five BLASTn tabular files, each produced by querying the same large gene list against a different subject genome database. The goal was to identify possible orthologous sequences between the query gene list and the five subject genomes.
I was able to identify most, if not all, of the same genes in each genome.
Now I would like to concatenate the five files and sort them by gene identifier so that the names and sequences for the same gene from all five genomes end up in the same row, with one column per genome. The goal is to then extract the sequences for all five species for every gene and align them. I am working with a huge number of genes here.
Is there a way to do this using cat and sort, or would Python work better? I am a bit clueless as to how to do this in Python.
Thanks in advance, Zach
Yes. What have you tried?
What I'm not sure about is this: if a gene is missing from one of the BLAST files but present in the other four, wouldn't the genes fail to line up across all five species?
I have not actually used cat and sort, but have been reading that this might work. Do you have any ideas for a possible script?
This is all I have done so far. It grouped all the sequences from the same species together, so I need to figure out how to modify sort to order by gene first and then by species. I am not sure how to deal with genes that are missing in one genome but present in the others.
cat outputExpandedPA.blast.txt outputExpandedGS.blast.txt outputExpandedGG.blast.txt outputExpandedFG.blast.txt outputExpandedCl.blast.txt > Combined.txt | sort
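A note on the command above: `> Combined.txt` sends the output of cat to the file, so nothing reaches `sort` through the pipe and Combined.txt is just the five files back to back. Below is a minimal Python sketch of sorting by gene first and then by species. It assumes that the query (gene) ID is the first tab-separated column and that comment lines start with `#`, as in output format 7. The function names and the species labels are made up for illustration.

```python
def read_hits(lines, species):
    """Yield (gene, species, fields) for each hit row in one BLAST tabular file.

    Assumes the query/gene ID is the first tab-separated column and that
    comment lines start with '#'.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and outfmt 7 comment lines
        fields = line.rstrip("\n").split("\t")
        yield fields[0], species, fields


def sort_by_gene_then_species(tagged_inputs):
    """Combine hits from several files and sort by gene ID, then species label.

    tagged_inputs is an iterable of (species_label, iterable_of_lines) pairs;
    an open file handle works as the iterable of lines.
    """
    hits = []
    for species, lines in tagged_inputs:
        hits.extend(read_hits(lines, species))
    hits.sort(key=lambda hit: (hit[0], hit[1]))
    return hits
```

To use it on the real files, open each one and pass pairs such as `("PA", open("outputExpandedPA.blast.txt"))`. Tagging each row with a species label is what lets the sort keep hits for the same gene next to each other.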
Is the output in one of the tabular blast output formats? If not, doing a simple cat/sort will not work.
I outputted the blast results in output format 7, the one that gives the actual sequences of both subject and match.
The sort worked, but I'm just not sure how to modify it to line up the sequences for each gene so I can extract the sequences for each species and then align them.
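One possible way to line the genes up, including genes missing from one genome, is to build a table keyed by gene ID with one slot per species and a placeholder where there is no hit. The sketch below is only an illustration under several assumptions: the column positions (gene ID in column 0, bit score in column 11, aligned subject sequence in column 12) are guesses and should be adjusted to match the `# Fields:` line in your files; keeping only the highest-scoring hit per gene is one choice among several; and all function names are hypothetical.

```python
def best_hit_sequences(lines, gene_col=0, bitscore_col=11, seq_col=12):
    """Map gene ID -> subject sequence of its highest bit-score hit in one file.

    The column indices are assumptions; change them to match your output.
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        gene = fields[gene_col]
        score = float(fields[bitscore_col])
        if gene not in best or score > best[gene][0]:
            best[gene] = (score, fields[seq_col])
    return {gene: seq for gene, (score, seq) in best.items()}


def build_table(per_species, species_order, missing="NA"):
    """Return gene -> list of sequences, one per species in species_order.

    Genes without a hit in a species get the `missing` placeholder, so every
    row has the same number of columns.
    """
    genes = sorted(set().union(*(hits.keys() for hits in per_species.values())))
    return {
        gene: [per_species.get(sp, {}).get(gene, missing) for sp in species_order]
        for gene in genes
    }


def to_fasta(gene, row, species_order, missing="NA"):
    """Format one table row as FASTA text, skipping species with no hit.

    Gap characters from the BLAST alignment are removed so the sequences can
    be realigned together.
    """
    records = []
    for sp, seq in zip(species_order, row):
        if seq != missing:
            records.append(f">{gene}|{sp}\n{seq.replace('-', '')}\n")
    return "".join(records)
```

With the five files, you would call `best_hit_sequences` once per open file, collect the results in a dict keyed by labels such as "PA", "GS", "GG", "FG" and "Cl", pass that to `build_table`, and write the `to_fasta` text for each gene to its own file for the aligner.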
Have you tried any Bio-* parsers? - http://biopython.org/DIST/docs/tutorial/Tutorial.html - http://search.cpan.org/dist/BioPerl/Bio/SearchIO/blast.pm
I am familiar with Biopython, but have not used it for this task. Could you recommend a particular Biopython function for this?
Thanks, Zach