I uploaded a single FASTA file with multiple gene clusters from different organisms to an online program called antiSMASH, or fungiSMASH in my case. Most clusters have a ketosynthase (KS) gene in them. antiSMASH identified the putative KS genes and provided me with an output in a text file. Can anyone assist me in extracting (parsing may be the correct term?) the genes of interest (nucleotide sequences) with associated accession number and definition, or just the taxon name from the definition?
I believe all of the genes of interest will say:
/aSDomain="PKS_KS"
And if this is present then I will want the range of nucleotides indicated adjacent to the heading aSDomain. For example:
aSDomain 2527..3816
However, sometimes the adjacent numbers will say something like:
aSDomain join(1610..1702,1756..2109,2165..3010)
In which case I believe I would want to concatenate each range indicated.
Or:
aSDomain complement(join(10640..10648,10717..11439))
In which case I believe I would want to concatenate all ranges and then take the reverse complement of the result.
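To make the three location forms above concrete, here is a minimal pure-Python sketch of how such a location string could be turned into a sequence. The function name `extract_by_location` is made up for illustration and is not part of antiSMASH or Biopython. It assumes the coordinates are 1-based and inclusive (the GenBank convention), that `complement(...)` only ever wraps the whole location, and that "complement" means the reverse complement of the joined pieces. If the file is read with Biopython, `feature.extract(record.seq)` should handle all of this without any hand-written parsing.

```python
import re

# Translation table for complementing DNA (N stays N).
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def extract_by_location(sequence, location):
    """Return the subsequence described by a GenBank-style location string.

    Handles 'a..b', 'join(a..b,c..d)' and 'complement(...)' around either.
    """
    location = location.strip()
    reverse = location.startswith("complement(") and location.endswith(")")
    if reverse:
        location = location[len("complement("):-1]
    if location.startswith("join(") and location.endswith(")"):
        location = location[len("join("):-1]

    parts = []
    for start, end in re.findall(r"<?(\d+)\.\.>?(\d+)", location):
        # GenBank coordinates are 1-based and inclusive,
        # Python slices are 0-based and end-exclusive.
        parts.append(sequence[int(start) - 1:int(end)])
    joined = "".join(parts)

    if reverse:
        # Complement every base, then reverse the whole string.
        joined = joined.translate(_COMPLEMENT)[::-1]
    return joined
```

For example, `extract_by_location(seq, "complement(join(10640..10648,10717..11439))")` would join the two ranges in the order given and then reverse-complement the result.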
I would like to do an alignment and then make a phylogenetic tree based on the extracted KS genes. I believe FASTA format would be a good output to have my KS genes in, but I can convert if necessary.
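As a rough sketch of the FASTA output described here, each extracted KS sequence could be written as one record with the accession and taxon in the header. The helper name `to_fasta`, the 60-character line width, and the choice to join the header fields with underscores are all assumptions for illustration; underscores are used only because some alignment and tree programs cut the header at the first space.

```python
def to_fasta(accession, taxon, sequence, width=60):
    """Format one sequence as a FASTA record with a space-free header."""
    # Join accession and taxon words with underscores, e.g. ACC1_Cladonia_grayi.
    header = ">" + "_".join([accession] + taxon.split())
    # Wrap the sequence into fixed-width lines.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"
```

Writing the returned strings one after another into a single file gives a multi-FASTA file that most alignment programs accept as input.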
This is my first time using antiSMASH and I'm new to coding, so I apologize for any obvious blunders. I would have preferred to attach a file of my output data, but I didn't see that as an option. Thanks in advance for any help!
Here's a link to my output:
https://drive.google.com/open?id=1KWbh3D7jY7u5AytlGCLxC65MR_KKUY7X
If someone has a better way of attaching a large text file (~1.7 million characters), I'm all ears.
Here's a link to all of the output that antiSMASH generated (not just the text file of all annotated genes):
https://drive.google.com/open?id=1KWbh3D7jY7u5AytlGCLxC65MR_KKUY7X
Thank you for your answer, SMK! This code worked very well. There's one more piece of information I'd like included: the genus and species names given under "DEFINITION". For example, from this line I'd like just the words Cladonia grayi so that later on my final phylogenetic tree can include the taxa.
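One possible way to pull out just the genus and species, sketched under the assumption that the DEFINITION text always begins with the binomial name (as in the Cladonia grayi case). The function name `taxon_from_definition` and the example string in the comment are hypothetical. With Biopython, the DEFINITION text is available as `record.description`, and `record.annotations["organism"]` holds the ORGANISM field, which may be the more reliable source when it is filled in.

```python
def taxon_from_definition(definition):
    """Return the first two words of a DEFINITION string as 'Genus species'.

    Assumes the definition starts with the binomial name; anything after
    the second word (strain, gene name, etc.) is dropped.
    """
    words = definition.split()
    if len(words) < 2:
        # Too short to contain a binomial; return it unchanged.
        return definition.strip()
    return " ".join(words[:2])


# Hypothetical definition text, for illustration only:
# taxon_from_definition("Cladonia grayi some gene cluster description")
```

Definitions that start with something other than the species name (for example "Uncultured ..." or a strain prefix) would need extra handling.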
You can change the code to:
Which gives me:
This worked great, thank you again SMK.
One more question, if you don't mind. The previous results were generated by the online version of antiSMASH. I am now running a packaged version of antiSMASH from the bash command line, and the output is different. I tried to change your code to process this slightly different output, but I was unable to find a good solution. Would you mind helping me make these changes?
Here's the output that I am now working with: https://drive.google.com/open?id=1FT5r6QPAutEcDCgEvgu9O141H3w6L_5W
You're welcome. Changing from
if feature.qualifiers["aSDomain"][0] == "PKS_KS":
to
if feature.qualifiers["domain"][0] == "PKS_KS":
should work again.
If an answer was helpful you should upvote it; if the answer resolved your question you should mark it as accepted.
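If both kinds of output need to be processed with the same script, one option is a small check that accepts either qualifier name. This is only a sketch: the helper name `is_ks_domain` is invented here, and it assumes `feature.qualifiers` behaves like a dictionary mapping qualifier names to lists of strings, as it does in Biopython.

```python
def is_ks_domain(qualifiers, keys=("aSDomain", "domain")):
    """Return True if any of the given qualifier keys is set to 'PKS_KS'.

    `qualifiers` is expected to be dict-like (e.g. feature.qualifiers),
    mapping qualifier names to lists of values.
    """
    for key in keys:
        # .get avoids a KeyError for features that lack the qualifier.
        values = qualifiers.get(key)
        if values and values[0] == "PKS_KS":
            return True
    return False
```

In the loop over features, the condition would then become `if is_ks_domain(feature.qualifiers):`, which also skips features that carry neither qualifier instead of raising a `KeyError`.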