Python Script To Format Blast Output To Get Number Of Duplicated Genes
1
0
Entering edit mode
11.1 years ago
armahmud3 • 0

How to format blastn output file to get number of duplicated genes within a genome. Output file will be number of duplicated genes at various % of sequence identity ( 0 to 100%, increment of 10). I need a python script to write this and output file should consist 10 lines and save as csv file.

Example of output file % sequence identity and gene alignment (before comma) and number of duplicated genes (after comma)

[0,13456] [10,234] [20,234] [30,200] [40,190] [50,187] [60,187] [70,100] [80,95 ] [90,55] [100,45]

please kindly help me. Thanks in advance.

python genes • 5.7k views
ADD COMMENT
0
Entering edit mode

I remember that you asked similar question earlier. Could you please provide some examples input file ?

ADD REPLY
0
Entering edit mode

I am in a serious problem regarding this python script. Don't get solution yet and this is why I am asking repeatedly for help in this forum. Input file should be a single genome file in genbank format like NC_011000.gbk ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Burkholderia_cenocepacia_J2315_uid57953/ I am a beginner in programing language. I try to solve this problem by reading biopython tutorial , cookbook. unfortunately I did not write the script yet.

ADD REPLY
0
Entering edit mode

Why you want to start task from .gbk files ? Why not .fasta files ? Just wanna reconfirm that you were saying about BLASTN ?

ADD REPLY
0
Entering edit mode
11.1 years ago
David W 4.9k

I presume you are trying to BLAST a set of genome sequences against themselves, and find apparent paralogs?

In that case you first need to run blastn locally, NCBI gives you loads of documentation on this. If you want to use Biopython to do the processing then have the blast executable spit out an xml file (-outfmt= 5).

You should read the Biopython tuotrial section on BLAST parsing to get an idea of what each blast record is representing. But to get results on the top hit for each record you could do something like

from Bio.Blast import NCBIXML

recs = NCBIXML.parse(open("path_to_your_blast.xml"))

def score_top_hit(rec):  
    """ """                               
    if rec.alignments:
        top_hit = rec.alignments[0].hsps[0]
        score =  float(top_hit.identities)/ top_hit.align_length
    else:
        score = 0.0
    return score

  scores = [score_top_hit(r) for r in recs]

There is almost certainly a more memory/time efficient way of getting the number of sequences above a given cut than this, but it'll work

cuts = [i/100.0 for i in range(0,110, 10)] #hacky way of making a range of floats - must be nicer way!
zip(cuts,  [ sum([c > s for s in scores]) for c in cuts])

EDIT to cover some of armahmud3's comments below

Hi armahmud3. I gather this is a homework question. It's great that your teacher encourages you to seek help form others, and Biostar can certainly help you learn bioinformatics. But I don't think anyone is served by providing with a "copy paste" a solution to your problem. Instead, I'll give you some pointers about how to attack the problem.

From your problem statement: you want to BLAST a set of gene sequences against each other, collect some information form the resulting output, and summarise it in a .csv file. Although Biopython does have "wrapper" scripts to help you run BLAST locally, you don't need to use them. Instead you can simply use the BLAST executables form the commandline. The NCBI's documentation of the BLAST programmes is very good. You'll want to download your sequences, use makeblastdb to create a database and blastn to run your search.

The code I gave you above will count up the number of sequences where the "top hit" is above or below each threshold. You should also pay around a bit with the BLAST parser, to get a feel of what each record represents. In an interactive session, check out:

>>> recs = NCBIXML.parse(open("path_to_your_blast.xml"))
>>> r = recs.nex()
>>> dir(r)
>>> r.alignments

Finally you want to spit the results out as .csv file. Read up on python input and output handling to see how to do this.

ADD COMMENT
0
Entering edit mode

Thank you so much David W. If you don't mind could you please kindly give me a example of biopython script to blast a set of genome sequences against themselves? A tutorial script will be very helpful for me. Another is what will be the command line in python if i want to save the output file as .csv format from your aforementioned script? David, sorry for long reply but please kindly look at the following problem, actually this is the main problem which i try to fragmented and want to solve.

Problem: A homologous gene group consists of genes, which have an alignment to at least one other gene in the group with both a % sequence coverage and % identity equal to or higher than a given stringency cutoff. Write a python program to find the number of homologous gene groups within a genome at various levels of stringency. For example with a stringency of 10, every gene in a group should have a % sequence coverage and % identity of at least 10% with at least one other gene in the group. The program should take a single genome file in genbank format on the command line and write a csv file called "output.csv" in the directory. The ouput file should consist of 10 lines, each line should have the stringency used (0 to 100, in increments of 10) and the number of homology groups. (See example below). Submit the python program as a txt file.

[0,13456] [10,234] [20,234] [30,200] [40,190] [50,187] [60,187] [70,100] [80,95 ] [90,55] [100,45]

"Stringency cutoff was defined by % of coverage in the gene alignment and % sequence identity"

I have full permission to help from peer, mentor, bioinformatics forum and sites. I need your valuable suggestions regarding this problem.

ADD REPLY

Login before adding your answer.

Traffic: 1778 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6