Matching fragments within a range - Python or awk

I have a few genes that are covered by different reads; in some cases a single gene is covered by multiple reads. I want to catch all the read fragments that fit within the gene's length (the range given by the gene start in column 5 and the gene stop in column 6) and add an end field stating the gene name (column 7) with the letter "a" appended; for each successive such find I want the gene name with "b" appended, and so on. Fragments that fall within an already-labeled fragment should be marked "covered", as shown in the desired output below.

Input:

contig  230  3888 read1  1 8397 Gene1
contig  3343 3885 read2  1 8397 Gene1
contig  6030 8257 read4  1 8397 Gene1
contig  6030 8257 read10 1 8397 Gene1
contig  6260 8257 read7  1 8397 Gene1
contig  6030 8257 read8  1 8397 Gene1
contig  6190 6345 read10 1 8397 Gene1
contig   0   3124 read27 26 406 Gene12
contig  100  200  read30 26 406 Gene12

Desired output:

contig  230  3888 read1  1 8397 Gene1 Gene1a
contig  3343 3885 read2  1 8397 Gene1 covered
contig  6030 8257 read4  1 8397 Gene1 Gene1b
contig  6030 8257 read10 1 8397 Gene1 covered
contig  6260 8257 read7  1 8397 Gene1 covered
contig  6030 8257 read8  1 8397 Gene1 covered
contig  6190 6345 read10 1 8397 Gene1 covered
contig   0   3124 read27 26 406 Gene12 Gene12a
contig  100  200  read30 26 406 Gene12 covered

Code:

with open(input_file, "r") as f_in, open(output_file, "w") as f_out:
    gene_dict = {}  # gene name -> list of (start, end) tuples recorded so far
    for line in f_in:
        fields = line.strip().split()
        gene_name = fields[6]
        read_start = int(fields[1])
        read_end = int(fields[2])
        gene_start = int(fields[4])
        gene_end = int(fields[5])
        if gene_name not in gene_dict:
            gene_dict[gene_name] = []
        gene_found = False
        # does either end of a stored fragment fall inside this read?
        for gene_frag in gene_dict[gene_name]:
            frag_start = gene_frag[0]
            frag_end = gene_frag[1]
            if read_start <= frag_start <= read_end or read_start <= frag_end <= read_end:
                gene_found = True
                break
        if gene_found:
            # overlap found: record the gene coordinates and append a letter label
            gene_dict[gene_name].append((gene_start, gene_end))
            gene_count = chr(ord('a') + len(gene_dict[gene_name]) - 1)
            fields.append(f"{gene_name}{gene_count}")
        else:
            # no overlap: replace the stored list and append the bare gene name
            gene_dict[gene_name] = [(gene_start, gene_end)]
            fields.append(f"{gene_name}")
        f_out.write("\t".join(fields) + "\n")

I am not getting the desired output. This is the output I get:

contig  230  3888 read1  1 8397 Gene1 Gene1
contig  3343 3885 read2  1 8397 Gene1 Gene1
contig  6030 8257 read4  1 8397 Gene1 Gene1
contig  6030 8257 read10 1 8397 Gene1 Gene1
contig  6260 8257 read7  1 8397 Gene1 Gene1
contig  6030 8257 read8  1 8397 Gene1 Gene1
contig  6190 6345 read10 1 8397 Gene1 Gene1
contig   0   3124 read27 26 406 Gene12 Gene12
contig  100  200  read30 26 406 Gene12 Gene12

The file is tab-delimited, although I haven't constructed the example here with real tabs. Please give me some pointers on where I am going wrong.


add an end field stating the gene name with the letter "a" appended; for each successive such find I want the gene name with "b" appended, and so on.

I don't understand.


If a read fragment is found within the gene length, I want to mark it with the gene name plus a letter; for each successive read fragment found within the same gene, I want to increment the letter. I have depicted that in my desired output.

gene_count = chr(ord('a') + len(gene_dict[gene_name]) - 1)

I have a hard time wrapping my head around that...

In general, appending an "a" or a "b" to a name feels like an incorrect data model. I would suggest reformulating what you need and adding another column, or finding something that marks the information more reliably. Even ".1", ".2" is better than a bare "a" that could be part of a valid name.
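
For illustration, a small hypothetical sketch of that numeric-suffix style (gene_name and the counter n are made up for the example):

# hypothetical illustration: numeric suffixes instead of letters
gene_name = "Gene1"
for n in (1, 2, 3):
    print(f"{gene_name}.{n}")  # prints Gene1.1, Gene1.2, Gene1.3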


I want to add a, b, ... z, similar to .1, .2, etc.
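
For reference, here is a minimal sketch of one way to produce the desired output shown in the question. It assumes, based on that output, that a fragment gets a new letter only when it is not contained in a fragment already labeled for the same gene, and is marked "covered" otherwise; input_file and output_file are placeholder names.

# Minimal sketch. Assumptions: tab-delimited input with the columns shown above;
# "covered" means the read lies within an already-labeled fragment of the same gene.
input_file = "reads.tsv"      # placeholder path
output_file = "labeled.tsv"   # placeholder path

with open(input_file) as f_in, open(output_file, "w") as f_out:
    labeled = {}  # gene name -> list of (start, end) of fragments already labeled
    for line in f_in:
        fields = line.strip().split()
        read_start, read_end = int(fields[1]), int(fields[2])
        gene_name = fields[6]
        frags = labeled.setdefault(gene_name, [])
        # covered if this read lies inside any fragment already labeled for the gene
        if any(s <= read_start and read_end <= e for s, e in frags):
            fields.append("covered")
        else:
            # new fragment: gene name plus the next letter (a, b, ...)
            frags.append((read_start, read_end))
            fields.append(gene_name + chr(ord('a') + len(frags) - 1))
        f_out.write("\t".join(fields) + "\n")

The main differences from the posted code: it stores the read coordinates rather than the gene coordinates, it never replaces the per-gene list, and the containment test asks whether the new read lies inside a stored fragment rather than the reverse. Note that the letters run out after "z" (26 fragments per gene), which is another argument for the numeric suffixes suggested above.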
