Question

To match fragments within range - Python or awk

0

Entering edit mode

18 months ago

L_bioinfo • 0

I have a few genes that have been covered by different reads. One issue is that in certain cases where one gene is covered by multiple reads. I wish to catch all the fragments of reads that fit with gene length (range) [column 6 (genestart) and column 7 genestop] and add an end field stating the gene name along with alphabet "a" and for successive such finds I want the gene name with b added to it.

Input:

contig  230  3888 read1  1 8397 Gene1
contig  3343 3885 read2  1 8397 Gene1
contig  6030 8257 read4  1 8397 Gene1
contig  6030 8257 read10 1 8397 Gene1
contig  6260 8257 read7  1 8397 Gene1
contig  6030 8257 read8  1 8397 Gene1
contig  6190 6345 read10 1 8397 Gene1
contig   0   3124 read27 26 406 Gene12
contig  100  200  read30 26 406 Gene12

Desired output:

contig  230  3888 read1  1 8397 Gene1 Gene1a
contig  3343 3885 read2  1 8397 Gene1 covered
contig  6030 8257 read4  1 8397 Gene1 Gene1b
contig  6030 8257 read10 1 8397 Gene1 covered
contig  6260 8257 read7  1 8397 Gene1 covered
contig  6030 8257 read8  1 8397 Gene1 covered
contig  6190 6345 read10 1 8397 Gene1 covered
contig   0   3124 read27 26 406 Gene12 Gene12a
contig  100  200  read30 26 406 Gene12 covered

Code:

with open(input_file, "r") as f_in, open(output_file, "w") as f_out:
    gene_dict = {}
    for line in f_in:
        fields = line.strip().split()
        gene_name = fields[6]
        read_start = int(fields[1])
        read_end = int(fields[2])
        gene_start = int(fields[4])
        gene_end = int(fields[5])
        if gene_name not in gene_dict:
            gene_dict[gene_name] = []
        gene_found = False
        for gene_frag in gene_dict[gene_name]:
            frag_start = gene_frag[0]
            frag_end = gene_frag[1]
            if read_start <= frag_start <= read_end or read_start <= frag_end <= read_end:
                gene_found = True
                break
        if gene_found:
            gene_dict[gene_name].append((gene_start, gene_end))
            gene_count = chr(ord('a') + len(gene_dict[gene_name]) - 1)
            fields.append(f"{gene_name}{gene_count}")
        else:
            gene_dict[gene_name] = [(gene_start, gene_end)]
            fields.append(f"{gene_name}")
        f_out.write("\t".join(fields) + "\n")

I am not getting the desired output. The output I got

contig  230  3888 read1  1 8397 Gene1 Gene1
contig  3343 3885 read2  1 8397 Gene1 Gene1
contig  6030 8257 read4  1 8397 Gene1 Gene1
contig  6030 8257 read10 1 8397 Gene1 Gene1
contig  6260 8257 read7  1 8397 Gene1 Gene1
contig  6030 8257 read8  1 8397 Gene1 Gene1
contig  6190 6345 read10 1 8397 Gene1 Gene1
contig   0   3124 read27 26 406 Gene12 Gene12
contig  100  200  read30 26 406 Gene12 Gene12

The file is tab-delimited although here I haven't constructed the example with tab delimiter. Please provide pointers on where I am going wrong.

awk python • 765 views

ADD COMMENT • link 18 months ago by L_bioinfo • 0

0

Entering edit mode

and add an end field stating the gene name along with alphabet "a" and for successive such finds I want the gene name with b added to it.

I don't understand

ADD REPLY • link 18 months ago by Pierre Lindenbaum 164k

0

Entering edit mode

if read fragment is found within the gene length I want mark up as genename + an alphabet. with successive read fragment find within the same gene I want increment alphabet. I have depicted that in my desired output

ADD REPLY • link 18 months ago by L_bioinfo • 0

0

Entering edit mode

gene_count = chr(ord('a') + len(gene_dict[gene_name]) - 1)

I have a hard time wrapping my head around that ....

in general appending an a or a b to a name feels like an incorrect data model that you have in mind, I would suggest reformulating what you need and adding another column or finding something that marks the information more reliably. Even .1, .2 is better than just a that could be part of a valid name.