I have a few genes that have been covered by different reads. One issue is that in certain cases where one gene is covered by multiple reads. I wish to catch all the fragments of reads that fit with gene length (range) [column 6 (genestart) and column 7 genestop] and add an end field stating the gene name along with alphabet "a" and for successive such finds I want the gene name with b added to it.
Input:
contig 230 3888 read1 1 8397 Gene1
contig 3343 3885 read2 1 8397 Gene1
contig 6030 8257 read4 1 8397 Gene1
contig 6030 8257 read10 1 8397 Gene1
contig 6260 8257 read7 1 8397 Gene1
contig 6030 8257 read8 1 8397 Gene1
contig 6190 6345 read10 1 8397 Gene1
contig 0 3124 read27 26 406 Gene12
contig 100 200 read30 26 406 Gene12
Desired output:
contig 230 3888 read1 1 8397 Gene1 Gene1a
contig 3343 3885 read2 1 8397 Gene1 covered
contig 6030 8257 read4 1 8397 Gene1 Gene1b
contig 6030 8257 read10 1 8397 Gene1 covered
contig 6260 8257 read7 1 8397 Gene1 covered
contig 6030 8257 read8 1 8397 Gene1 covered
contig 6190 6345 read10 1 8397 Gene1 covered
contig 0 3124 read27 26 406 Gene12 Gene12a
contig 100 200 read30 26 406 Gene12 covered
Code:
with open(input_file, "r") as f_in, open(output_file, "w") as f_out:
gene_dict = {}
for line in f_in:
fields = line.strip().split()
gene_name = fields[6]
read_start = int(fields[1])
read_end = int(fields[2])
gene_start = int(fields[4])
gene_end = int(fields[5])
if gene_name not in gene_dict:
gene_dict[gene_name] = []
gene_found = False
for gene_frag in gene_dict[gene_name]:
frag_start = gene_frag[0]
frag_end = gene_frag[1]
if read_start <= frag_start <= read_end or read_start <= frag_end <= read_end:
gene_found = True
break
if gene_found:
gene_dict[gene_name].append((gene_start, gene_end))
gene_count = chr(ord('a') + len(gene_dict[gene_name]) - 1)
fields.append(f"{gene_name}{gene_count}")
else:
gene_dict[gene_name] = [(gene_start, gene_end)]
fields.append(f"{gene_name}")
f_out.write("\t".join(fields) + "\n")
I am not getting the desired output. The output I got
contig 230 3888 read1 1 8397 Gene1 Gene1
contig 3343 3885 read2 1 8397 Gene1 Gene1
contig 6030 8257 read4 1 8397 Gene1 Gene1
contig 6030 8257 read10 1 8397 Gene1 Gene1
contig 6260 8257 read7 1 8397 Gene1 Gene1
contig 6030 8257 read8 1 8397 Gene1 Gene1
contig 6190 6345 read10 1 8397 Gene1 Gene1
contig 0 3124 read27 26 406 Gene12 Gene12
contig 100 200 read30 26 406 Gene12 Gene12
The file is tab-delimited although here I haven't constructed the example with tab delimiter. Please provide pointers on where I am going wrong.
I don't understand
if read fragment is found within the gene length I want mark up as genename + an alphabet. with successive read fragment find within the same gene I want increment alphabet. I have depicted that in my desired output
I have a hard time wrapping my head around that ....
in general appending an
a
or ab
to a name feels like an incorrect data model that you have in mind, I would suggest reformulating what you need and adding another column or finding something that marks the information more reliably. Even.1
,.2
is better than justa
that could be part of a valid name.I want to add like a, b,... z. Similar to .1,.2 etc