I have a file in the following format (this is RNA-Seq data):
chrID start stop Gene Sample1_ExonCount Sample2_ExonCount Sample3_ExonCount
chr1 300 350 ABC 200 187 167
chr1 400 512 ABC 112 110 105
chr1 534 587 ABC 87 76 55
chr1 612 687 PQR 12 15 24
chr1 812 898 PQR 10 13 12
chr2 .... ..... ..... ..............................................................
and so on.
Basically, I have to parse this file and calculate some statistics. For example, the first 3 records in this file are the exons of the gene ABC (i.e. ABC has 3 exons). The counts are, I believe, the number of reads mapping to these exons in each of the samples (1, 2 and 3).
First, I have to create a "Size" column and calculate the size of each exon as stop minus start. For example, the first record (the first exon of gene ABC) has size 350 - 300 = 50. I have to iterate through every record and calculate the size in this way.
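The size step could be sketched as follows. This is only a minimal sketch, assuming the file is whitespace-delimited with a single header row and that size is simply stop - start as in the example above. The names `SAMPLE_TEXT` and `add_size` are made up for illustration; a real file would be read with `open(path)` instead of `io.StringIO`.

```python
import io

# Sample rows copied from the question (hypothetical stand-in for the real file).
SAMPLE_TEXT = """\
chrID start stop Gene Sample1_ExonCount Sample2_ExonCount Sample3_ExonCount
chr1 300 350 ABC 200 187 167
chr1 400 512 ABC 112 110 105
chr1 534 587 ABC 87 76 55
chr1 612 687 PQR 12 15 24
chr1 812 898 PQR 10 13 12
"""


def add_size(lines):
    """Yield (chrom, start, stop, gene, size, counts) for each exon row."""
    lines = iter(lines)
    next(lines, None)  # skip the header row (assumed to be exactly one line)
    for line in lines:
        fields = line.split()  # assumes whitespace-delimited columns
        if not fields:
            continue  # ignore blank lines
        start, stop = int(fields[1]), int(fields[2])
        counts = [int(c) for c in fields[4:]]  # one count per sample
        yield fields[0], start, stop, fields[3], stop - start, counts


rows = list(add_size(io.StringIO(SAMPLE_TEXT)))
for row in rows:
    print(row)
```

Because `add_size` is a generator, it handles one line at a time and never needs the whole file in memory, which tends to matter for large files.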
Now comes the tricky part. For each gene, I have to add up the sizes. So, for gene ABC, I have 50 (1st record) + 112 (2nd record) + 53 (3rd record) = 215. I then divide this number by the total exon count for that gene in each sample, i.e. for gene ABC and sample 1 it will be 215/(200+112+87) = 215/399.
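One possible way to do the per-gene step is a single pass with running totals kept in dictionaries. This is a hedged sketch, not a definitive implementation: it assumes whitespace-delimited columns, one header row, and gene names that are unique across chromosomes. The function name `size_to_count_ratios` and the inline sample are hypothetical.

```python
import io
from collections import defaultdict

# First five records from the question (hypothetical stand-in for the real file).
SAMPLE_TEXT = """\
chrID start stop Gene Sample1_ExonCount Sample2_ExonCount Sample3_ExonCount
chr1 300 350 ABC 200 187 167
chr1 400 512 ABC 112 110 105
chr1 534 587 ABC 87 76 55
chr1 612 687 PQR 12 15 24
chr1 812 898 PQR 10 13 12
"""


def size_to_count_ratios(lines):
    """Return {gene: [total exon size / total exon count, per sample]}."""
    total_size = defaultdict(int)  # gene -> summed exon sizes
    total_counts = {}              # gene -> summed counts, one entry per sample
    lines = iter(lines)
    next(lines, None)  # skip the header row
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        gene = fields[3]  # assumes gene names are unique across chromosomes
        total_size[gene] += int(fields[2]) - int(fields[1])
        counts = [int(c) for c in fields[4:]]
        if gene in total_counts:
            total_counts[gene] = [a + b for a, b in zip(total_counts[gene], counts)]
        else:
            total_counts[gene] = counts
    # Guard against a sample with zero reads for a gene.
    return {
        gene: [total_size[gene] / c if c else float("nan") for c in total_counts[gene]]
        for gene in total_size
    }


ratios = size_to_count_ratios(io.StringIO(SAMPLE_TEXT))
print(ratios["ABC"][0])  # 215/399 for gene ABC, sample 1
```

Each line is read once and each dictionary update takes roughly constant time, so the running time should grow linearly with the number of records.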
Are there any fast computational techniques for doing this? Are there any Python modules specifically designed for parsing files like this quickly? I am just a beginner at programming and would really like to know. The last time I parsed such a file, I ran into serious performance issues.
Thanks.
Can you show the code you used last time to parse the file that had the performance issues? Maybe we can optimize it.