Forgive me. I'm probably about to sound quite stupid.
I'm trying to help my partner learn a bit of Python for her studies. It's going OK, but now I want to put it to practical use for her.
I noticed she was doing some work in a program called IGV (I think?), where she loads a FASTA file and a BAM file and it draws a graph. She was then going individually to each point where a difference between the sequences was highlighted and noting down the C or GC percentage change between her BAM results file and the FASTA reference file.
I thought to myself, this must be doable in Python or something, and now it's hooked me a bit because I can't find the answer...
I can import SAM/BAM etc. using pysam and I can read the FASTA file using Biopython. But now that I'm having a bit of a play around and a google... I can see nowhere on the internet where this is even a thing. I'm wondering if something has been lost in translation between her and her teacher (English is not her first language).
I've asked her to email for some clarification, but what with Covid, it's taking a while to get replies, so in the meantime I thought I might post to see if someone can enlighten me as to whether this is actually something or not :)
I thought I could just take the relevant part of the genome sequence and the corresponding part of the BAM file, compare the strings using loops etc., and calculate with no issue, but it seems I can't create a single sequence string when parsing the BAM file because of all the overlapping reads.
Any ideas? Apologies if I haven't explained things the best, it's almost 3rd hand information at the moment! Hopefully I can get a better idea this evening :)
Thanks,
Lawrence
Thanks for the quick reply, I'm slowly getting there. I will try your key words "variant calling"... thanks again :)
Would recommend vcftools for general command line use, or, if you really want to use Python, PyVCF is another option.
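On the overlap problem from the question: if you just want to see the idea before reaching for a variant caller, one way is to count the bases seen at each position across all overlapping reads and take the most common one. Below is a minimal pure-Python sketch of that. It is not what IGV or a real variant caller does, and it assumes the reads are already reduced to hypothetical `(start, sequence)` pairs with no insertions, deletions or clipping. The function names and the toy data are made up for illustration. With a real BAM I believe pysam's `AlignmentFile.pileup` or `count_coverage` is where the per-position bases would come from.

```python
from collections import Counter


def gc_percent(seq):
    """Return the GC content of seq as a percentage (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def consensus_from_reads(reads, start, end, reference):
    """Build a majority-vote sequence for the window [start, end).

    reads     -- iterable of (read_start, read_sequence), 0-based, ungapped
                 (assumption: no indels or clipping, unlike real alignments)
    reference -- reference string covering exactly [start, end); it is used
                 as a fallback wherever no read covers a position
    """
    counts = [Counter() for _ in range(end - start)]
    for read_start, read_seq in reads:
        for offset, base in enumerate(read_seq):
            pos = read_start + offset
            if start <= pos < end:
                counts[pos - start][base.upper()] += 1

    consensus = []
    for i, counter in enumerate(counts):
        if counter:
            # Ties are broken by whichever base was seen first.
            consensus.append(counter.most_common(1)[0][0])
        else:
            consensus.append(reference[i].upper())
    return "".join(consensus)


def differences(reference, consensus, start=0):
    """List (position, reference_base, consensus_base) where the two differ."""
    return [
        (start + i, ref_base, con_base)
        for i, (ref_base, con_base) in enumerate(zip(reference.upper(), consensus))
        if ref_base != con_base
    ]


# Hypothetical toy data: a 10-base reference and three overlapping reads.
reference = "ATGCATGCAT"
reads = [(0, "ATGCG"), (2, "GCGTGC"), (4, "GTGCAT")]

consensus = consensus_from_reads(reads, 0, len(reference), reference)
print(consensus)
print(differences(reference, consensus))
print(gc_percent(reference), gc_percent(consensus))
```

In the toy data all three reads disagree with the reference at position 4, so that is the only difference reported, and the GC percentage of the window changes accordingly. Real data also needs base quality, mapping quality and indels handled, which is the part a variant caller does for you.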