6.5 years ago
Gene_MMP8
I want to download the neighborhood of mutations in a cancer genome. Say I have a mutation at location 'x' in a cancer sample, and I want to extract the 3 bases to the left and right of 'x'. My download speed is slow, so downloading those huge BAM files takes a long time. Is there a faster and easier way to get the neighborhood data without downloading the entire cancer dataset? Is there a querying tool that lets me extract a specific location in a cancer genome?
3 bases of what? The reference sequence? The BAM (and which reads would you use)?
3 bases from the cancer BAM file.
But what happens if one base is clipped, another read is REF, another read is a variant, another read has an insertion, etc.? And why would you want to do that?
Yeah, that makes sense. I am really confused! Can you suggest any other way to extract the neighborhood of the mutations? Previously I made a consensus sequence from the processed BAM file and then wrote a Python script to extract the neighborhood. But that requires downloading the BAM files. Is there a way around that?
What do I do if large indels are present? E.g.:
Ref: A T G G C G C A
Tumor bam: A C G TTA C T C A
Now I am using the first (ref) sequence to extract two bases to the left and right of the variant at the 6th position. In the ref sequence I get GC and CA, but in the tumor sequence it is AC and CA. So how should I proceed here?
Hello banerjeeshayantan,
Could you please tell us a little more about what you are trying to do and why? This will help a lot in giving proper assistance.
fin swimmer
Apologies for not making this clear. I want to extract the 3 bases to the left and right of a given mutation. Initially, I thought of using the tumor BAM file to get those bases. But in another thread someone pointed out that this is unnecessary and that the reference file and the VCF file are enough to get the neighborhood bases: the tumor.bam contains no extra information that is not already present in the reference file, and querying a BAM file is computationally expensive. So the advice was to use the reference file instead.
My reasoning is the same as in my reply above:
Ref: A T G G C G C A
Tumor bam: A C G TTA C T C A
Now I am using the first (ref) sequence to extract two bases to the left and right of the variant at the 6th position. In the ref sequence I get GC and CA, but in the tumor sequence it is AC and CA. So how should I proceed here?
Extract the sequences in tabular format using the reference FASTA and a BED file. IMO, the following is the way to take the bases around variants:
Hello,
Do you have information about whether the variants next to your variants are on the same strand? Is this information important? If you are sure that those variants are co-located, or if it doesn't matter, you could build a consensus sequence for the region.
From the bcftools manual:
fin swimmer
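The consensus approach suggested above can be sketched in Python as follows. This is only an illustration under stated assumptions: all variants are treated as present on the same haplotype, they do not overlap, and positions are 1-based. The functions `apply_variants` and `flanks_in_consensus` are hypothetical helpers, and the variant list is a hand-written reading of the Ref/Tumor example in this thread, not parsed from a real VCF. On real files, `bcftools consensus` applies VCF variants to a reference FASTA.

```python
def apply_variants(ref, variants):
    """Apply (pos, ref_allele, alt_allele) variants to a reference string.

    Returns the consensus sequence and a dict mapping each variant's
    1-based reference position to the 0-based start of its ALT allele
    in the consensus.
    """
    pieces = []
    alt_start = {}
    cursor = 0   # next unread 0-based index in ref
    length = 0   # length of the consensus built so far
    for pos, ref_allele, alt_allele in sorted(variants):
        i = pos - 1
        if i < cursor:
            raise ValueError(f"variant at {pos} overlaps a previous one")
        if ref[i:i + len(ref_allele)] != ref_allele:
            raise ValueError(f"REF allele mismatch at {pos}")
        # Copy the unchanged reference stretch, then the ALT allele.
        pieces.append(ref[cursor:i])
        length += i - cursor
        alt_start[pos] = length
        pieces.append(alt_allele)
        length += len(alt_allele)
        cursor = i + len(ref_allele)
    pieces.append(ref[cursor:])
    return "".join(pieces), alt_start


def flanks_in_consensus(consensus, start, alt_len, k=3):
    """Return the k bases left and right of an ALT allele in the consensus."""
    left = consensus[max(0, start - k):start]
    right = consensus[start + alt_len:start + alt_len + k]
    return left, right


ref = "ATGGCGCA"
# Hand-derived from the thread's example: T>C at 2, G>TTA at 4, G>T at 6.
variants = [(2, "T", "C"), (4, "G", "TTA"), (6, "G", "T")]

consensus, alt_start = apply_variants(ref, variants)
left, right = flanks_in_consensus(consensus, alt_start[6], len("T"), k=2)
print(consensus, left, right)
```

The position map is what handles the indel problem raised earlier: after an insertion the variant at reference position 6 no longer sits at index 5, so the flanks are taken at the shifted coordinate instead.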