Question

using python script to analyze ht-seq data

0

Entering edit mode

8.2 years ago

mdgallagher71 ▴ 20

Hi, I have been struggling with analyzing chromosome conformation capture (3C) data for almost 6 months now. We have BAM files containing the aligned reads to the reference genome, and we are using a custom python script to extract reads containing SNPs so that we can analyze the information in an allele-specific manner. I don't think understanding the molecular biology of what we're doing is important, because all the issues are technical in nature.

Basically what is happening is that we at first could not analyze half the samples using the script, and these samples gave an error message that says "IndexError: string index out of range", after running the script to analyze the bam files.

We were able to fix this issue, such that all samples could be successfully analyzed, by adding a new test to the script (by test I mean a filter/requirement that a read must pass in order to not be thrown out).

The new test requires that the alignment length matches the inferred alignment length from the CIGAR string. This script was written completely by a collaborator, so we don't know if this test makes sense, or why it should allow us to analyze the problematic BAM files. However, I think the basic idea is just to make sure that the alignments are trustworthy.

The problem, however, is that for the majority of alignments, the CIGAR string appears to not have enough information to correctly infer the alignment length, and when this happens, the CIGAR string spits out 125bp as the default inferred alignment length, because 125bp is the READ length we used in the experiment. Since the vast majority of reads have soft clipping, less than 125bp are included in the alignment, and because this disagrees with the prediction of the CIGAR string, the reads get tossed.

When we remove this alignment length test from the script we recover all the reads that we have lost, but we go back to the original problem of not being able to analyze several of the files due to the IndexError message shown above. This leads me to believe that something is wrong with the reads and/or alignments in these samples, but I'm not sure exactly what to look for or how to do it.

Any help would be greatly appreciated!!!!!!

SNP software error next-gen alignment python • 2.0k views

ADD COMMENT • link updated 8.2 years ago by Pierre Lindenbaum 164k • written 8.2 years ago by mdgallagher71 ▴ 20

1

Entering edit mode

If possible, share the problematic bam file ( subset of reads that are causing problem ) and the python script. Without looking at the script, its difficult to give any suggestions.

Other option is to state your biological problem and get suggestion regarding alternate ways of dealing with the problem.

ADD REPLY • link 8.2 years ago by GouthamAtla 12k

1

Entering edit mode

I suspect that there are a number of things going wrong here. For example, I would think that filtering out soft-clipped alignments is the exact wrong thing to do, since in some cases the soft-clipped portion is presumably informative of the chromosome conformation (i.e., you're removing the actual part of the signal that you care most about). I agree with Goutham Atla, if we can see the python script(s) and can be told generally how the data was processed up to this point then we can be more helpful.

ADD REPLY • link 8.2 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks to both of you, but we have fortunately figured out the problem, and now the samples run without error.

As a more general question, though, do you guys/girls think the length test itself (requiring that the inferred alignment length from the CIGAR string match the alignment length in the BAM file) is a reasonable thing to do? Or unnecessary?

Thanks again

ADD REPLY • link 8.2 years ago by mdgallagher71 ▴ 20

0

Entering edit mode

I think that test is an entirely unreasonable thing to do.

ADD REPLY • link 8.2 years ago by Devon Ryan 104k