Setup
I have written an algorithm that works with breast cancer data from the TCGA database, and I am having trouble creating tests for it. Below I have posted some necessary background information along with specific questions.
Background
I currently have code that takes in a set of non-cancerous (i.e. healthy) BAM files, cancerous BAM files, and spits out some results dictionary. In the sequel I shall refer to this code as Black Box.
Black Box does a number of things with the BAM files, all of which I would like to test in one fell swoop. My idea for testing is to create a number of “truncated dummy_BAM files” which I then enter into Black_Box, and see if the results match my desired outcome. More precisely I want to write a “black_box_test” that takes in:
- Specifications for truncated dummy BAM files along with real bam headers
- My algorithm (i.e. the black box linked to in the image attached to the footnote)
- A desired results dictionary,
and outputs whether or not the the results dictionary generated by my black box is the same as desired results dictionary. I currently have code for the Black Box Test, which uses samtools-view, grep, head, and cat, to create truncated dummy BAM files with values I have specified and headers that correspond to real headers. In particular my algorithm, does the following:
- For both healthy and cancerous data, my takes a set of real BAMs corresponding to some {real_BAM_1,..,.real_BAM_k} along with specifications {dummy_BAM_info_1,..., dummy_BAM_info_k}, and creates dummy BAM files with headers from the real healthy_BAM and information.
- My test makes sure that both the healthy BAM headers and the cancerous BAM headers correspond to the healthy BAM headers and cancerous BAM headers of the same individual.
- My test insures that the healthy BAM, and the position list fed into mpileup share multiple loci (so the resulting mpileup should not be empty).
- My test insures that map quality is not a factor in the analysis, since it since all mapping quality values 255 (i.e mapping quality does not exist).
Unfortunately, my black box test code is giving me some unexpected behavior. In particular whenever I use samtools mpileup on a truncated dummy BAM and a corresponding position list that should output an mpileup with several lines, I get an empty mpileup. While trying to debug this problem I used Picard's ValidateSamFile function on the dummy SAM, that the dummy BAM is based on and I got the following error:
HISTOGRAM java.lang.String
Error Type Count
WARNING:MISSING_TAG_NM 5
WARNING:RECORD_MISSING_READ_GROUP 5
I am new to bioinformatics, and even after some googling , I was unable to figure out what exactly this "missing read group" error message meant. I figured that it meant that my header specified that there are read groups in the BAM file that never appeared in the BAM file.
Questions
- Does the general outline of my desired testing framework make sense? Is there any extra information I should provide to make it clearer?
- Do I have the correct interpretation of the Validate SAM file Error Message?
- Is this validate SAM file error the reason why I am getting an empty mpileup? In particular, is it the case that I can't run samtools mpileup on a BAM that is generated from an invalid SAMfile?
- Is there an easy way to fix this Validate SAM file problem? Also are these errors I am getting, a sure sign that my code that creates dummy BAMs is wrong?
- How does samtools mpileup, when given a single BAM and a position list (In particular the command I use shown below), filter values in the output file? I thought that it just simply includes all of the position and chromosomes and positions in the file list and disregards everything else. Is this correct? I was not sure if it perhaps included other information like map quality. So in my tests, I filtered all of my mapqs to 255 just in case.