Dear all,
I would appreciate having your insight on the following situation when dealing with a bam file that contains paired end reads of ChIP-seq and a specific spike-in (sequence A).
I do see a discrepancy between the number of sequenceA reported by "samtools idxstats" versus "grep sequenceA". The question is why ?
A. shall I work with the entire bam file, that contained both Read1 and Read2 of paired-end pair (ALL reads), and spike-ins sequence A, and samtools idxstats produces the following output regarding the number of these sequences in the file:
*sequence_A :: 313668
B. Shall I separate the BAM file into 2 bam files that contain either FWD or REV reads, the results of samtools idxstats show 1/2 of the number of sequence A :
<> FWD reads :
"sequence _A : 156713
<> REV reads :
" sequence_A : 156955"
C. So far, so good. The question arises when I check the number of Sequence_A in the BAM files with ALL reads, FWD or REV reads, by using "grep" the numbers are :
<> ALL reads :
samtools view ALL_READS.bam | grep "Sequence A" | wc -l :: 130481
<> FWD reads :
samtools view FWD_READS.bam | grep "Sequence A" | wc -l :: 0
<> REV reads :
samtools view REV_READS.bam | grep "Sequence A" | wc -l :: 130481
**
The code that I am using is shown below :
samtools view -b -F 4 $FILE -o "${FILE%.bam}.map.bam"
samtools sort "${FILE%.bam}.map.bam" -o "${FILE%.bam}.map.sort.bam"
samtools index "${FILE%.bam}.map.sort.bam"
samtools idxstats "${FILE%.bam}.map.sort.bam" | tee "${FILE%.bam}.map.sort.idxstats" > /dev/null
samtools view -h "${FILE%.bam}.map.sort.bam" -o "${FILE%.bam}.map.sort.sam"
####################################################################################
#################################################################################### to separate into FORWARD strand
samtools view -F 0x10 "${FILE%.bam}.map.sort.bam" -bh -o "${FILE%.bam}.map.sort.fwd.bam"
samtools index "${FILE%.bam}.map.sort.fwd.bam"
samtools idxstats "${FILE%.bam}.map.sort.fwd.bam" | tee "${FILE%.bam}.map.sort.fwd.idxstats.txt" > /dev/null
####################################################################################
#################################################################################### to separate into REVERSE strand
samtools view -f 0x10 "${FILE%.bam}.map.sort.bam" -bh -o "${FILE%.bam}.map.sort.rev.bam"
samtools index "${FILE%.bam}.map.sort.rev.bam"
samtools idxstats "${FILE%.bam}.map.sort.rev.bam" | tee "${FILE%.bam}.map.sort.rev.idxstats.txt" > /dev/null
**
I believe in this case sequenceA would refer to e.g. a chromosome that the reads are aligned to (I say this because that is what idxstats does, it reports the total reads on each chromosome basically)
Yes, indeed, we use sequence A as an "additional chromosome" beside the usual chromosomes 1, ... 22, X and Y.
A question for our biostar administrators : is it OK if I post the same question on StackOverflow in order to learn other people's opinions ? Thank you.
If you feel that you will get useful answers then feel free to post at SO. Please reference this thread in your SO post so people are aware of the cross-posting. Perhaps add a link for the cross-posted thread here in original post as well.
I am not certain what exactly you are asking above. You only looked for forward and reverse reads but did you account for secondary/ supplementary alignments? Use of
grep
is simply to look for wordsequence A
nothing more correct?Thank you, yes, a good point about excluding secondary or supplementary alignments.
bogdan - what ethnicity ? are the samples of a different eth than the genomic ref?
VAL
Hi Vincent, it is a good question. I know that we work with human samples, but I am not sure about sample ethnicity. An information that I have not included in my message was that Sequence A, Sequence B, etc are barcode sequences that we find in .bam files. In this particular case, we would not expect SNPs in their sequences.
Bogdan,
this issue has clung to my memory. have you reached resolution?
VAL
Hi Vincent, yes, I will post the answers a bit later :)
oke we are waiting. :-)