Hi,I'm trying to follow the atacseq analysis workshop by NBIS. I directly download the bam file from ENCODE by
wget https://www.encodeproject.org/files/ENCFF045OAB/@@download/ENCFF045OAB.bam
I check the header,the file was produced by
@PG ID:bowtie2 PN:bowtie2 VN:2.3.4.3 CL:"/software/bowtie2-align-s --wrapper basic-0 -k 5 -X2000 --mm --threads 6 -x GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta -1 R1/ENCFF369XMA.trim.fastq.gz -2 R2/ENCFF997TXI.trim.fastq.gz"
I make a file that fits with workshop's statement "alignments were subset to include chromosome 14 and 1% of reads mapped to chromosomes 1 to 6 and chrM."
$SAMTOOLS view -hb -@ 3 bam/ENCFF045OAB.bam -o subset1/chr14/ENCFF045OAB_chr14.bam chr14
$SAMTOOLS view -hb -@ 3 bam/ENCFF045OAB.bam -o subset1/sample/ENCFF045OAB_other.bam chr1 chr2 chr3 chr4 chr5 chr6 chrM
$SAMTOOLS index subset1/chr14/ENCFF045OAB_chr14.bam
$SAMTOOLS index subset1/sample/ENCFF045OAB_other.bam
$SAMTOOLS view -s 101.01 subset1/sample/ENCFF045OAB_other.bam -b -o subset1/sample/ENCFF045OAB_downsample_1-6_M.bam
$SAMTOOLS merge -o subset1/ENCFF045OAB.chr14.bam subset1/chr14/ENCFF045OAB_chr14.bam subset1/sample/ENCFF045OAB_downsample_1-6_M.bam
$SAMTOOLS index subset1/ENCFF045OAB.chr14.bam
Then I filter the file with following key code,
$BAMUTILS filter subset1/ENCFF045OAB.chr14.bam subset1/ENCFF045OAB_rmblack_chr14.bam -excludebed ./ref/ENCFF356LFX.bed nostrand
$SAMTOOLS index subset1/ENCFF045OAB_rmblack_chr14.bam
$BAMUTILS filter subset1/ENCFF045OAB_rmblack_chr14.bam subset1/ENCFF045OAB_mtRM_rmblack_chr14.bam -excluderef chrM
$SAMTOOLS view -f 2 -q 5 -hb -o subset1/ENCFF045OAB_mapq5_mtRM_rmblack_chr14.bam subset1/ENCFF045OAB_mtRM_rmblack_chr14.bam
$SAMTOOLS index subset1/ENCFF045OAB_mapq5_mtRM_rmblack_chr14.bam
$PICARD MarkDuplicates I=subset1/ENCFF045OAB_mapq5_mtRM_rmblack_chr14.bam O=processedbam/ENCFF045OAB_rmdup_mapq5_mtRM_rmblack_chr14.bam \
M=processedbam/ENCFF045OAB.rmdup.metrics \
REMOVE_DUPLICATES=true \
ASO=coordinate
After filtering,when i try to plot insertsize by picard,a blue region ranging below about 100bp occurs.
picard CollectInsertSizeMetrics I= ENCFF045OAB_rmdup_mapq5_mtRM_rmblack_chr14.bam O=test.stats H=test2.pdf M=0.05
It seems that there are "RF"/outward reads in my bam file.I check the "test.stats" and find the "READ_PAIRS" column is "260038 RF " and " 1180113 FR ".Besides, command "samtools flagstats ENCFF045OAB_rmdup_mapq5_mtRM_rmblack_chr14.bam" gets an entry with "1440151 + 0 read2", so picard may use read2 for collect insertsize.Another command agrees with that."721643 + 718508 = 1440151"
$ samtools view ENCFF045OAB_rmdup_mapq5_mtRM_rmblack_chr14.bam |cut -f 2|sort | uniq -c
721643 147
718508 163
103360 323
118153 339
118610 355
103156 371
103152 387
118611 403
118155 419
103362 435
718509 83
721644 99
So the question is that if picard only use 2nd strand and primary alignment(147 +163) for cauculate insertsize why is there blue region? According to the definition,they should all be "FR".
Set "M=0.5" can make the plot beautiful because "RF" reads are removed,but they are still alive.
I have searched the topic and find a same post,but i am still confused how does it happen.
When I try to keep only good pair reads by "-F 256",I get the same blue region as expected(picard still choose flags "147 163" for plot).But if the blue region indicates outward reads,how does picard define it?
$ samtools view -F 256 ENCFF045OAB_rmdup_mapq5_mtRM_rmblack_chr14.bam -o good.bam
$ picard CollectInsertSizeMetrics I=good.bam O=test.stats H=test5_good.pdf M=0.05