Motivation
The ENCODE data comes out, and luckily they provide both .bam
file and .bigwig
file. Thus, it occurs to me that I want to give a try to reproduce the data visualization with tool: BEDtools
and other related tools.
Result
I'll first upload the difference between my-version and official version:
Top to Bottom:
- Black: my-version-POSitive-strand.bigwig
- Blue: Official-version-POSitive-strand.bigwig
- Red: Official-version-REVerse-strand.bigwig
- Grey: my-version-REVerse-strand.bigwig
From the image, we will find my-version-data and official-version-data roughly share the same peaks, however, my-version-peaks are somehow masked by certain uniform noises. And it drives me crazy.
Note that I know not all the bioinformatics works can be reproduces, but this issue dose not get involved with much algorithms, decisions, etc. Therefore, it's supposed to be reproducible, I think.
Data Set
ENCODE/CSHL long RNA-seq Data set can be found here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/ And here I use K562-chromatin-subcellular fraction (Rep4) to explore as an example:
- BAM file ready for my Data Processing;
- bigWig Positive signal file ready for uploading to UCSC;
- bigwig Reverse signal file ready for uploading to UCSC.
Data Processing
BAM sort
samtools sort wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam.sort
Genome Coverage
I refer to the standard manual of BEDtools
, I'll use forward strand as example, and the reverse strand signal is generated in the same way.
genomeCoverageBed -bg -ibam wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam.sort -g hg19.chromInfo -strand + >K562-Chromatin-POS-4.bedgraph
Note that I've used -strand
flag to separate the two strands.
bedgraphtoBigWig
bedGraphToBigWig
executive script available from UCSC exe list
bedGraphToBigWig K562-Chromatin-POS-4.bedgraph hg19.chromInfo K562-Chromatin-POS-4.bigwig
Upload to ftp and finally to UCSC genome browser.
Discussion
I was wondering which filtering step I've missed.
I've checked whether all the reads in the .bam
file are unique mapped. As the reads are mapped to genome with a tool named, STAR.. According to the manual and common sense, the mapping quality in .sam file equaling 255 means unique mapped reads. Thus, all the reads in the .bam file are unique mapped after I've check the mapping quality.
Thus, any suggestions?
Welcome to BioStar. This is a great post. Well organized and formatted with step-by-step descriptions of what you did, what the problem is, and how you are currently thinking about the solution. Kudos for answering your own question and providing the update here!