What if I split the BAM file by chromosome using following command samtools view -b input.bam chr1 chr2 chr3 chr4 > output.bam and then Cufflinks is done. Would it produce false positive? Can somebody help me understand or give me some reference article? Actually it has been hampering my work for long and I would be extremely grateful.
Your post has already been answered, and I agree with Andrew (a sentiment I echo in our other thread), but I believe that it also depends on the type of experiment that you're doing: The one part where you should not split by chromosome is if you are looking to use the normalised FPKM (or geometric) values from Cufflinks. Most people don't use these values from Cufflinks, and instead use Cufflinks for just producing an individual transcriptome reference GTF per sample. In this case, I cannot see how spliting the BAM and then anlysing separately is an issue. I would alsosplit any reference/guide GTF that you're supplying, too.
This guy agrees: http://seqanswers.com/forums/showthread.php?t=8866
This person has evidence that splitting is not ok: https://groups.google.com/forum/#!topic/tuxedo-tools-users/59Tk3qZlbp8
I would encourage you to run an experiment with a BAM file separely for chr1 and chr2, and then a third time using chr1+chr2. Your system should be capable to perform this. Let us know what results you find.
I run Cufflinks with BAM file for chr1 and chr2 separately and then a third time with BAM file of chr1+chr2 and the number of processed loci for chr1 is 46184 and for chr2 is 44979. For chr1+chr2, it 46184+44979=91163
Thanks for the test! Did you use the -G parameter and provide a reference GTF?
Are the FPKM values in the results files the same?
I did not use the -G parameter, my command was: Cufflink -p 4 -o out.put BAM. FPKM values were different
Yes, that makes sense. The FPKM values are different because the read counts are normalised, in part, depending on the distribution of counts across all chromosomes.
So, the ultimate question for you is this: Why did you originally want to use TopHat/Cufflinks?
If you just want to identify novel transcripts and then use another program to determine raw counts over these, then you can split the BAM and identify transcripts independently per chromosome (and merge them into a single transcriptome reference GTF at the end)
If you intend to use Cuffdiff, Cuffnorm, etc., then you cannot feasibly split the BAM by chromosome
As an example for you: I recently conducted an experiment where I just wanted to identify novel transcript regions in my FASTQ files. I ran TopHat2/Cufflinks independently on each, and produced a new consensus reference GTF. From this, overlaying these new identified transcripts with my reference FASTA genome, I then produced a new reference transcriptome in FASTA (using the
gffread
function in cufflinks). I then used Kalliso to determine raw counts in my samples over this new reference transcriptome.In my example, I could have split the files by chromosome.
Thank you Kevin! My aim is to find novel transcript so I can split the BAM by chromosome.