I am trying to running cuffmerge to use for relative gene expression down the pipeline. I acquired my reference genome from UCSC. I have run tophat on the data from two flow cell reads (taken from an Illumina HiSeq 2500 machine) e.g.
tophat \
-p 20 \
-o tophat_out_L001 \
--library-type fr-unstranded \
-r 75 \
--mate-std-dev 60 \
/path/to/Data/Mus_musculus/UCSC/mm10/Sequence/BowtieIndex/genome \
/path/to/Data/Raw_Data_fastq/Sample01_L001_R1_001.fastq \
/path/to/Data/Raw_Data_fastq/Sample01_L001_R2_001.fastq
tophat \
-p 20 \
-o tophat_out_L002 \
--library-type fr-unstranded \
-r 75 \
--mate-std-dev 60 \
/path/to/Data/Mus_musculus/UCSC/mm10/Sequence/BowtieIndex/genome \
/path/to/Data/Raw_Data_fastq/Sample01_L002_R1_001.fastq \
/path/to/Data/Raw_Data_fastq/Sample01_L002_R2_001.fastq
Then I merge the data, e.g.
samtools merge L001_L002/out.bam tophat_out_L001/accepted_hits.bam tophat_out_L002/accepted_hits.bam
I then sort the data, e.g.
samtools sort L001_L002/out.bam L001_L002/out_sorted
I then run cufflinks on it, e.g.
cufflinks \
-p 20 \
-o cufflinks_gene_expression \
--library-type fr-unstranded \
--GTF /path/to/Data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf \
--no-update-check \
L001_L002/out_sorted.bam
This all seems fine and well, until I run cuffmerge. All the data files are stored in Sample0*/cufflinks_gene_expression/*
:
cuffmerge \
-p 20 \
-o merged \
-g /path/to/Data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf \
-s /path/to/Data/Mus_musculus/UCSC/mm10/Sequence/Chromosomes/ \
assembly_GTF_list.txt
where assembly_GTF_list.txt
contains
Sample01/cufflinks_gene_expression/transcripts.gtf
Sample02/cufflinks_gene_expression/transcripts.gtf
Sample03/cufflinks_gene_expression/transcripts.gtf
Sample04/cufflinks_gene_expression/transcripts.gtf
Sample05/cufflinks_gene_expression/transcripts.gtf
Sample06/cufflinks_gene_expression/transcripts.gtf
Sample07/cufflinks_gene_expression/transcripts.gtf
Sample08/cufflinks_gene_expression/transcripts.gtf
Sample09/cufflinks_gene_expression/transcripts.gtf
Doing this I get the output:
[Tue Sep 15 10:43:29 2015] Beginning transcriptome assembly merge
-------------------------------------------
[Tue Sep 15 10:43:29 2015] Preparing output location merged/
[Tue Sep 15 10:43:37 2015] Converting GTF files to SAM
[10:43:37] Loading reference annotation.
[10:43:40] Loading reference annotation.
[10:43:42] Loading reference annotation.
[10:43:44] Loading reference annotation.
[10:43:46] Loading reference annotation.
[10:43:49] Loading reference annotation.
[10:43:51] Loading reference annotation.
[10:43:53] Loading reference annotation.
[10:43:56] Loading reference annotation.
[Tue Sep 15 10:43:58 2015] Quantitating transcripts
You are using Cufflinks v2.2.1, which is the most recent release.
Command line:
cufflinks -o merged/ -F 0.05 -g /path/to/Data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf -q --overhang-tolerance 200 --library-type=transfrags -A 0.0 --min-frags-per-transfrag 0 --no-5-extend -p 20 merged/tmp/mergeSam_fileEJnLxC
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
File merged/tmp/mergeSam_fileEJnLxC doesn't appear to be a valid BAM file, trying SAM...
[10:43:58] Loading reference annotation.
[10:44:01] Inspecting reads and determining fragment length distribution.
Processed 20978 loci.
> Map Properties:
> Normalized Map Mass: 298170.00
> Raw Map Mass: 298170.00
> Fragment Length Distribution: Truncated Gaussian (default)
> Default Mean: 200
> Default Std Dev: 80
[10:44:04] Assembling transcripts and estimating abundances.
Processed 20978 loci.
[Tue Sep 15 10:44:46 2015] Comparing against reference file /path/to/Data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf
You are using Cufflinks v2.2.1, which is the most recent release.
Warning: cannot find genomic sequence file /path/to/Data/Mus_musculus/UCSC/mm10/Sequence/Chromosomes/chr1_GL456221_random{.fa,.fasta}
Followed by a bunch of similar warnings.
My question, is when running cuffmerge should I point (using the -s
option) it to
/path/to/Data/Mus_musculus/UCSC/mm10/Sequence/Chromosomes/
or to
/path/to/Data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/
According to cuffmerge --help
-s/--ref-sequence <seq_dir>/<seq_fasta> Genomic DNA sequences for the reference.
If I point it to .../WholeGenomeFasta
, I get errors
Warning: cannot find genomic sequence file /path/to/Data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/chr1{.fa,.fasta}
Which is why I pointed it to the Chromosomes.
Thanks.
Is it in the required format?
Yes, I believe that it is in the correct format :
yields