Question

Using cufflinks : warning cannot find genomic sequence

2

Entering edit mode

9.2 years ago

irritable_phd_syndrome ▴ 130

I am trying to running cuffmerge to use for relative gene expression down the pipeline. I acquired my reference genome from UCSC. I have run tophat on the data from two flow cell reads (taken from an Illumina HiSeq 2500 machine) e.g.

tophat \
  -p 20 \
  -o tophat_out_L001 \
  --library-type fr-unstranded \
  -r 75 \
  --mate-std-dev 60 \
  /path/to/Data/Mus_musculus/UCSC/mm10/Sequence/BowtieIndex/genome \
  /path/to/Data/Raw_Data_fastq/Sample01_L001_R1_001.fastq \
  /path/to/Data/Raw_Data_fastq/Sample01_L001_R2_001.fastq

  tophat \
  -p 20 \
  -o tophat_out_L002 \
  --library-type fr-unstranded \
  -r 75 \
  --mate-std-dev 60 \
  /path/to/Data/Mus_musculus/UCSC/mm10/Sequence/BowtieIndex/genome \
  /path/to/Data/Raw_Data_fastq/Sample01_L002_R1_001.fastq \
  /path/to/Data/Raw_Data_fastq/Sample01_L002_R2_001.fastq

Then I merge the data, e.g.

samtools merge L001_L002/out.bam tophat_out_L001/accepted_hits.bam tophat_out_L002/accepted_hits.bam

I then sort the data, e.g.

samtools sort L001_L002/out.bam L001_L002/out_sorted

I then run cufflinks on it, e.g.

cufflinks \
  -p 20 \
  -o cufflinks_gene_expression \
  --library-type fr-unstranded \
  --GTF /path/to/Data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf \
  --no-update-check \
  L001_L002/out_sorted.bam

This all seems fine and well, until I run cuffmerge. All the data files are stored in Sample0*/cufflinks_gene_expression/*:

cuffmerge \
  -p 20 \
  -o merged \
  -g /path/to/Data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf \
  -s /path/to/Data/Mus_musculus/UCSC/mm10/Sequence/Chromosomes/ \
  assembly_GTF_list.txt

where assembly_GTF_list.txt contains

Sample01/cufflinks_gene_expression/transcripts.gtf
Sample02/cufflinks_gene_expression/transcripts.gtf
Sample03/cufflinks_gene_expression/transcripts.gtf
Sample04/cufflinks_gene_expression/transcripts.gtf
Sample05/cufflinks_gene_expression/transcripts.gtf
Sample06/cufflinks_gene_expression/transcripts.gtf
Sample07/cufflinks_gene_expression/transcripts.gtf
Sample08/cufflinks_gene_expression/transcripts.gtf
Sample09/cufflinks_gene_expression/transcripts.gtf

Doing this I get the output:

[Tue Sep 15 10:43:29 2015] Beginning transcriptome assembly merge
-------------------------------------------

[Tue Sep 15 10:43:29 2015] Preparing output location merged/
[Tue Sep 15 10:43:37 2015] Converting GTF files to SAM
[10:43:37] Loading reference annotation.
[10:43:40] Loading reference annotation.
[10:43:42] Loading reference annotation.
[10:43:44] Loading reference annotation.
[10:43:46] Loading reference annotation.
[10:43:49] Loading reference annotation.
[10:43:51] Loading reference annotation.
[10:43:53] Loading reference annotation.
[10:43:56] Loading reference annotation.
[Tue Sep 15 10:43:58 2015] Quantitating transcripts
You are using Cufflinks v2.2.1, which is the most recent release.
Command line:
cufflinks -o merged/ -F 0.05 -g /path/to/Data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf -q --overhang-tolerance 200 --library-type=transfrags -A 0.0 --min-frags-per-transfrag 0 --no-5-extend -p 20 merged/tmp/mergeSam_fileEJnLxC
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
File merged/tmp/mergeSam_fileEJnLxC doesn't appear to be a valid BAM file, trying SAM...
[10:43:58] Loading reference annotation.
[10:44:01] Inspecting reads and determining fragment length distribution.
Processed 20978 loci.    
> Map Properties:
>   Normalized Map Mass: 298170.00
>   Raw Map Mass: 298170.00
>   Fragment Length Distribution: Truncated Gaussian (default)
>                 Default Mean: 200 
>              Default Std Dev: 80
[10:44:04] Assembling transcripts and estimating abundances.
Processed 20978 loci.    
[Tue Sep 15 10:44:46 2015] Comparing against reference file /path/to/Data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf
You are using Cufflinks v2.2.1, which is the most recent release.
Warning: cannot find genomic sequence file /path/to/Data/Mus_musculus/UCSC/mm10/Sequence/Chromosomes/chr1_GL456221_random{.fa,.fasta}

Followed by a bunch of similar warnings.

My question, is when running cuffmerge should I point (using the -s option) it to

/path/to/Data/Mus_musculus/UCSC/mm10/Sequence/Chromosomes/

or to

/path/to/Data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/

According to cuffmerge --help

-s/--ref-sequence      <seq_dir>/<seq_fasta> Genomic DNA sequences for the reference.

If I point it to .../WholeGenomeFasta, I get errors

Warning: cannot find genomic sequence file /path/to/Data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/chr1{.fa,.fasta}

Which is why I pointed it to the Chromosomes.

Thanks.

RNA-Seq cufflinks • 2.8k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by irritable_phd_syndrome ▴ 130

0

Entering edit mode

Is it in the required format?

Note that <seq_dir> must contain one fasta file per reference chromosome, and each file must be named after the chromosome, and have a .fa or .fasta extension.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.2 years ago by GouthamAtla 12k

0

Entering edit mode

Yes, I believe that it is in the correct format :

ls /path/to/Data/Mus_musculus/UCSC/mm10/Sequence/Chromosomes/

yields

chr10.fa  chr12.fa  chr14.fa  chr16.fa  chr18.fa  chr1.fa  chr3.fa  chr5.fa  chr7.fa  chr9.fa  chrX.fa
chr11.fa  chr13.fa  chr15.fa  chr17.fa  chr19.fa  chr2.fa  chr4.fa  chr6.fa  chr8.fa  chrM.fa  chrY.fa

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.2 years ago by irritable_phd_syndrome ▴ 130

score 0 · Answer 1 · 2015-09-15

0

Entering edit mode

9.2 years ago

h.mon 35k

Your gtf files have references to "chromosomes" for which you do not have fasta files. See here.

ADD COMMENT • link 9.2 years ago by h.mon 35k