I downloaded the full Homo_sapiens_Ensembl_GRCh37.tar.gz file from iGenomes (a huge file, 17 GB, but it contains everything else I've needed for the Tuxedo suite, from genomes to Bowtie indexes) to use with my RNA-Seq pipeline. I aligned the reads with TopHat, then ran Cufflinks to assemble transcripts and estimate expression values. I am now trying to run cuffmerge on the results as described, with the following command in Python:
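import os
# genes: path to the reference annotation GTF; refsequence_folder: folder with the chromosome .fa files
command = "cuffmerge -p 8 -o merged -g %s -s %s assembly_GTF_list.txt" % (genes, refsequence_folder)
os.system(command)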
Here "refsequence_folder" points to the "Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes" folder, which contains the following .fa files:
10.fa 11.fa 12.fa 13.fa 14.fa 15.fa 16.fa 17.fa 18.fa 19.fa 1.fa 20.fa 21.fa 22.fa 2.fa 3.fa 4.fa 5.fa 6.fa 7.fa 8.fa 9.fa MT.fa X.fa Y.fa
My problem is that cuffmerge works well until it suddenly starts looking for .fa files that are not in this folder. Here is an excerpt from the warnings I get:
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/GL000191.1{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/GL000192.1{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/GL000193.1{.fa,.fasta}
[...]
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HG1007_PATCH{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HG1032_PATCH{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HG104_HG975_PATCH{.fa,.fasta}
[...]
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HSCHR10_1_CTG2{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HSCHR10_1_CTG5{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HSCHR12_1_CTG1{.fa,.fasta}
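The names in these warnings (GL000191.1, HG1007_PATCH, HSCHR10_1_CTG2, ...) look like sequence names that are presumably taken from the first column of one of the GTF inputs. As a rough diagnostic sketch, under that assumption, the following lists every sequence name in a GTF that has no matching .fa or .fasta file in the folder. The function names are made up for illustration and are not part of cuffmerge.

```python
import os

def fasta_names(folder):
    """Return sequence names that have a <name>.fa or <name>.fasta file in folder."""
    names = set()
    for filename in os.listdir(folder):
        base, ext = os.path.splitext(filename)
        if ext in (".fa", ".fasta"):
            names.add(base)
    return names

def gtf_seqnames(gtf_path):
    """Return the set of values found in column 1 (seqname) of a GTF file."""
    names = set()
    with open(gtf_path) as handle:
        for line in handle:
            # Skip comment/header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names

def missing_fasta(gtf_path, folder):
    """Sorted sequence names used in the GTF that have no FASTA file in folder."""
    return sorted(gtf_seqnames(gtf_path) - fasta_names(folder))
```

With the variables from the command above, print(missing_fasta(genes, refsequence_folder)) would show whether the reference GTF is where the extra names come from. The same check can be run on each GTF listed in assembly_GTF_list.txt.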
My question: where do I find all these missing files, if even iGenomes does not provide them? Alternatively, how do I get cuffmerge to stop looking for them?
I can add that I also attempted this by concatenating all the .fa files into a single hg19.fa file, and then providing cuffmerge with that file instead of the full folder. It wasn't quite that easy to fool cuffmerge :)
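One possible workaround, sketched here only as an idea and not tested against cuffmerge itself, would be to filter the GTF so that it only contains records on sequences for which a FASTA file exists in the folder. This assumes the missing names come from column 1 of the GTF, and it silently discards any annotation on patches, haplotype contigs and unplaced scaffolds, which may or may not be acceptable. The function name is hypothetical.

```python
import os

def filter_gtf_to_available(gtf_in, gtf_out, folder):
    """Copy gtf_in to gtf_out, keeping only records whose seqname has a FASTA file.

    Returns a (kept, dropped) tuple with the number of records in each group.
    """
    # Sequence names for which <name>.fa or <name>.fasta exists in folder.
    available = {
        os.path.splitext(name)[0]
        for name in os.listdir(folder)
        if name.endswith((".fa", ".fasta"))
    }
    kept = dropped = 0
    with open(gtf_in) as src, open(gtf_out, "w") as dst:
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            if line.startswith("#"):
                dst.write(line)  # keep header/comment lines unchanged
                continue
            if line.split("\t", 1)[0] in available:
                dst.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

The filtered file could then be passed to -g in place of the original, for example after calling filter_gtf_to_available(genes, "genes.filtered.gtf", refsequence_folder), where "genes.filtered.gtf" is just a placeholder output name.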
Which reference files did you use for the TopHat alignment?
I used the Bowtie2 index files for Ensembl from http://cufflinks.cbcb.umd.edu/igenomes.html...