Hello everyone,
I have some RNA-seq data (Illumina, single-end) to analyze, provided by my boss, and I have a problem with the mapping using RNA-STAR.
So far I have been able to run FastQC, do some trimming and generate my genome index, but I am stuck at the mapping step. I would like to get the number of reads for each gene (and then use DESeq2 in R), but no file with this information is created after running my script, and my Log.out file does not really help me this time. The following files are created: Log.final.out, Log.out and Log.progress, but no BAM/SAM file.
When I look at the log, it says that the mapping is done, but nothing else...
I have to admit that I am a bit lost, and if someone could help me that would be great. I guess that the problem comes from the generation of the genome (a wrong annotation file?) or from one of my parameters during the mapping.
Thank you and have a nice day! :)
This is how I generated my genome (all the files come from Ensembl, genome GRCm38):
#===== BUILD GENOME INDEX
cd $pathINDEX
cp MOUSE_index_GRCm38.fa $SNIC_TMP #This one contains the merged fasta sequences of the chromosomes.
cd $pathGTF
cp Mus_musculus.GRCm38.83.chr.gtf.gz $SNIC_TMP #The file containing the annotations
cd $SNIC_TMP
gzip -d Mus_musculus.GRCm38.83.chr.gtf.gz
STAR --runMode genomeGenerate --genomeDir $pathFINAL --runThreadN 5 --sjdbGTFfile Mus_musculus.GRCm38.83.ch
echo "########## END ##########"
And the mapping:
#===MAPPING
gzip -d Sca1_CD51_ko_131230_CACCTC_L003_R1_001.fastq.gz
STAR -- runThreadN 5 -- genomeDir $pathINDEX -- readFilesIn ./ MYFILE.fastq
echo "########## END ##########"
What is written in Log.out :
Finished loading and checking parameters
Reading genome generation parameters:
versionGenome 20201 ~RE-DEFINED
genomeFastaFiles MOUSE_index_GRCm38.fa ~RE-DEFINED
genomeSAindexNbases 14 ~RE-DEFINED
genomeChrBinNbits 18 ~RE-DEFINED
genomeSAsparseD 1 ~RE-DEFINED
sjdbOverhang 100 ~RE-DEFINED
sjdbFileChrStartEnd - ~RE-DEFINED
sjdbGTFfile Mus_musculus.GRCm38.83.chr.gtf ~RE-DEFINED
sjdbGTFchrPrefix - ~RE-DEFINED
sjdbGTFfeatureExon exon ~RE-DEFINED
sjdbGTFtagExonParentTranscript transcript_id ~RE-DEFINED
sjdbGTFtagExonParentGene gene_id ~RE-DEFINED
sjdbInsertSave Basic ~RE-DEFINED
Genome version is compatible with current STAR version
Number of real (reference) chromosmes= 21
1 10 130694993 0
2 11 122082543 130809856
3 12 120129022 252968960
4 13 120421639 373293056
5 14 124902244 493879296
6 15 104043685 618921984
7 16 98207768 722993152
8 17 94987271 821297152
9 18 90702639 916455424
10 19 61431566 1007419392
11 1 195471971 1069023232
12 2 182113224 1264582656
13 3 160039680 1446772736
14 4 156508116 1606942720
15 5 151834684 1763704832
16 6 149736546 1915748352
17 7 145441459 2065694720
18 8 129401213 2211184640
19 9 124595110 2340683776
20 X 171031299 2465464320
21 Y 91744705 2636644352
--sjdbOverhang = 100 taken from the generated genome
Started loading the genome: Fri Feb 19 18:37:29 2016
checking Genome sizefile size: 2779951453 bytes; state: good=1 eof=0 fail=0 bad=0
checking SA sizefile size: 22265276075 bytes; state: good=1 eof=0 fail=0 bad=0
checking /SAindex sizefile size: 1565873619 bytes; state: good=1 eof=0 fail=0 bad=0
Read from SAindex: genomeSAindexNbases=14 nSAi=357913940
nGenome=2779951453; nSAbyte=22265276075
GstrandBit=32 SA number of indices=5397642684
Shared memory is not used for genomes. Allocated a private copy of the genome.
Genome file size: 2779951453 bytes; state: good=1 eof=0 fail=0 bad=0
Loading Genome ... done! state: good=1 eof=0 fail=0 bad=0; loaded 2779951453 bytes
SA file size: 22265276075 bytes; state: good=1 eof=0 fail=0 bad=0
Loading SA ... done! state: good=1 eof=0 fail=0 bad=0; loaded 22265276075 bytes
You should not put a space between the "--" and the parameter name. What is written in the Log.final.out?
Hello, and first of all thank you for your answer :)
I will do as you recommend: remove the spaces after the "--" and try featureCounts.
This is what I've got in my Log.final.out. It seems that the mapping does not work, which is why I am wondering whether my genome was properly generated (wrong GTF file?).
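To judge quickly whether the mapping worked, the headline figures can be pulled out of Log.final.out. A minimal sketch, assuming the file sits in the current directory and uses STAR's usual field labels; star_summary is a hypothetical helper name:

```shell
# Print the input read count, the unique mapping rate and the
# unmapped-read percentages from a STAR Log.final.out file.
star_summary() {
    grep -E 'Number of input reads|Uniquely mapped reads %|% of reads unmapped' "$1"
}

# Assumed location: the directory STAR was run in.
log_file="Log.final.out"
if [ -f "$log_file" ]; then
    star_summary "$log_file"
fi
```

A very low "Uniquely mapped reads %" together with a high "% of reads unmapped" would point at the reads or the index rather than at the annotation.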
Try that for the mapping:
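A minimal sketch of a corrected call, assuming the same index directory variable ($pathINDEX) and placeholder file name as in the question. The run_star helper is a hypothetical wrapper, added only so the command can be printed before it is launched:

```shell
# Placeholder values (assumptions): point these at your own index and reads.
genome_dir="${pathINDEX:-/path/to/star_index}"
reads="MYFILE.fastq"

run_star() {
    # "$@" is the launcher: "echo STAR" prints the command, "STAR" runs it.
    # Every option is a single token, "--name", with no space after the dashes.
    "$@" --runThreadN 5 --genomeDir "$genome_dir" --readFilesIn "$reads"
}

# Dry run: print the command so it can be checked first.
run_star echo STAR

# Real run, only if STAR is installed and the reads file is present.
if command -v STAR >/dev/null 2>&1 && [ -f "$reads" ]; then
    run_star STAR
fi
```

With default settings STAR writes the alignments to Aligned.out.sam in the working directory.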
And to get a count table, you should run featureCounts on the SAM/BAM file generated by STAR.
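A minimal sketch of the counting step, assuming the GTF from the question, STAR's default SAM output name and an unstranded library. The output file name and the count_reads helper are hypothetical:

```shell
# Placeholder inputs (assumptions): the GTF used for the index and STAR's SAM output.
gtf="Mus_musculus.GRCm38.83.chr.gtf"
sam="Aligned.out.sam"
counts="gene_counts.txt"   # hypothetical output name

count_reads() {
    # -T threads, -a annotation file, -o output table,
    # -t feature type to count over, -g attribute that groups features into genes.
    "$@" -T 5 -a "$gtf" -o "$counts" -t exon -g gene_id "$sam"
}

# Dry run: print the command so it can be checked first.
count_reads echo featureCounts

# Real run, only if featureCounts is installed and both inputs exist.
if command -v featureCounts >/dev/null 2>&1 && [ -f "$sam" ] && [ -f "$gtf" ]; then
    count_reads featureCounts
fi
```

If the library is strand-specific, the -s option would need to be set accordingly. The resulting table has one row per gene, with the counts in the last column, which is what DESeq2 needs as input.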
Hello,
It worked, now I have an out.sam file created! Thanks again. Now I will use featureCounts on my SAM file, as you said.
However, may I ask one last thing? I just realized that the pipeline I am using does not perform any assembly of the reads. My reads are short (~50) and are mapped directly to the genome. Is that a problem? For our analyses, we are just planning to get the read counts and use DESeq2.
Best regards,
Here is my log file:
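With RNA-seq you are not looking to assemble reads into contigs as you would if you were doing genomic sequencing. For differential expression against an annotated reference such as GRCm38, mapping the reads directly to the genome and counting them per gene is the usual approach.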