Dear Biostars, Hi,
I have RNA-seq data from a fish (3 cond1 and 3 cond2 as biological replicates), and I have done a Trinity de novo assembly and DEG analysis on these data. Now the draft genome of that species has been released. I want to run a genome-guided DEG analysis, too, to compare the results.
With help from @Kevin and other Biostars users, I selected the HISAT2 -> StringTie -> Ballgown pipeline.
As the first step, I indexed my genome:
./hisat2-build -p 6 '/home/salmon-genome-2018/GCF_SSa_v1.0_genomic.fna' ht2_base_salmon_genome
BUT it seems that there are several options/switches I can add to the HISAT2 mapping script.
My first script for one of the replicates (C1) was:
./hisat2 -p 6 -x ht2_base_salmon_genome -1 '/RNA_Seq_Data/C1_clean_left.fq' -2 '/RNA_Seq_Data/C1_clean_right.fq' -S '/RNA_Seq_Data/C1.sam' &> C1.sam.info
and 6 SAM files were created. But then I found this in the StringTie manual:
"be sure to run HISAT2 with the --dta option for alignment, or your results will suffer."
I asked here, and @Vijay Lakhujani also thought that using --dta is a better idea.
Then I used this script and re-ran all 6 mappings:
./hisat2 -p 6 --dta -x ht2_base_salmon_genome -1 '/RNA_Seq_Data/C1_clean_left.fq' -2 '/RNA_Seq_Data/C1_clean_right.fq' -S '/RNA_Seq_Data/C1.sam' &> C1.sam.info
Now, there is another comment/hint in the StringTie manual:
It is highly recommended to use the reference annotation information when mapping the reads, which can be either embedded in the genome index (built with the --ss and --exon options, see HISAT2 manual), or provided separately at run time (using the --known-splicesite-infile option of HISAT2).
Q: What is the standard/preferred HISAT2 command for mapping? What must I do now? Re-run all 6 mappings after adding --ss and --exon to my previous script? How can I find splice-site information for this newly released genome?
~Thanks
@Farbod: The quote you posted above gives you pointers on what to do. You can either rebuild the genome index with the --ss and --exon options and then re-align your data, or provide a splice-site file with --known-splicesite-infile and re-align the data using the current genome index.
Dear @genomax, Hi
I do not have any "file of known splice sites". So in this case, you mean I should build a new genome index using "--ss and --exon" and then map all the reads again using "--dta", yes?
Can we say this is the preferred/standard way of using HISAT2 for a genome-guided analysis?
I don't think so. When you don't provide the file of known splice sites, it will (probably) use a default set of potential splice sites. The reason for the advice to use this kind of option (the same goes for --ss and --exon) is that the mapping can be more specific, since it can then filter out alignments that do not coincide with a known splice site (it might even speed up the alignment step for the same reason).
The consequence is that you will get fewer novel genes (i.e. genes not present in the GFF file) or fewer models with alternative splice sites. It all depends on what your goal is.
Dear @lieven.sterck, hi and thanks.
It seems that your idea is different from @genomax's.
You believe that, as I do not have "the file of known splice sites", I should use the SAM files obtained from my script using "--dta" and proceed to the next step. Correct?
I don't think I differ a lot from what genomax is telling you. If it is in the manual, I would certainly consider (re-)running it with --ss and --exon enabled (for the genome index building).
I merely wanted to point out that if, for some reason, you don't have the required info (known splice sites) or you do not want to rerun the mapping (building a new index), you could proceed with what you have. If you have access to the genome annotation file, I would consider using it. This of course assumes that the gene prediction result is any good. If it is of low quality, then proceed without it.
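If an annotation file is available, the annotation-aware route from the replies above could look roughly like the sketch below. It assumes a GTF file called `annotation.gtf` (a hypothetical name; a GFF3 download would first need converting to GTF) and a new index name `ht2_base_salmon_genome_ss` (also hypothetical). It also assumes that the `hisat2_extract_splice_sites.py` and `hisat2_extract_exons.py` helper scripts shipped with HISAT2 are on the PATH. By default the script only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
# Sketch: build a splice-aware HISAT2 index from an annotation, then align with --dta.
# annotation.gtf and the *_ss index name are assumptions, not taken from the thread.
GENOME=/home/salmon-genome-2018/GCF_SSa_v1.0_genomic.fna
GTF=annotation.gtf
INDEX=ht2_base_salmon_genome_ss
READS_DIR=/RNA_Seq_Data
THREADS=6
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Like run, but redirects the command's stdout to the file given as first argument.
run_to() {
  local out=$1; shift
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$* > $out"
  else
    "$@" > "$out"
  fi
}

# Extract known splice sites and exons, then embed them in a new index.
build_annotated_index() {
  run_to splicesites.txt hisat2_extract_splice_sites.py "$GTF"
  run_to exons.txt hisat2_extract_exons.py "$GTF"
  run hisat2-build -p "$THREADS" --ss splicesites.txt --exon exons.txt "$GENOME" "$INDEX"
}

# Align one paired-end sample; --dta comes before -x so that -x gets the index name.
align_sample() {
  local s=$1
  run hisat2 -p "$THREADS" --dta -x "$INDEX" \
    -1 "$READS_DIR/${s}_clean_left.fq" -2 "$READS_DIR/${s}_clean_right.fq" \
    -S "$READS_DIR/${s}.sam"
}

build_annotated_index
align_sample C1
```

To keep the existing index instead, the `splicesites.txt` file could be passed at alignment time with `--known-splicesite-infile splicesites.txt`. Building an index with --ss and --exon generally needs considerably more memory than building a plain one, so that may be the more practical option on a small machine.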
Thank you,
How can I tell whether there is any known splice-site information for this "whole genome shotgun sequence"?
Its structure is:
chromosome 1
chromosome 2
.
.
chromosome 33
chromosome 34
AND many "unplaced genomic scaffold" entries!
Since you don't have a file with known transcript models you don't have the splice sites file.
You can use stringtie to create new transcript models. Without known transcripts there could be a large number of false positives that you would need to deal with. Since you have trinity assembled transcripts, you could compare those with the stringtie generated ones and see if you can reconcile them into a usable dataset.
Hi @genomax, you mean that with this genome, which is not well annotated, the genome-guided approach is not so valuable, correct?
Of course, they have run some RNA-seq in their genome sequencing project, too (would you please have a look?).
By "since you have trinity assembled transcripts", do you mean I can use this so-called genome-guided approach, check for DEGs (and probable alternative splicing) that overlap between the two methods, and consider those as true DEGs?
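For the annotation-free route discussed above (letting StringTie build its own transcript models from the --dta alignments), the remaining steps before Ballgown could be sketched as follows. Only C1 is named in this thread, so the other sample names (C2, C3, T1, T2, T3) are hypothetical, as are the output file names. The sketch assumes samtools and stringtie are on the PATH and that the SAM files sit in the working directory. By default it only prints the commands; set DRY_RUN=0 to run them.

```shell
#!/usr/bin/env bash
# Sketch: sort alignments, assemble per sample, merge, then re-estimate for Ballgown.
# Sample names other than C1 are hypothetical placeholders.
SAMPLES=(C1 C2 C3 T1 T2 T3)
THREADS=6
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

assemble_all() {
  local s gtfs=()
  for s in "${SAMPLES[@]}"; do
    # StringTie needs coordinate-sorted BAM input.
    run samtools sort -@ "$THREADS" -o "$s.bam" "$s.sam"
    # De novo (no -G) assembly of transcript models for this sample.
    run stringtie "$s.bam" -p "$THREADS" -l "$s" -o "$s.gtf"
    gtfs+=("$s.gtf")
  done
  # Merge the per-sample models into one non-redundant set.
  run stringtie --merge -p "$THREADS" -o merged.gtf "${gtfs[@]}"
  for s in "${SAMPLES[@]}"; do
    # -e restricts estimation to merged.gtf; -B writes Ballgown table files.
    run stringtie "$s.bam" -e -B -p "$THREADS" -G merged.gtf -o "ballgown/$s/$s.gtf"
  done
}

assemble_all
```

The resulting `merged.gtf` is the set of StringTie models that would be compared against the Trinity transcripts, and the `ballgown/` directories are what Ballgown reads for the DEG step.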