Hi (I'm not a native English speaker, so please forgive any language flaws).
It seems that in most cases, a considerable percentage of the transcripts obtained from de novo transcriptome assembly have no BLAST hit (against the NCBI nr database, for example).
Some of them may be artifacts of sequencing or assembly errors, but many could be novel genes, or even valuable, first-time-discovered protein-coding mRNAs (I guess)!
What is the best strategy for finding out what these hit-less transcripts are?
I have heard that searching for an ORF or CDS (e.g., using the TransDecoder program or the ExPASy website) is one approach, but I am looking for new or better methods for such huge datasets. How should I classify these transcripts, and which of their characteristics matter most from a biological point of view?
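For reference, a typical TransDecoder run on a Trinity assembly looks roughly like this; file names (`Trinity.fasta`, `uniprot_sprot.fasta`) are placeholders, and the blastp step is optional homology support:

```shell
# Extract long ORFs (>= 100 aa by default) from the assembled transcripts
TransDecoder.LongOrfs -t Trinity.fasta

# Optional: support predictions with homology evidence (database path is an assumption)
blastp -query Trinity.fasta.transdecoder_dir/longest_orfs.pep \
       -db uniprot_sprot.fasta -max_target_seqs 1 -outfmt 6 \
       -evalue 1e-5 -num_threads 8 > blastp.outfmt6

# Predict likely coding regions, retaining ORFs that have blastp support
TransDecoder.Predict -t Trinity.fasta \
       --retain_blastp_hits blastp.outfmt6 --single_best_only
```

Transcripts that yield no confident ORF here are candidates for non-coding RNAs or artifacts, which is one way to triage the hit-less set.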
Thank you for sharing your valuable experiences ;)
Have you done the searches already or are you just considering how to investigate a de novo transcriptome?
Is your organism rather novel/exotic, without any known near relatives in sequence databases? What do you consider a considerable percentage of no-hit transcripts? If it turns out to be 90% of your data, then I would worry about misassembly being a dominant factor. Are you planning to do (or already doing) a blastp search?
Hi, I am working with a non-model fish. I have used blastn, blastx, blastp, and maybe tblastn, but there are many transcripts without any hits. Of course they are not 90%, but when you have about 600,000 transcripts, 20-30% hit-less is a considerable amount (I am not using "considerable" in the statistical sense, but in terms of the number of un-annotated transcripts).
Hi Farbod (this is always implied, even if absent :-)), you should think about running cd-hit-est (as I wrote in another thread) to see if the total number of transcripts can be reduced. That 20-30% of hit-less transcripts may turn out to be only ~10% (or less). That would not be surprising. Those may represent sequencing/assembly errors, or they may indeed turn out to be novel genes specific to your organism. Proving they are real will need independent experimental evidence. Determining what they do would be an even harder challenge (requiring functional genomics studies).
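A minimal cd-hit-est invocation for collapsing near-identical Trinity contigs might look like this (the 95% identity threshold is a common but arbitrary choice, and file names are placeholders):

```shell
# Cluster transcripts at 95% identity; -n 10 is the word size recommended
# for thresholds >= 0.90; -T threads, -M memory limit in MB
cd-hit-est -i Trinity.fasta -o Trinity.cdhit95.fasta \
           -c 0.95 -n 10 -T 8 -M 16000
```

The `.clstr` file written alongside the output shows which contigs were merged, which helps judge how much of the assembly was redundant isoform/fragment duplication.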
That said, enough fish are likely sequenced by now that surely there is a (reasonably close) relative genome available in the databases.
Hi, and thank you, my friend. I have done what you suggested before, and the problem is that some of these hit-less genes are among my DEGs!
So you have correctly pointed to the core of my question: "or may indeed turn out to be novel genes specific to your organism"!
How can we establish that? You mentioned "functional genomics studies".
But is there any simpler and less expensive way to find out what those genes are? For me, having an ORF, or a good score from software like the Coding Potential Calculator (http://cpc.cbi.pku.edu.cn/), shows that they are not junk. Am I right?
And the zebrafish genome is the best choice; I have used its databases in my investigations, and its data are present in the nr and SwissProt databases too. But is there any other way I can use the zebrafish genomic data (as a close species) that I do not know about yet? Please tell me!
As @Wouter said, the zebrafish genome may prove to be a good surrogate reference. But check whether NCBI has another fish genome available that is evolutionarily closer.
Bioinformatics is good at generating hypotheses, but they remain just that until they are backed by experimental support. Just because some software says that a sequence has coding potential does not necessarily mean that the potential will pan out in real life.
It is easy for me to say this (and you may have already put in months of work here) but based on the discussions we have had so far it sounds to me like you could use additional work on refining the transcriptome first. You could then go back and redo the DE analysis to see if new targets make better sense.
Functional genomics studies would require knockouts/knockdowns and/or other appropriate experimental techniques to modulate the expression of these new genes and see whether they produce a phenotype in your fish. This will be expensive (in terms of cost and effort) and not directly useful in your quest to devise a PCR test. Ultimately it may have to be done if you wish to patent or otherwise explore the commercial potential of your find.
Hi, the "spotted gar" is more closely related to my species than zebrafish is,
and I could not understand your sentence "it sounds to me like you could use additional work on refining the transcriptome first". Would you please kindly explain that in more detail?
Ensembl has a good bit of data available (along with GTF/GFF files) for spotted gar. You could skip zebrafish and use the gar genome instead. Look at the gene models they have there and compare them to what Trinity came up with.
NCBI's version of spotted gar genome resource is here.
I am not sure if the 600,000-transcript number (I think I am remembering that right from the other thread) was from before or after the cd-hit-type analysis. If before, then there must be some redundancy that you could remove. When you align your original data to the gar/zebrafish genome, that can give you an idea of where the reads are mapping and whether the exons in your fish are similar to the ones known for gar/zebrafish.
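A quick way to spot-check how well your reads map to the gar genome, without committing to the full dataset, is to subsample one library first. This sketch assumes seqtk, HISAT2, and a gar genome FASTA (`LepOcu1.fa`) are available; all file names are placeholders:

```shell
# Subsample ~1M read pairs; the same seed (-s42) keeps R1/R2 in sync
seqtk sample -s42 sample1_R1.fastq.gz 1000000 > sub_R1.fq
seqtk sample -s42 sample1_R2.fastq.gz 1000000 > sub_R2.fq

# Build a HISAT2 index for the spotted gar genome and test the alignment rate
hisat2-build LepOcu1.fa LepOcu1
hisat2 -x LepOcu1 -1 sub_R1.fq -2 sub_R2.fq -p 8 -S /dev/null
# HISAT2 prints the overall alignment rate to stderr when it finishes
```

If the overall alignment rate is high for the subsample, mapping the full dataset to gar (or zebrafish) is worth the compute time.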
genomax2, the valuable link you provided offers 3 choices:
Download sequences in FASTA format for genome, transcript, protein
Download genome annotation in GFF, GenBank or tabular format
BLAST against Lepisosteus oculatus genome, transcript, protein
Which approach is better in your opinion?
1- download the "sequences in FASTA format" for the genome and use BBMap, STAR, HISAT2, etc., the same as what I did with Cufflinks?
or
2- use my de novo Trinity assembly and "BLAST against Lepisosteus oculatus genome, transcript, protein"?
3- anything else you would like to share from your experience?
thanks
Depending on how much time/energy you can afford to spend you could do both 1 and 2.
The only word of caution: get your sequence/annotation from a single source (Ensembl or NCBI) and stick with it throughout the analysis.
Have you tried mapping your reads to the zebrafish genome?
Dear WouterDeCoster, what do you mean by mapping reads to the zebrafish genome? Do you mean, for example, using Bowtie to map my reads to the zebrafish reference genome or its set of chromosomes?
No, I did not try it, as I thought the relevant proteins, nucleotides, and transcription factors would be present in NCBI nr, SwissProt, TrEMBL, or the zebrafish transcription factor database.
If you think this mapping step would be useful in addition to my searches, would you please kindly tell me how to do it (software names or how-to topics) and what I should expect from the results?
Correct. Use the zebrafish genome (and its gene models) as a surrogate reference for the analysis (map/count/DE). Depending on how close your fish is to zebrafish, upwards of 80-90% of your original sequence data may map acceptably (test a sample or two first). Use a splice-aware aligner: BBMap, STAR, HISAT2, etc. Sequence, annotation, and index bundles for zebrafish can be downloaded from iGenomes.
Indeed, things will probably get easier once you can do reference mapping and reference-based read counting. When counting at the gene or exon level, you will also no longer have "trouble" with alternative isoforms.
Dear genomax2 and WouterDeCoster, thank you for all your help.
I mapped another fish to a reference genome long ago, and I am describing my steps here. Would you please spend some time checking whether it is still a correct pipeline?
1- I first ran bowtie-build on the zebrafish genome.
2- Then I used TopHat as follows:
I ran this command 6 times, once for each of my 6 samples (3 females and 3 males as biological replicates), and as I remember, the "--coverage-search" option was very time-consuming. Is it necessary?
3- Then I used Cufflinks (it was the gold standard in those days!):
./cufflinks -p 10 -o cufflinks.J1.dir '/home/Softwares/bowtie-1.1.1-linux-x86_64/bowtie-1.1.1/tophat.F1/accepted_hits.bam'
again for six times and then
4- then I have merged all of them
echo cufflinks.F1.dir/transcripts.gtf >> assemblies.txt
echo cufflinks.F2.dir/transcripts.gtf >> assemblies.txt
echo cufflinks.F3.dir/transcripts.gtf >> assemblies.txt
echo cufflinks.M1.dir/transcripts.gtf >> assemblies.txt
echo cufflinks.M2.dir/transcripts.gtf >> assemblies.txt
echo cufflinks.M3.dir/transcripts.gtf >> assemblies.txt
5- then cuffmerge
./cuffmerge -s '/home//Softwares/bowtie-1.1.1-linux-x86_64/bowtie-1.1.1/zebrafish_Genome_bowtiebuild.fa' -p 10 assemblies.txt
6- then cuffdiff
Is it Ok?
I will appreciate all your suggestions
It seems that the pipeline Bowtie2/HISAT → StringTie → Ballgown is an updated version!
Check a sample (or two) first to see what kind of alignments you get with gar genome (and zebrafish). In theory it should work well.
If you feel the transcriptome of gar (or zebrafish) would be a good substitute then you may also be able to use salmon or kallisto.
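If you go the salmon route against a surrogate transcriptome (gar or zebrafish cDNA FASTA), the commands are short; `transcripts.fa` and the read file names here are placeholders:

```shell
# Build the transcriptome index once
salmon index -t transcripts.fa -i salmon_idx

# Quantify one sample; -l A lets salmon infer the library type automatically
salmon quant -i salmon_idx -l A -1 F1_R1.fq.gz -2 F1_R2.fq.gz \
             -p 8 --validateMappings -o quant_F1
```

Each `quant_*` directory contains a `quant.sf` table of per-transcript abundances, which tximport can aggregate to gene level for DE analysis.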
Dear genomax2 and @WouterDeCoster, Hi
I had another conversation with another expert friend (like you), and he gave me this answer about mapping the reads to a close species as a way to help resolve the BLAST-hit-less transcripts:
"Aligning to a genome might help your assembly but I don't really see how it would give you a better annotation. If it doesn't match a known gene when assembled de novo then how will it map to a genome that it already didn't match?"
Do you have any ideas in this regard? I am longing to hear from you!
I don't think @Wouter or I said anywhere that aligning to a related genome will solve the problem of BLAST-hit-less transcripts (if I did, please link that part of the thread here so I can correct it). Using a known genome (and the correct gene models therein) would help confirm which of the predicted transcripts in your set are reasonably correct. A large number of them are likely artifacts of assembly.
I don't think we have discussed this before, but have you tried to align your original data to the predicted transcriptome? How many of the reads align, and do they align to all 600K transcripts?
Hi, yes, you are right, it was a misunderstanding on my part.
And what do you mean by "predicted transcriptome"?
Do you mean a reference transcriptome of my species of interest?
Because I guess there isn't one for it.
You should align the original reads that went into the Trinity assembly against the transcriptome FASTA that came from Trinity. If the assembly is of reasonable/good quality, then the original reads should align at a high percentage, with most contigs getting hits. If they don't, then you would need to question the validity of the assembled transcripts.
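A common way to run this read-representation check, along the lines of what the Trinity documentation suggests, is Bowtie2 against the assembly; file names are placeholders:

```shell
# Build an index from the Trinity assembly and align the original reads back
bowtie2-build Trinity.fasta Trinity_idx
bowtie2 -x Trinity_idx -1 reads_R1.fq.gz -2 reads_R2.fq.gz \
        -p 8 --no-unal -k 20 2> align_stats.txt \
  | samtools view -b -o reads_vs_trinity.bam -

# The alignment summary (overall rate, concordant pairs) lands in align_stats.txt
cat align_stats.txt
```

As a rough rule of thumb from the Trinity docs, a good assembly should have a large majority (around 80% or more) of read pairs mapping back, mostly as proper pairs.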
Ah, yes, I have done that, and the result was acceptable according to the Trinity guidelines. Thanks!