I have wheat transcriptome data generated by illumina and I wish to know if this data is reliable or not. Wheat genome size is 17Gb, number of reads in each fastq file is ~32 million. Its a paired end sequencing. The problem is when I identify SNPs using this data, read depth is very less. many have 0,1 or 2 as read depth. Average of all the DP values is around 3 which seems very less. I also mapped the reads on to the wheat genome. These are the mapping statistics generated by STAR.
Number of input reads | 28480666
Average input read length | 200
UNIQUE READS:
Uniquely mapped reads number | 20282536
Uniquely mapped reads % | 71.22%
Average mapped length | 194.59
Number of splices: Total | 998719
Number of splices: Annotated (sjdb) | 604635
Number of splices: GT/AG | 759478
Number of splices: GC/AG | 25924
Number of splices: AT/AC | 3437
Number of splices: Non-canonical | 209880
Mismatch rate per base, % | 0.69%
Deletion rate per base | 0.01%
Deletion average length | 1.72
Insertion rate per base | 0.02%
Insertion average length | 1.92
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 5096669
% of reads mapped to multiple loci | 17.90%
Number of reads mapped to too many loci | 445314
% of reads mapped to too many loci | 1.56%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 9.21%
% of reads unmapped: other | 0.12%
Thanks
I have few suggestions. If you are interrogating transcriptome data via RNA-Seq then it is designed specifically to quantify transcriptomes and not genomes where you will target variants a.k.a SNPs as you are calling them.
You should better look for expression estimates and make a study of gene expression and downstream functional studies.
If you are interested in doing variant analysis then better to go for whole genome or whole exome. Depth in WGS/WES matters for the variant analysis. What have you employed for variant calling with this data, like the tools and the workflow? What parameters?
Any particular reason you want to use RNA-Seq for variant analysis and not WGS/WES?
If it's a budget constraint and you want to find both from RNA-Seq then you have to make a trade off. Your depth is not very high to fish out very reliable SNPs from your data however you can try with GATK variant calling from RNASeq. Pretty much detailed and informative and will be a good start.
Transcript depth and single nucleotide estimates are not the same. So understand that difference. If you read carefully and understand your motivation of what you want to do probably you will have better way to deal with this data.
Thanks for your suggestions
I have given a reply to the below answer, take a look there. Having said if data is not good for identifying DEGs at this stage is pretty strong apriori decision. Yes you need the ploidy status for your data if you have them with you and that can be pretty informative for designing the RNA-Seq analysis for DE analysis.
Just to make sure from the beginning, when you have the counts file from your samples and make a PCA, what do you observe? do they get distributed in space according to your condition or strong by another variability. This gives you a biological inference as well. Alternatively, perform the same with normalized expression data. Pretty amazing workflows are suggested by limma, edgeR and DESeq2.
There is a crude approach to still get it done but not very accurate but gives you a flavor of ploidy status based on read coverage. Check this. Good luck
I've not created counts table yet. Will do PCA afterward and see. Also, thank you for this link, its a really interesting point I learned from this discussion.
What do you mean with reliable?
Per definition, RNA-seq is not well suited for variant calling.
By reliable I want to know the quality of this data or know some way of doing it.
Please reformat your question and the read stats for people to read it well so that they can help you out!