Is this RNA-seq data reliable?
2
0
Entering edit mode
7.2 years ago
ashish ▴ 680

I have wheat transcriptome data generated by illumina and I wish to know if this data is reliable or not. Wheat genome size is 17Gb, number of reads in each fastq file is ~32 million. Its a paired end sequencing. The problem is when I identify SNPs using this data, read depth is very less. many have 0,1 or 2 as read depth. Average of all the DP values is around 3 which seems very less. I also mapped the reads on to the wheat genome. These are the mapping statistics generated by STAR.

Number of input reads |       28480666
              Average input read length |       200
                            UNIQUE READS:
           Uniquely mapped reads number |       20282536
                Uniquely mapped reads % |       71.22%
                  Average mapped length |       194.59
               Number of splices: Total |       998719
    Number of splices: Annotated (sjdb) |       604635
               Number of splices: GT/AG |       759478
               Number of splices: GC/AG |       25924
               Number of splices: AT/AC |       3437
       Number of splices: Non-canonical |       209880
              Mismatch rate per base, % |       0.69%
                 Deletion rate per base |       0.01%
                Deletion average length |       1.72
                Insertion rate per base |       0.02%
               Insertion average length |       1.92
                     MULTI-MAPPING READS:
Number of reads mapped to multiple loci |       5096669
     % of reads mapped to multiple loci |       17.90%
Number of reads mapped to too many loci |       445314
     % of reads mapped to too many loci |       1.56%

  UNMAPPED READS:
   % of reads unmapped: too many mismatches |       0.00%
         % of reads unmapped: too short |       9.21%
                 % of reads unmapped: other |       0.12%

Thanks

rna-seq SNP • 3.5k views
ADD COMMENT
3
Entering edit mode

I have few suggestions. If you are interrogating transcriptome data via RNA-Seq then it is designed specifically to quantify transcriptomes and not genomes where you will target variants a.k.a SNPs as you are calling them.

  • You should better look for expression estimates and make a study of gene expression and downstream functional studies.

  • If you are interested in doing variant analysis then better to go for whole genome or whole exome. Depth in WGS/WES matters for the variant analysis. What have you employed for variant calling with this data, like the tools and the workflow? What parameters?

  • Any particular reason you want to use RNA-Seq for variant analysis and not WGS/WES?

  • If it's a budget constraint and you want to find both from RNA-Seq then you have to make a trade off. Your depth is not very high to fish out very reliable SNPs from your data however you can try with GATK variant calling from RNASeq. Pretty much detailed and informative and will be a good start.

  • Transcript depth and single nucleotide estimates are not the same. So understand that difference. If you read carefully and understand your motivation of what you want to do probably you will have better way to deal with this data.

ADD REPLY
0
Entering edit mode

Thanks for your suggestions

  1. My aim is just to identify some candidate differentially expressed genes which I will use for wet lab experiments after validation by real time PCR.
    1. I did SNP analysis because I did not know any other way to check read depth at every locus. SNP or variant identification is not my motive with this data.
    2. If this data is not good for identifying DEGs, I cannot get it sequenced again. So yes its a budget constraint also. How can I make the most of this data?
    3. I will try the tool you suggested on another data set. Thank you suggesting.
ADD REPLY
0
Entering edit mode

I have given a reply to the below answer, take a look there. Having said if data is not good for identifying DEGs at this stage is pretty strong apriori decision. Yes you need the ploidy status for your data if you have them with you and that can be pretty informative for designing the RNA-Seq analysis for DE analysis.

Just to make sure from the beginning, when you have the counts file from your samples and make a PCA, what do you observe? do they get distributed in space according to your condition or strong by another variability. This gives you a biological inference as well. Alternatively, perform the same with normalized expression data. Pretty amazing workflows are suggested by limma, edgeR and DESeq2.

There is a crude approach to still get it done but not very accurate but gives you a flavor of ploidy status based on read coverage. Check this. Good luck

ADD REPLY
0
Entering edit mode

I've not created counts table yet. Will do PCA afterward and see. Also, thank you for this link, its a really interesting point I learned from this discussion.

ADD REPLY
2
Entering edit mode

What do you mean with reliable?

Per definition, RNA-seq is not well suited for variant calling.

ADD REPLY
0
Entering edit mode

By reliable I want to know the quality of this data or know some way of doing it.

ADD REPLY
0
Entering edit mode

Please reformat your question and the read stats for people to read it well so that they can help you out!

ADD REPLY
4
Entering edit mode
7.2 years ago

I think the main problem is with the polyploid nature of the wheat genome if I am not wrong. So better to cross check with variant caller, how it handles the polyploid genomes. GATK well suited for diploid genomes.

You can do variant calling using RNA-Seq but it depends on what you want to get out of the data. Are you looking at coding variants or something else ?

If you are looking at DEG, you can quantify the genes using featureCounts and do a differential analysis. Reliability of the differential genes depends on how many replicates you have per condition. As you have plans to validate the results by PCR, you can take top most DE genes and do a validation to make sure the data is enough. But given that 17GB genome and around 90K genes, it may not be enough.

I just read from your comments that

I did SNP analysis because I did not know any other way to check read depth at every locus

You have bedtools genomeCov for that.

I would suggest you to first understand the complexity of plant genomes and consider the ploidy and repetitive nature etc to think about analyzing the data.

ADD COMMENT
0
Entering edit mode

This is a very relevant point about the ploidy status that has been suggested.

  • Read depth can be measured by what is suggested.
  • If your goal is DE analysis your coverage of 32M is good enough to start with if you have already enough replicates that support your experimental design.
  • Yes, I agree GATK is very much well suited for diploid genomes.
  • Are you aware of the ploidy status probably this link might shed some more information? Check this biostars link and see that RNA-Seq is not the correct way for estimating the ploidy status based on

coverage.

ADD REPLY
0
Entering edit mode

I got my answer and with 4 replicates I am starting to feel that this data will be good enough.

ADD REPLY
0
Entering edit mode

Thank you geek_y this genomeCov has solved my problem. and just because you mentioned it , wheat is hexaploid with around 114k genes.

ADD REPLY
2
Entering edit mode
7.2 years ago

prior alignment you could also use fastqc : https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ to assess the quality of your reads

ADD COMMENT
0
Entering edit mode

yes, I used that but it wasn't telling me read depth which I initially wanted to know.

ADD REPLY
0
Entering edit mode

Intrinsic to RNA-seq is the very variable read depth - because of variable expression. Read depth is not an important metric in RNA-seq, but the total number of reads (32M) is more relevant.

ADD REPLY
0
Entering edit mode

Is there any criteria to predict how many reads means good quality. Number of reads must be dependent on transcriptome size. Say my aim is not to find any novel transcripts, how many reads do you think are enough for wheat RNA-seq.

ADD REPLY
0
Entering edit mode

Our paper will tell you ;) search RNAonthebecnh NAR paper. Sorry for short promotion ;) . However it also depends fairly on the species. For humans it's 19M paired end data in RNAseq.

ADD REPLY
1
Entering edit mode

I haven't read the paper - but that probably also depends on what you want to quantify and how lowly abundant some transcripts are. I don't remember exactly but another paper stated that "new discoveries" in RNA-seq (particularly low expressed genes) only saturate at ~400M reads. But that's of course nonsense for any typical application.

ADD REPLY
1
Entering edit mode

wow thats a lot, yes that is why I wanted to say, what we did was quantification on human and we wanted to understand read length, coverage, SE vs PE impact on transcript abundance and DE analysis and that was pretty interesting find. There we found 19M as a plateau (PE) good enough for DE analysis at the gene level. If one is interested in novel transcripts and alternative splicing sites, it is not. However, it is also dependent to what extent this will be applicable in the wheat transcriptome. We are aware of rough protein coding genes that we can obtain for humans but is it the case of the wheat as well? Sorry, my knowledge is limited to humans and mice as of now.

Probably the OP needs to check some papers of wheat transcriptome analysis. Also probably the OP might ask the wet lab person and the sequencing facility how they zeroed on this coverage. These will be plausible things to take into account.

ADD REPLY
0
Entering edit mode

I read your paper and also checked it on github. It's a good one and cleared many questions I previously had.

ADD REPLY

Login before adding your answer.

Traffic: 1763 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6