Question

How do I perform technical analysis of sequencing reads and biological quality analysis for an RNASeq dataset?

9.2 years ago

nalandaatmi ▴ 110

Hi all,

I have recently started working on RNASeq analysis. I need to do the following two analyses before running the TopHat pipeline for RNASeq. I have already performed the demultiplexing step and generated the fastq files from the HiSeq base calls.

Could you explain why these analyses are important to do first, and how to proceed?

A. Technical analysis of the sequencing reads: I have to perform a genome-wide alignment using the RNA-seq data sets of lanes 1 to 6, and report the following technical metrics for the reads:

1. The reads duplication analysis;

2. The contamination analysis of the Illumina adaptor sequences;

3. The GC content analysis.
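To make the three metrics in A concrete, here is a minimal Python sketch that computes them directly from a FASTQ file. It is a simplification and not what FastQC does internally: duplicates are counted as exact sequence matches held in memory, and the adapter check only looks for one sequence. `ADAPTER_PREFIX` is an assumption (the start of the common Illumina TruSeq adapter), and `read_sequences` and `technical_summary` are hypothetical helper names.

```python
import gzip
from collections import Counter

# Assumption: start of the common Illumina TruSeq adapter. Replace it with the
# adapter sequence that matches your library prep kit.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def read_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of every record is the sequence
                yield line.strip().upper()


def technical_summary(sequences, adapter=ADAPTER_PREFIX):
    """Return GC %, exact-duplicate % and adapter-containing % for the reads."""
    counts = Counter()
    gc = bases = with_adapter = 0
    for seq in sequences:
        counts[seq] += 1
        gc += seq.count("G") + seq.count("C")
        bases += len(seq)
        with_adapter += adapter in seq  # True counts as 1
    total = sum(counts.values())
    if total == 0 or bases == 0:
        raise ValueError("no reads found")
    return {
        "gc_percent": 100.0 * gc / bases,
        # reads beyond the first copy of each distinct sequence
        "duplicate_percent": 100.0 * (total - len(counts)) / total,
        "adapter_percent": 100.0 * with_adapter / total,
    }
```

For one of the files in this thread the call would look like `technical_summary(read_sequences("WES01_AGTCCA_L001_R1_001.fastq"))`. Keep in mind that highly expressed genes naturally produce many identical reads in RNASeq, so a high duplication percentage is not by itself a sign of a bad library.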

B. Biological quality analysis: using the mapping results above, I also need to report the following for the data sets:

1. The percentage of the sequencing reads derived from the rRNA genes;

2. The percentage of the sequencing reads derived from the globin gene;

3. Because this is a strand specific RNA-seq, I have to include the sense and antisense information for the corresponding genes.
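The three items in B come down to tallying mapped reads by gene category and by orientation. The Python sketch below shows that bookkeeping only, assuming each read has already been assigned to a gene by some other tool. `classify_read` and `percent_by_category` are hypothetical helper names, and the `protocol` argument is an assumption about the library prep: use "reverse" for dUTP-style kits, where read 1 is the reverse complement of the transcript, and "forward" when read 1 has the transcript's orientation.

```python
from collections import Counter


def classify_read(read_strand, gene_strand, is_read2=False, protocol="reverse"):
    """Return 'sense' or 'antisense' for one aligned read relative to its gene.

    read_strand and gene_strand are '+' or '-'.
    """
    for strand in (read_strand, gene_strand):
        if strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")
    if protocol not in ("forward", "reverse"):
        raise ValueError("protocol must be 'forward' or 'reverse'")

    same = read_strand == gene_strand
    # In a paired-end library, read 2 lies on the opposite strand to read 1.
    if is_read2:
        same = not same
    # In a reverse-stranded library, read 1 is antisense to the transcript,
    # so the relationship flips once more.
    if protocol == "reverse":
        same = not same
    return "sense" if same else "antisense"


def percent_by_category(labels):
    """Turn per-read labels (e.g. 'rRNA', 'globin', 'other') into percentages."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}
```

Getting the protocol wrong swaps sense and antisense for every gene, so it is worth confirming the kit's strandedness before trusting the numbers.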

RNA-Seq next-gen alignment • 3.1k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by nalandaatmi ▴ 110

Why can't the person tasking you also explain the rationale behind these orders?

ADD REPLY • link 9.2 years ago by Ido Tamir 5.2k

The person tasking you with these really shouldn't. Aside from (A), which can be done entirely with FastQC, there are often nuances with how things should be implemented and you would need to be quite comfortable with RNAseq data before dealing with this.

Also, use a different sequencing facility next time. Needing to demultiplex things yourself is absolutely absurd.

ADD REPLY • link 9.2 years ago by Devon Ryan 104k

We sometimes do our own demultiplexing because we use barcode setups that the core doesn't like, especially in development of new in-line barcoding products. They'll set up a new demultiplexing for us but we don't ask until the thing is done.

ADD REPLY • link 9.2 years ago by Michele Busby ★ 2.2k

Sure, but it doesn't sound like nalandaatmi is working on a new method.

ADD REPLY • link 9.2 years ago by Devon Ryan 104k

Dear Devon,

Can you explain the nuances with regard to RNAseq, or point me to some links where I can read about them?

Why do you say that having to demultiplex is absurd?

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by nalandaatmi ▴ 110

Making end users demultiplex standard data is absurd because that's a lot of extra work to get things set up when the sequencing facility could just do it as part of a standard pipeline. I've used a number of core facilities and companies over the years and have never needed to demultiplex things as a customer (I do now, but I'm not the customer any more :) ).

Regarding RNAseq, that's a long discussion. You'd be well advised to work together with someone locally the first time you do a new type of analysis like this (at least until you get a fair bit of experience under your belt).

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Devon Ryan 104k

Devon, I am learning NGS in a sequencing facility. The person in charge of sequencing gave me the files that come directly from the HiSeq machine. I am interested in learning from the very first step of handling NGS reads. That is why I mentioned that I performed the demultiplexing step and generated fastq files from the base call files. I am trying to understand all the steps involved before downstream analysis.

Since you mentioned that you now do NGS demultiplexing yourself, I would like to ask you this. I converted the base call files to fastq files with the bcltofastq program. The generated fastq files follow a naming convention like this:

WES01_AGTCCA_L001_R1_001.fastq, WES01_AGTCCA_L001_R1_002.fastq, WES01_AGTCCA_L001_R1_003.fastq, ..., WES01_AGTCCA_L001_R1_010.fastq
WES01_AGTCCA_L001_R2_001.fastq, WES01_AGTCCA_L001_R2_002.fastq, WES01_AGTCCA_L001_R2_003.fastq, ..., WES01_AGTCCA_L001_R2_010.fastq

WES01 is the sample name, AGTCCA is the barcode (index), L001 is lane 1, R1 is the forward reads and R2 is the reverse reads. What do 001, 002, 003 to 010 after R1 and R2 mean?

ADD REPLY • link updated 2.3 years ago by Ram 44k • written 9.2 years ago by nalandaatmi ▴ 110

@nalandaatmi in my experience, the ' 001.fastq, 002.fastq, 003.fastq, ... ' you are referring to usually means that the fastq file was split into smaller parts. So if you merged the files together end-to-end, you would get all the reads.
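Merging the parts end-to-end can be sketched in Python as below. It assumes the chunks follow the `<prefix>_L001_R1_001.fastq` pattern shown in the question; `group_chunks` and `merge_chunks` are hypothetical helper names, and the merged output name (the prefix without the chunk number) is my own choice.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Matches e.g. WES01_AGTCCA_L001_R1_003.fastq (optionally with .gz)
CHUNK_NAME = re.compile(
    r"^(?P<prefix>.+_L\d{3}_R[12])_(?P<chunk>\d{3})\.fastq(?P<gz>\.gz)?$"
)


def group_chunks(filenames):
    """Map each merged output name to its chunk files, sorted by chunk number."""
    groups = defaultdict(list)
    for name in filenames:
        match = CHUNK_NAME.match(Path(name).name)
        if match:
            out_name = match["prefix"] + ".fastq" + (match["gz"] or "")
            groups[out_name].append((int(match["chunk"]), name))
    return {out: [n for _, n in sorted(parts)] for out, parts in groups.items()}


def merge_chunks(directory):
    """Concatenate the chunks found in `directory` into one file per lane/read."""
    directory = Path(directory)
    plan = group_chunks(str(p) for p in directory.iterdir())
    for out_name, parts in plan.items():
        with open(directory / out_name, "wb") as out:
            for part in parts:
                with open(part, "rb") as src:
                    shutil.copyfileobj(src, out)  # byte-for-byte append
    return plan
```

The R1 and R2 chunks need to be merged in the same order so that mates stay on matching lines, which is why the parts are sorted by chunk number. Byte-level concatenation also works for gzipped chunks, since concatenated gzip members form a valid gzip file.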

ADD REPLY • link 8.7 years ago by steve ★ 3.5k

Thanks Tamir and Ryan for your suggestions.

Dear Pierre,

Thanks for your explanation. Yes, I have run the FastQC analysis, and for section A I am looking at these sections of the fastqc.html report:

GC content
Sequence duplication levels
Overrepresented sequences

Section B: still working on it.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by nalandaatmi ▴ 110

Hi All,

For section B, I am planning to take a list of ribosomal RNA genes and align my sample reads against them using bowtie2. I assume the overall alignment rate that bowtie2 reports will be the percentage of reads matching ribosomal RNA genes. Am I correct?
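bowtie2 prints its alignment summary to stderr, ending with a line of the form "NN.NN% overall alignment rate". If the index contains only rRNA sequences, that figure is a reasonable approximation of the rRNA percentage. A minimal Python sketch for pulling the number out of a saved summary follows; `overall_alignment_rate` is a hypothetical helper name, and it assumes the summary was captured to text (for example by redirecting stderr to a file).

```python
import re

# The last line of the bowtie2 summary looks like "12.34% overall alignment rate".
RATE_LINE = re.compile(r"([\d.]+)% overall alignment rate")


def overall_alignment_rate(summary_text):
    """Return the overall alignment rate (in percent) from a bowtie2 summary."""
    match = RATE_LINE.search(summary_text)
    if match is None:
        raise ValueError("no 'overall alignment rate' line found")
    return float(match.group(1))
```

Reads that also match other parts of the genome are still counted here, so treat the result as an estimate of rRNA content, not an exact value.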

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by nalandaatmi ▴ 110

Open a new question. This is not a discussion forum but a Q&A site: 1 Question + Answers, not 1 Question + Comments + Answers + more Questions.

ADD REPLY • link 9.2 years ago by Ido Tamir 5.2k

Apologies, I thought I was still following up on section B of my original question. From now on I will post a separate question. Thanks for letting me know.

ADD REPLY • link 9.2 years ago by nalandaatmi ▴ 110

score 0 · Answer 1 · 2015-09-16

This is a whole analysis pipeline that you need. For "A", FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) can help. For "B", you will need some combination of bash/python/perl scripting to get what you need.

Why is "A" important: you need to check the quality of your sequences before deriving any biological hypothesis from them.

Why is "B" important: I guess the person "knows" what to expect and needs some "controls". Why not.