Question

Regarding FastQC and the SRA file format


9.6 years ago

Rahul ▴ 30

Hello,

I downloaded SRA data from NCBI, converted it to FASTQ files (paired-end sequences), and analysed the sequences with FastQC. I got the following results, which I think are OK, but I am still confused. Can anybody shed some light on this? (Per base sequence quality: http://dropcanvas.com/#aG53s1AtEBgLv1)

##FastQC    0.11.2
>>Basic Statistics  pass
#Measure    Value
Filename    PPlf.fastq
File type   Conventional base calls
Encoding    Sanger / Illumina 1.9
Total Sequences 24000000
Sequences flagged as poor quality   0
Sequence length 75
%GC 44

Can I use FASTQ sequences derived from the SRA format directly for assembly and scaffolding? Or will I have to do preprocessing first, such as removal of low-quality reads, trimming of low-quality bases, and adapter removal?

regards

rahul

ncbi_now • 4.5k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 9.6 years ago by Rahul ▴ 30


Many people do assembly. Be aware that even if it works, your results are probably largely trash, since assembly really is more complicated than alignment. If your work goes easily, that is an indication that you are doing something very, very wrong.

And yes, do the filtering, read the papers, and ask questions, preferably by emailing the software authors. Aim for software that is widely used, so you can actually achieve something meaningful.

ADD REPLY • link 9.6 years ago by stolarek.ir ▴ 700


I thank you very much for your valuable suggestions

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.6 years ago by Rahul ▴ 30

Ram · Accepted Answer · 2015-10-18

Yes, you can use the FASTQ files you download using SRA Toolkit (or FASTQ files you get from any source) for a variety of purposes such as assembly or alignment to a reference genome. Many people consider it a good idea to assess the quality of the reads in your file, and FastQC is a very popular tool for doing this. In the NCBI NOW workshop we show you how to use FastQC (first in Galaxy then on the command line).

In your question you showed the Basic Statistics output of FastQC. I find it helpful to see the sequence length (if there are multiple lengths, the shortest and longest are reported) and the GC content. Sequences flagged to be filtered and removed by Illumina's Casava are reported in the "Sequences flagged as poor quality" field, so that value doesn't reflect analysis by a FastQC module and often isn't very informative.
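To make the Basic Statistics fields a little more concrete, here is a minimal Python sketch that computes the read count, shortest and longest read length, and %GC from a FASTQ stream. It assumes plain four-line records (no wrapped sequence lines), and the function name `basic_stats` is my own, not part of FastQC. The %GC here is simply G+C over all bases, so it may differ slightly from the figure FastQC reports.

```python
import io


def basic_stats(handle):
    """Return (read count, min length, max length, %GC) for a 4-line-per-record FASTQ stream."""
    n_reads = 0
    min_len = None
    max_len = 0
    gc_bases = 0
    total_bases = 0
    for line_no, line in enumerate(handle):
        # The sequence is the 2nd line of each 4-line record.
        if line_no % 4 != 1:
            continue
        seq = line.strip().upper()
        n_reads += 1
        length = len(seq)
        min_len = length if min_len is None else min(min_len, length)
        max_len = max(max_len, length)
        gc_bases += seq.count("G") + seq.count("C")
        total_bases += length
    pct_gc = 100.0 * gc_bases / total_bases if total_bases else 0.0
    return n_reads, min_len, max_len, pct_gc


# Tiny made-up example with two reads of different lengths.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCCAT\n+\nIIIIII\n")
print(basic_stats(example))
```

For a real file you would pass an open file handle (for example `open("PPlf.fastq")`) instead of the in-memory example.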

You ask about the plot of quality scores across all bases. The image you link to shows a typical box and whisker plot. (The red line is the median value, the yellow box spans the inter-quartile range (25-75%), the lower and upper whiskers correspond to the 10% and 90% points respectively, and the blue line is the mean quality score.)
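If it helps to see where those numbers come from, the following is a rough Python sketch that computes the same kind of per-position summary from FASTQ quality strings. It assumes Phred+33 encoding (what FastQC reports as "Sanger / Illumina 1.9") and uses a simple nearest-rank percentile, which is my own choice and not necessarily how FastQC calculates its percentiles. The function names are hypothetical.

```python
import math
import statistics


def phred33(qual):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(c) - 33 for c in qual]


def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already sorted, non-empty list (an assumption, not FastQC's method)."""
    rank = max(1, math.ceil(p / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def per_base_summary(quality_strings):
    """Summarise quality scores at each read position across all reads."""
    columns = []  # columns[i] collects the scores seen at position i
    for qual in quality_strings:
        for pos, score in enumerate(phred33(qual)):
            if pos == len(columns):
                columns.append([])
            columns[pos].append(score)
    summary = []
    for scores in columns:
        scores.sort()
        summary.append({
            "mean": statistics.mean(scores),      # blue line
            "median": statistics.median(scores),  # red line
            "q25": percentile(scores, 25),        # bottom of the yellow box
            "q75": percentile(scores, 75),        # top of the yellow box
            "p10": percentile(scores, 10),        # lower whisker
            "p90": percentile(scores, 90),        # upper whisker
        })
    return summary


# Five made-up reads of length 2; position 0 has scores 40, 20, 30, 10, 40.
for pos, row in enumerate(per_base_summary(["I!", "5!", "?!", "+!", "I!"])):
    print(pos + 1, row)
```

Each dictionary corresponds to one column of the plot, so a drop in `median` and `q25` towards the last positions is what the falling boxes at the end of a read look like in numbers.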

In your example the boxes all occur in the "green zone" of excellent quality scores (on the y-axis, range 28-34). The main idea of such plots is that they show the position along each read on the x-axis, and typically the quality drops as one approaches the end of the read. This reflects the inherent limitation of the sequencing chemistry in which calling bases gets more difficult later in reads. For examples of very good and very bad reads see a tutorial here.

If the last position(s) in your reads are of dramatically lower quality then you may wish to trim them. Various programs are available to do this. You can find some in Galaxy in the Tools panel in the section "NGS: QC and manipulation". These are derived from FASTX-Toolkit, a collection of tools for FASTA and FASTQ preprocessing. You can use FASTX-Toolkit on the command line. For de novo assembly (which is not something I'm experienced in) many people remove low quality reads and trim adapter sequences. I suggest you have a look at the FastQC documentation which discusses its variety of useful analysis module outputs including adapter content.
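As an illustration of what trimming low-quality bases from the end of a read means, here is a minimal Python sketch. It is not the algorithm of FASTX-Toolkit or any other specific tool; it just removes bases from the 3' end while their Phred+33 score is below a threshold. The function name and the default threshold of 20 are assumptions chosen for the example.

```python
def trim_3prime(seq, qual, threshold=20, offset=33):
    """Strip trailing bases whose quality score is below `threshold`.

    `seq` and `qual` are the sequence and quality lines of one FASTQ record;
    `offset=33` assumes Sanger / Illumina 1.9 encoding.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


# The last two bases have scores 2 ('#') and 0 ('!'), so they are removed.
print(trim_3prime("ACGTAC", "IIII#!"))
```

In practice you would also discard reads that become too short after trimming, and for paired-end data keep the two files in sync, which is one reason to prefer an established tool over a hand-written script.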