Regarding FastQC and the SRA file format
9.1 years ago
Rahul ▴ 30

Hello,

I downloaded SRA data from NCBI and converted it to FASTQ files (paired-end sequences), then analysed the sequences with FastQC. I got the following results, which I think are OK, but I am still confused. Can anybody shed some light on this? (Per base sequence quality: http://dropcanvas.com/#aG53s1AtEBgLv1)

##FastQC    0.11.2
>>Basic Statistics  pass
#Measure    Value
Filename    PPlf.fastq
File type   Conventional base calls
Encoding    Sanger / Illumina 1.9
Total Sequences 24000000
Sequences flagged as poor quality   0
Sequence length 75
%GC 44

Can I use FASTQ sequences derived from SRA format directly for assembly and scaffolding? Or will I have to do preprocessing first, such as removal of low-quality reads, trimming of low-quality bases, and adapter removal?

regards

rahul

ncbi_now • 4.1k views

There are many people doing assembly. Be aware that even if it works, your results are probably largely trash, since it really is more complicated than alignment. If your work goes easily, that is an indication that you are doing something very, very wrong.

And yes, do the filtering, read the papers, and ask questions, preferably by emailing the software authors. Aim for software that is widely used, so you can actually achieve something meaningful.


I thank you very much for your valuable suggestions

9.1 years ago
pevsner ▴ 420

Yes, you can use the FASTQ files you download using SRA Toolkit (or FASTQ files you get from any source) for a variety of purposes such as assembly or alignment to a reference genome. Many people consider it a good idea to assess the quality of the reads in your file, and FastQC is a very popular tool for doing this. In the NCBI NOW workshop we show you how to use FastQC (first in Galaxy then on the command line).

In your question you showed the basic statistics output of FastQC. I find it helpful to see the sequence length (or, if there are multiple lengths, the shortest and longest are reported) and the GC content. Sequences flagged to be filtered and removed by Illumina's Casava are reported in the "Sequences flagged as poor quality" field, so that value doesn't reflect analysis by a FastQC module and often isn't too informative.
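To make the basic statistics concrete, here is a minimal sketch of how the total sequence count, length range, and %GC could be computed straight from a FASTQ file. This is not FastQC's own code: the function names `read_fastq` and `basic_stats` are made up for illustration, the parser assumes plain four-line records, and FastQC's exact rounding and handling of N bases may differ.

```python
import io


def read_fastq(handle):
    """Yield (sequence, quality_string) pairs from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            break
        if not header.strip():  # skip stray blank lines
            continue
        seq = handle.readline().strip()
        handle.readline()       # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def basic_stats(handle):
    """Count reads, track shortest/longest read, and compute overall %GC."""
    total = bases = gc = max_len = 0
    min_len = None
    for seq, _qual in read_fastq(handle):
        total += 1
        n = len(seq)
        bases += n
        gc += sum(1 for base in seq.upper() if base in "GC")
        min_len = n if min_len is None else min(min_len, n)
        max_len = max(max_len, n)
    return {
        "total_sequences": total,
        "min_length": min_len or 0,
        "max_length": max_len,
        "percent_gc": round(100 * gc / bases) if bases else 0,
    }


# Tiny in-memory example (two made-up reads) instead of a real file.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCCAA\n+\nIIIIII\n")
stats = basic_stats(demo)
print(stats)
```

For a real file you would pass `open("PPlf.fastq")` instead of the in-memory example.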

You ask about the plot of quality scores across all bases. The image you link to shows a typical box and whisker plot. (The red line is the median value, the yellow box spans the inter-quartile range (25-75%), the lower and upper whiskers correspond to the 10% and 90% points, and the blue line is the mean quality score.)
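If it helps to see what goes into each box, here is a rough sketch that computes the same five summary values per read position from Phred+33 quality strings (the "Sanger / Illumina 1.9" encoding in your report). The helper names are hypothetical, and the simple nearest-rank percentile used here is an assumption, so the numbers may not match FastQC's exactly.

```python
import math
from statistics import mean


def phred33(quality_string):
    """Decode a Phred+33 (Sanger / Illumina 1.8+) quality string to scores."""
    return [ord(char) - 33 for char in quality_string]


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(0, math.ceil(p / 100 * len(sorted_values)) - 1)
    return sorted_values[rank]


def per_position_summary(quality_strings):
    """Return one dict of box-plot values for each position along the read."""
    columns = []
    for quality_string in quality_strings:
        for i, score in enumerate(phred33(quality_string)):
            if i == len(columns):
                columns.append([])
            columns[i].append(score)
    summary = []
    for column in columns:
        column.sort()
        summary.append({
            "p10": percentile(column, 10),     # lower whisker
            "q1": percentile(column, 25),      # bottom of the box
            "median": percentile(column, 50),  # red line
            "q3": percentile(column, 75),      # top of the box
            "p90": percentile(column, 90),     # upper whisker
            "mean": mean(column),              # blue line
        })
    return summary


# Three made-up reads: good quality at positions 1-2, worse at position 3.
summary = per_position_summary(["II5", "II+", "II?"])
for position, values in enumerate(summary, start=1):
    print(position, values)
```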

In your example the boxes all occur in the "green zone" of excellent quality scores (on the y-axis, range 28-34). The main idea of such plots is that they show the position along each read on the x-axis, and typically the quality drops as one approaches the end of the read. This reflects the inherent limitation of the sequencing chemistry in which calling bases gets more difficult later in reads. For examples of very good and very bad reads see a tutorial here.

If the last position(s) in your reads are of dramatically lower quality then you may wish to trim them. Various programs are available to do this. You can find some in Galaxy in the Tools panel in the section "NGS: QC and manipulation". These are derived from FASTX-Toolkit, a collection of tools for FASTA and FASTQ preprocessing. You can use FASTX-Toolkit on the command line. For de novo assembly (which is not something I'm experienced in) many people remove low quality reads and trim adapter sequences. I suggest you have a look at the FastQC documentation which discusses its variety of useful analysis module outputs including adapter content.
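As a rough illustration of what 3'-end trimming does, here is a minimal sketch. It simply removes bases from the end of the read until it reaches one at or above a quality threshold, then drops reads that end up too short. This is not the algorithm used by FASTX-Toolkit or any other specific trimmer, the function names are made up, and the thresholds (20 and 50) are arbitrary example values you would need to choose for your own data.

```python
def trim_3prime(seq, qual, min_quality=20, offset=33):
    """Trim trailing bases whose Phred score is below min_quality."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_quality:
        end -= 1
    return seq[:end], qual[:end]


def filter_reads(records, min_quality=20, min_length=50):
    """Trim each (seq, qual) pair and keep only reads still long enough."""
    for seq, qual in records:
        trimmed_seq, trimmed_qual = trim_3prime(seq, qual, min_quality)
        if len(trimmed_seq) >= min_length:
            yield trimmed_seq, trimmed_qual


# Made-up read whose last two bases have very low quality ('#' = 2, '!' = 0).
print(trim_3prime("ACGTAC", "IIII#!"))
```

Note that with paired-end data, dropping a read from one file without dropping its mate from the other will break the pairing, so in practice a pair-aware tool is the safer choice.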


Dear sir,

Thanks for showing interest in my post. I am grateful that you took the time to write this answer and guide me. I have trimmed my sequences to between 63 and 75 bp and run FastQC on them. I think I am through with the trimming process, but I am still confused about adapter trimming. According to the FastQC report, the adapter content seems OK. Any help on this problem would be greatly appreciated.

FastQC report: http://dropcanvas.com/#Swo108wp3Y633h

Sincerely
Rahul
