Question

Quality Control On Publicly Available Datasets

11.8 years ago

skm770 ▴ 150

Hi all,

I am in the process of analyzing a number of publicly available RNA-seq and methylation datasets from SRA and GEO, but I am stuck at the first step, quality control.

This is what I did:

Downloaded all the fastq files for a particular experiment and combined them into one file for that experiment.
Ran FastQC on the combined file: the results were pretty bad.

For quality control I have tried two tools, "cutadapt" and "trimmomatic" (for the datasets that were sequenced on the Illumina platform). Running the following cutadapt command does not remove any adaptor sequences from the file:

/u1/tools/public/cutadapt/bin/cutadapt -q 10 -a GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG -b ACACTCTTTCCCTACACGACGCTCTTCCGATCT GSM602252.fastq > GSM602252_trim.fastq

The FastQC results for the trimmed file are the same as for the untrimmed one. Kindly let me know where I am going wrong and what I should do to correct it.

The data set that I am trying to analyse also includes ABI SOLiD RNA-seq data, provided as fastq files. I have never analysed ABI SOLiD data, but from what I have heard and read it normally comes as two files, a .csqual file and a .csfasta file. In SRA, however, it is a single fastq file, and that fastq file contains numbers in place of the base sequences. I would really appreciate it if anybody could provide pointers on what to do in cases like this.
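For reference, the "numbers instead of sequences" are SOLiD colour-space calls: each read is a primer base followed by digits 0-3, with "." for a missing call. A minimal Python sketch for checking which kind of fastq file you have is given below. The function names and the 1000-read sample size are illustrative assumptions, not part of any existing tool.

```python
import re

# Colour-space read: one primer base, then digits 0-3 ('.' marks a missing call).
COLORSPACE_READ = re.compile(r"^[ACGT][0-3.]+$")
# Base-space read: only nucleotide letters (N for an unknown base).
BASESPACE_READ = re.compile(r"^[ACGTN]+$", re.IGNORECASE)


def classify_read(seq):
    """Return 'colorspace', 'basespace', or 'unknown' for one sequence line."""
    seq = seq.strip()
    if COLORSPACE_READ.match(seq):
        return "colorspace"
    if BASESPACE_READ.match(seq):
        return "basespace"
    return "unknown"


def classify_fastq(lines, max_reads=1000):
    """Classify a FASTQ by majority vote over its first max_reads records.

    Assumes the usual 4-line FASTQ layout (header, sequence, '+', qualities).
    """
    counts = {"colorspace": 0, "basespace": 0, "unknown": 0}
    seen = 0
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of each record holds the sequence
            counts[classify_read(line)] += 1
            seen += 1
            if seen >= max_reads:
                break
    if seen == 0:
        return "unknown"
    return max(counts, key=counts.get)
```

To use it on a file, pass an open file handle, for example `classify_fastq(open("GSM602252.fastq"))`. Colour-space files need tools that understand colour space, so it is worth confirming the type before running an Illumina-oriented trimming command on them.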

thanks!!

sra geo rna-seq methylation ngs • 4.3k views

updated 11.5 years ago by Prakki Rama ★ 2.7k • written 11.8 years ago by skm770 ▴ 150

No answers yet? Anybody?

11.8 years ago by skm770 ▴ 150

score 2 · Answer 1 · 2013-11-19

On a quick note, different technologies (Illumina, ABI SOLiD, etc.) do not necessarily use the same adaptor sequences.

A few things to note:

If you mix the fastq files from ABI and Illumina, they are likely to have different quality scoring schemes, etc.
It would be helpful if you could explain what those bad QC results were. Maybe that would shed more light on what's going on here.
Check the ABI adaptor sequences and see whether they match the ones you are using.
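Two of the checks above can be sketched in a few lines of Python: guessing the quality encoding from the quality characters, and measuring how many reads actually contain the adaptor you are trying to trim. This is a rough sketch, not a validated tool. The function names, the 12-base seed length and the ASCII threshold of 59 (the lowest character expected in the older +64 encodings) are assumptions made for illustration.

```python
def guess_phred_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from the lowest quality character.

    Characters below ASCII 59 should not occur in +64 encoded files, so
    seeing one points to +33. If every character is 59 or above the file
    is reported as +64, but a +33 file with only high qualities would
    look the same, so treat that answer as a hint only.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")
    return 33 if min(codes) < 59 else 64


def adapter_hit_fraction(reads, adapter, seed_len=12):
    """Fraction of reads that contain the first seed_len bases of the adapter."""
    seed = adapter[:seed_len].upper()
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(seed in read.upper() for read in reads)
    return hits / len(reads)
```

If `adapter_hit_fraction` comes back close to zero for every adaptor you pass to cutadapt, the trimmed and untrimmed FastQC reports are expected to look the same, and the adaptor sequences themselves are the first thing to re-check.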

hth, -Abhi

score 0 · Answer 2 · 2013-11-20

We had a similar case when dealing with Illumina data. We had to use not only the exact adapter sequence but also its reverse and its reverse complement. Moreover, we observed cases where the adapters occurred at both the beginning and the end of the sequencing read. So we had to pass the '-b' option multiple times when performing the adapter trimming.
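The approach described above can be sketched in Python as follows: generate the forward, reverse and reverse-complement form of each adapter and pass each one to cutadapt with its own `-b` option. The helper names `adapter_variants` and `cutadapt_args` and the file names are hypothetical. The cutadapt options used (`-q`, `-b`, `-o`) are the standard ones, but this is an illustration rather than the exact command we ran.

```python
# Translation table for complementing DNA (upper and lower case, N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def adapter_variants(adapter):
    """Return the adapter, its reverse, and its reverse complement.

    Duplicates (e.g. for palindromic sequences) are dropped, order is kept.
    """
    forward = adapter
    reverse = adapter[::-1]
    reverse_complement = adapter.translate(COMPLEMENT)[::-1]
    return list(dict.fromkeys([forward, reverse, reverse_complement]))


def cutadapt_args(adapters, in_fastq, out_fastq, quality_cutoff=10):
    """Build a cutadapt argument list with one -b option per adapter variant."""
    args = ["cutadapt", "-q", str(quality_cutoff)]
    for adapter in adapters:
        for variant in adapter_variants(adapter):
            args += ["-b", variant]  # -b: adapter may sit at either end of the read
    args += ["-o", out_fastq, in_fastq]
    return args
```

The resulting list can be handed to `subprocess.run(args, check=True)`. Note that cutadapt by default removes at most one adapter per read, so when adapters occur at both ends it may also be necessary to raise the `-n` (`--times`) value.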