How to remove adapter sequences from the RNA-seq FASTQ files of a published paper
9.5 years ago
cfarmeri ▴ 210

Hi, I'm an undergraduate student. Please help me.

I want to produce mapped BAM files from an SRA dataset at NCBI in order to analyze the heterogeneity of mouse ESCs.

The dataset (GSE60749) comes from the paper below.

Roshan M. Kumar et al. Deconstructing transcriptional heterogeneity in pluripotent stem cells. Nature (2014)

Please tell me how to find the adapter sequences used in this experiment

and what I should use for quality control (PRINSEQ? ShortRead?).

For example, I want to try the SRA file for GEO sample GSM1486817.

Can anyone describe the quality-control process for this SRA file?

Thank you for reading this through.

Any help will be appreciated.

RNA-Seq • 11k views
9.5 years ago
iraun 6.2k
  1. Download data
  2. Convert from .sra format to .fastq with the SRA Toolkit: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
  3. Check the quality using the FastQC package: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. The generated report shows whether your sequences contain adapters and, if so, their names.
  4. Remove the adapters and bad quality sequences using:

These are the general steps that you should follow in order to perform a quality control of the raw data.
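
As a rough illustration of steps 1 to 3, here is a minimal Python sketch that only builds the command lines and does not run them. It assumes a recent SRA Toolkit (`prefetch`, `fasterq-dump`) and FastQC are installed, and that the run is paired-end. The accession `SRR0000000` and the `qc` directory are hypothetical placeholders; the real run accession for your sample has to be looked up on its SRA page.

```python
import shlex
from pathlib import Path


def build_qc_commands(run_accession: str, out_dir: str) -> list[list[str]]:
    """Return command lines for download, conversion and QC (not executed)."""
    out = Path(out_dir)
    # fasterq-dump names paired-end outputs <accession>_1.fastq and _2.fastq.
    fastq_1 = out / f"{run_accession}_1.fastq"
    fastq_2 = out / f"{run_accession}_2.fastq"
    return [
        # Step 1: download the run from SRA.
        ["prefetch", run_accession],
        # Step 2: convert .sra to FASTQ, one file per mate.
        ["fasterq-dump", "--split-files", "--outdir", str(out), run_accession],
        # Step 3: write a FastQC report for each FASTQ file
        # (FastQC expects the output directory to exist already).
        ["fastqc", "--outdir", str(out), str(fastq_1), str(fastq_2)],
    ]


if __name__ == "__main__":
    # Placeholder accession; replace it with the real run accession.
    for cmd in build_qc_commands("SRR0000000", "qc"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check them before passing each list to `subprocess.run`.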

Hope it helps.


You can also reproduce all these steps on the Genestack platform:

  1. Import data. Genestack will automatically recognise the file format during import.
  2. Run the Raw Reads QC report app, which is based on the FastQC and PRINSEQ tools.
  3. Run the Trim adaptors and contaminants app. It is based on Fastq-mcf and uses a list of universal adaptors (about 300 sequences).

There are also other preprocess apps that you can use to improve the quality of your data.
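
To show what trimming against a list of known adaptors means, here is a minimal Python sketch. It is not how Fastq-mcf or the Genestack app is implemented: it uses exact matching only, with no mismatches and no quality trimming. The function name and `min_overlap` are made up for illustration, and the example adaptor is, as far as I know, the Nextera transposase read-through sequence; please verify it against Illumina's adapter sequences document before relying on it.

```python
def trim_3prime_adapter(read: str, adapters: list[str], min_overlap: int = 3) -> str:
    """Cut the read at the earliest exact adapter hit (full or partial at the 3' end)."""
    cut = len(read)
    for adapter in adapters:
        # Case 1: the whole adapter occurs somewhere in the read.
        pos = read.find(adapter)
        if pos != -1:
            cut = min(cut, pos)
            continue
        # Case 2: the read ends with a prefix of the adapter (partial read-through).
        longest = min(len(adapter), len(read)) - 1
        for k in range(longest, min_overlap - 1, -1):
            if read.endswith(adapter[:k]):
                cut = min(cut, len(read) - k)
                break
    return read[:cut]


if __name__ == "__main__":
    # Assumed Nextera read-through sequence; verify before use.
    adapters = ["CTGTCTCTTATACACATCT"]
    print(trim_3prime_adapter("ACGTACGTACCTGTCTCTTATACACATCTGGG", adapters))
    print(trim_3prime_adapter("ACGTACGTACCTGTC", adapters))
```

A small `min_overlap` removes short partial adaptors but also trims some reads that end in those bases by chance, which is why real trimmers expose this as a parameter.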


Actually, using FastQC to find adapters is not recommended, for the simple reason that it cannot find adapters it does not know a priori (i.e. adapters that are not in its database). So if GSE60749 uses adapters that FastQC does not know, FastQC will not find them.


Obviously, if those adapters are not known, FastQC will not recognize them. But in general, well-known adapters are used in the majority of experiments. FastQC reports as over-represented every sequence that is repeated more than X times. These sequences could be adapters, contaminants, polyA... If they are adapters and they are present in the FastQC database, you'll get a "tag" indicating it. If a sequence isn't in the database, you'll still get the sequence but without a "tag"; that is, you won't know whether it is an adapter or another type of repeated sequence. Just try it and see what happens.
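
To make the "repeated more than X times" idea concrete, here is a minimal Python sketch that counts identical read sequences in a FASTQ stream and reports those above a fraction of all reads. It is a simplification and not FastQC's code. The function names are made up, and the 0.1% default is what I understand FastQC's threshold to be, so treat it as an assumption.

```python
from collections import Counter
from typing import Iterable, Iterator


def read_sequences(fastq_lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (2nd of every 4 lines) of a FASTQ stream."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip()


def overrepresented(fastq_lines: Iterable[str], min_fraction: float = 0.001):
    """Return (sequence, count) pairs making up more than min_fraction of reads."""
    counts = Counter(read_sequences(fastq_lines))
    total = sum(counts.values())
    return [
        (seq, n)
        for seq, n in counts.most_common()
        if total and n / total > min_fraction
    ]


if __name__ == "__main__":
    # Tiny in-memory FASTQ; with a real file, pass the open file handle instead.
    demo = ["@r1", "ACGT", "+", "IIII",
            "@r2", "ACGT", "+", "IIII",
            "@r3", "TTTT", "+", "IIII"]
    print(overrepresented(demo, min_fraction=0.5))
```

A sequence reported this way is only a candidate: whether it is an adapter, rRNA, or polyA still has to be checked, for example by comparing it with the kit's documented adapters.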


Actually, our experience is that FastQC fails to find adapters in most cases.

Our experience is also that in 99% of cases researchers do not validate the FastQC results and blindly trust what FastQC reports about adapters.


Incidentally, BBMerge is able to find unknown adapters, if the reads are paired:

bbmerge.sh in=reads.fq outadapter=adapters.fa reads=1m
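
For intuition about why paired reads make this possible, here is a minimal Python sketch of the general idea as I understand it: when the insert is shorter than the read length, the two mates overlap completely and whatever follows the overlap must be adapter. This is not BBMerge's actual algorithm; it uses exact matching only, and the function names and `min_insert` are made up for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def adapter_overhang(read1: str, read2: str, min_insert: int = 10):
    """Return (read1 adapter part, read2 adapter part), or None if no read-through.

    If the insert has length L shorter than the reads, then read1[:L] equals
    the reverse complement of read2[:L], and the bases after L are adapter.
    """
    max_len = min(len(read1), len(read2))
    # Try the longest candidate insert first; skip max_len (no overhang at all).
    for insert_len in range(max_len - 1, min_insert - 1, -1):
        if read1[:insert_len] == revcomp(read2[:insert_len]):
            return read1[insert_len:], read2[insert_len:]
    return None


if __name__ == "__main__":
    insert = "ACGTTGCAAGGCTTACGATC"  # hypothetical 20 bp insert
    adapter = "CTGTCTCTTA"            # hypothetical adapter start
    r1 = insert + adapter
    r2 = revcomp(insert) + adapter
    print(adapter_overhang(r1, r2))
```

Collecting these overhangs over many read pairs and taking the consensus would give the adapter sequence without knowing it in advance.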

Indeed, BBMerge is able to find unknown adapters. I use it myself all the time. Another option is FusionCatcher's remove_adapter script.

There are actually more tools than these for finding unknown adapters.


Thank you so much everyone!!

To Evgeniia Golovina

I didn't know about the Genestack platform. It may make my analysis much smoother, since I can skip installing some of the apps for RNA-seq analysis.

To enxxx23

Your advice is very helpful to me. I would like to try another way.

To airan

I got some over-represented sequences from the FastQC report. Some of the same over-represented sequences appear in several samples.

I guess they are adapters, but I can't find the "tag" you mentioned...

To Brian Bushnell

Thanks!! I want to try BBMerge right now.

9.5 years ago
h.mon 35k

You can find the adapters used by reading a bit and googling around. "Nextera XT DNA Sample preparation reagents (Illumina)" were used to prepare the samples (as found here under "study summary", or here under "library").

You can check for adapter contamination with FastQC.

edit: if the downstream analysis is mapping with BWA, Bowtie2 or any other mapper that performs local alignment, adapters should not have a major impact. Besides, the reads he pointed to in his question are 25+25, which should be shorter than the typical Nextera insert size, so adapter contamination should be really low.
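
The read-length argument can be written as simple arithmetic: a read runs into the adapter only when the insert is shorter than the read. Here is a minimal Python sketch; the function names and the insert sizes in the example are made up for illustration and are not measurements from this dataset.

```python
def adapter_bases_per_read(read_length: int, insert_size: int) -> int:
    """Number of adapter bases at the 3' end of a read, 0 if the insert is long enough."""
    return max(0, read_length - insert_size)


def fraction_with_adapter(read_length: int, insert_sizes: list[int]) -> float:
    """Fraction of fragments whose reads contain at least one adapter base."""
    if not insert_sizes:
        return 0.0
    hits = sum(1 for size in insert_sizes if adapter_bases_per_read(read_length, size) > 0)
    return hits / len(insert_sizes)


if __name__ == "__main__":
    # Hypothetical insert sizes in bp, for illustration only.
    inserts = [20, 150, 300, 400]
    print(fraction_with_adapter(25, inserts))   # 25 bp reads
    print(fraction_with_adapter(200, inserts))  # 200 bp reads
```

With 25 bp reads only very short fragments are affected, whereas the same library sequenced with long reads would show far more read-through.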


Thanks!!

Your help gave me a hint!

I could not find the adapter sequences by following the advice, but I will keep searching!!

(I understand that "Nextera XT DNA Sample preparation reagents (Illumina)" were used in this experiment, and I searched the Illumina website for the adapter sequences. However, I still cannot find them...)
