Question

Small RNA-seq data preprocessing (clipping the adapter sequences, removing reads with Ns)

0

9.6 years ago

bio_zhangxl ▴ 10

This is my data: http://www.ncbi.nlm.nih.gov/sra?term=SRP003871

I downloaded the .sra files. Is all .sra data raw, meaning that I need to do the preprocessing myself?

How can I work out the adapter sequence if the data submitter does not provide it?

Should I convert the files to .fastq and then clip the adapter sequences and remove reads with Ns? With which software?

The data is small RNA-seq (18-36 bp), but every read is 40 bp long.

clip-adapter • 2.4k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by bio_zhangxl ▴ 10

1

Looks like an assignment. Is it one, bio_zhangxl?

ADD REPLY • link 2.4 years ago by Ram 44k

0

Yes. I do not know how to do it, but I have to get the result today. Can you give me a hand, please?

ADD REPLY • link 9.6 years ago by bio_zhangxl ▴ 10

Ram · Answer 1 · 2015-04-21

1

9.6 years ago

mark.ziemann ★ 1.9k

If you don't know the adapter sequence but you suspect it is there because the reads are uniformly 36bp/50bp long, then follow this recipe (Linux):

Extract the first million reads, cut out the first 20 nt of each, and identify the most abundant 20-mers.

head -n 4000000 sequence.fastq | sed -n 2~4p | cut -c-20 | sort | uniq -c | sort -k1,1nr | head

Take some of these sequences to miRBase and search by sequence. This will tell you the name of the most abundant miR, and you can check the sequence of the canonical mature miR. Then take the sequence of the abundant mature miR (MIRSEQ) and "grep" the first 1000 matching reads out of the fastq file. Be sure to convert the "U" bases to "T".

grep -m1000 MIRSEQ sequence.fastq | sort | uniq -c | sort -k1,1nr | head

This will bring up the most common reads that contain the abundant mature miR 21-mer. The sequence directly after the miR sequence is the adapter sequence. You should then repeat this procedure with the second most abundant miR to confirm the adapter sequence. You should be able to find the adapter in >80% of reads by searching for a 10-20 bp string.

grep -m1000 PREDICTEDADAPTER sequence.fastq
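
The two checks above (what follows the miR, and how often the candidate adapter occurs) can be sketched as small shell functions. This is only a sketch under assumptions: the function names `downstream_of` and `adapter_fraction`, the 12-nt window, and the one-million-read sample are made up for illustration, and the FASTQ is assumed to have exactly four lines per record.

```shell
#!/usr/bin/env bash
# Tally the bases that directly follow a known mature miR sequence.
# Usage: downstream_of MIRSEQ reads.fastq [window]
downstream_of() {
    local mir="$1" fastq="$2" window="${3:-12}"
    # Sequence lines are line 2 of every 4-line FASTQ record.
    awk -v mir="$mir" -v w="$window" '
        NR % 4 == 2 {
            i = index($0, mir)
            if (i > 0) {
                tail = substr($0, i + length(mir), w)
                if (tail != "") print tail
            }
        }' "$fastq" | sort | uniq -c | sort -k1,1nr | head
}

# Fraction of the first N reads that contain a candidate adapter string.
# Usage: adapter_fraction PREDICTEDADAPTER reads.fastq [nreads]
adapter_fraction() {
    local adapter="$1" fastq="$2" nreads="${3:-1000000}"
    head -n "$((nreads * 4))" "$fastq" |
        awk -v a="$adapter" '
            NR % 4 == 2 { total++; if (index($0, a) > 0) hits++ }
            END {
                if (total > 0) printf "%.3f\n", hits / total
                else print "0.000"
            }'
}
```

For example, `downstream_of MIRSEQ sequence.fastq` lists the most common strings found right after the miR, and `adapter_fraction PREDICTEDADAPTER sequence.fastq` prints a value that should be above 0.8 if the adapter guess is right.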

An alternative could be to use BBDuk, although I have not tried it yet.

Once the adapter string is known, provide it as a parameter to tools such as fastx_clipper (FASTX-Toolkit), Trimmomatic, Cutadapt, etc. Generally you only need to provide the first 20 bp of the adapter sequence.
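
To show what clipping and filtering amount to, here is a minimal awk sketch that clips an exact-match 3' adapter, drops reads containing N, and keeps only inserts within a size window. It is not a substitute for the tools above, because it tolerates no mismatches and no partial adapters at the read end. The function name `clip_and_filter` and the 18-36 nt defaults (taken from the insert sizes in the question) are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Exact-match 3' adapter clipping plus N and length filtering.
# Usage: clip_and_filter ADAPTER in.fastq [min_len] [max_len] > out.fastq
clip_and_filter() {
    local adapter="$1" fastq="$2" min="${3:-18}" max="${4:-36}"
    awk -v a="$adapter" -v min="$min" -v max="$max" '
        NR % 4 == 1 { name = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            qual = $0
            i = index(seq, a)
            if (i > 0) {
                # Adapter found: keep only the insert and its qualities.
                seq = substr(seq, 1, i - 1)
                qual = substr(qual, 1, i - 1)
            }
            n = length(seq)
            # Drop reads with any N and reads outside the size window.
            if (seq !~ /N/ && n >= min && n <= max)
                print name "\n" seq "\n" plus "\n" qual
        }' "$fastq"
}
```

Reads in which the adapter is not found stay at full length, so with 40 bp reads and a 36 nt upper limit they are discarded by the length filter. For real data, Cutadapt documents options covering the same steps with mismatch tolerance (`-a` for the 3' adapter, `-m`/`-M` for minimum and maximum length, `--max-n` for reads with Ns); check them against the version you have installed.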

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by mark.ziemann ★ 1.9k

0

I think your way of finding the adapter is efficient. Thank you very much.

After removing the adapter and the reads containing Ns, the remaining reads are 18-40 bp long.

Now I want to map the reads to the reference genome (to find piRNAs, 24-32 bp). I tried this:

bin/pass -d genome.fa -seeds_step 3 -fastq reads.fastq -check_block 500000 -Ns_percent -p 11111111 -sensitivety 3 fle 18 -l -cpu 12 -flc 1 -fid 90 -phred64 -b -sam -seq_gff > result.gff

but it does not work, and I do not know how to adjust the parameters.

The experiment size-selected RNA of 18-36 bp, sequenced it, and then aimed to identify the piRNAs (24-32 bp).

Software: http://pass.cribi.unipd.it/cgi-bin/pass.pl

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by bio_zhangxl ▴ 10

0

I don't know much about the PASS aligner, but there was a recent report suggesting BWA as the best small RNA aligner, which is more or less consistent with what I've found. That paper also provides the recommended parameters for miR alignment using BWA and Bowtie1/2.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by mark.ziemann ★ 1.9k

0

Thank you very much. I will now try aligning with BWA.

ADD REPLY • link 9.6 years ago by bio_zhangxl ▴ 10