Small RNA-seq data preprocessing (clipping the adapter sequences, removing reads with Ns)
9.6 years ago
bio_zhangxl ▴ 10

This is my data: http://www.ncbi.nlm.nih.gov/sra?term=SRP003871

I downloaded the .sra data. Is all .sra data raw data (so that I need to do the preprocessing myself)?

How can I work out the adapter sequence if the data submitter does not provide it?

Should I convert the data to .fastq and then clip the adapter sequences and remove reads with Ns? With which software?

The data is small RNA-seq (18-36 bp inserts), and all reads are 40 bp long.

clip-adapter • 2.4k views

Looks like an assignment. Is it one, bio_zhangxl?

Yes. I do not know how to do it, but I have to get the result today. Can you give me a hand, please?

9.6 years ago
mark.ziemann ★ 1.9k

If you don't know the adapter sequence but you suspect it is there because the reads are uniformly 36bp/50bp long, then follow this recipe (Linux):

  • Extract the first million sequence reads, then cut out the first 20 nt and then identify the most abundant 20 mers.

    head -n 4000000 sequence.fastq | sed -n 2~4p | cut -c-20 | sort | uniq -c | sort -k1gr | head

  • Take some of these sequences to miRBase and search by sequence. This will tell you the name of the most abundant miR, and you can check the sequence of the canonical mature miR. Then take the sequence of the abundant mature miR (MIRSEQ) and "grep" the first 1000 matching lines out of the fastq file. Be sure to convert the "U" bases to "T".

    grep -m1000 MIRSEQ sequence.fastq | sort | uniq -c | sort -k1gr | head

  • This will bring up the most common sequences that contain the abundant mature miR 21mer. The sequence directly after the miR sequence is the adapter sequence. You should then repeat this procedure with the 2nd most abundant miR to confirm the adapter sequence. You should be able to identify the adapter in >80% of the sequences by searching for a 10-20 bp string.

    grep -m1000 PREDICTEDADAPTER sequence.fastq
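
The last two steps of the recipe can be wrapped into small shell functions. This is only a sketch: the function names `downstream_of` and `adapter_fraction` are made up for illustration, and both assume plain 4-line FASTQ records on standard input.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; both read 4-line FASTQ records on stdin.

# Tally the bases that directly follow a known mature miR sequence.
# $1 = miR sequence as DNA (U converted to T)
# $2 = number of downstream bases to report (default 12)
downstream_of() {
    local mir=$1 n=${2:-12}
    awk 'NR % 4 == 2' |                                  # sequence lines only
        grep -o "${mir}.*" |                             # miR plus everything after it
        cut -c "$(( ${#mir} + 1 ))-$(( ${#mir} + n ))" | # keep only the downstream bases
        awk 'length($0) > 0' |                           # skip reads that end at the miR
        sort | uniq -c | sort -k1,1nr | head
}

# Print the percentage of reads that contain the candidate adapter.
# $1 = candidate adapter (or its first 10-20 bases)
adapter_fraction() {
    local adapter=$1
    awk -v a="$adapter" 'NR % 4 == 2 { n++; if (index($0, a)) hit++ }
        END { if (n) printf "%.1f\n", 100 * hit / n }'
}
```

For example, `head -n 4000000 sequence.fastq | downstream_of MIRSEQ 12` lists the most common 12-mers following the miR, and `head -n 4000000 sequence.fastq | adapter_fraction PREDICTEDADAPTER` reports how many of the first million reads contain the candidate. A value well below 80% suggests the candidate is wrong or too long.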

An alternative could be to use BBDuk, although I have not tried it yet.

Once the adapter string is known, provide it as a parameter to tools such as FastxClipper, Trimmomatic, Cutadapt, etc. Generally, you only need to provide the first 20 bp of the adapter sequence.
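
After clipping, the question also asks about removing reads with Ns. That filter, together with a length window, can be written in plain awk. The sketch below assumes 4-line FASTQ records on standard input; `filter_reads` is a made-up name, and the 18-36 default window is taken from the insert sizes mentioned in the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep reads without N whose length is within [min, max].
# Reads 4-line FASTQ on stdin, writes the surviving records to stdout.
# $1 = minimum length (default 18), $2 = maximum length (default 36)
filter_reads() {
    awk -v min="${1:-18}" -v max="${2:-36}" '
        NR % 4 == 1 { header = $0 }   # @read name
        NR % 4 == 2 { seq = $0 }      # bases
        NR % 4 == 3 { plus = $0 }     # separator line
        NR % 4 == 0 {                 # qualities: record is complete, decide now
            if (seq !~ /N/ && length(seq) >= min && length(seq) <= max)
                print header "\n" seq "\n" plus "\n" $0
        }'
}
```

Usage would look like `filter_reads 18 36 < clipped.fastq > filtered.fastq`. If you clip with Cutadapt, its `-m`, `-M` and `--max-n` options should cover the same filters in the clipping step itself.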

Your way of finding the adapter is efficient, thank you very much.

After removing the adapter and the reads with Ns, the remaining reads are 18-40 bp long.

Now I want to map the reads to the reference genome (to find piRNAs, 24-32 bp). I tried this:

bin/pass -d genome.fa -seeds_step 3 -fastq reads.fastq -check_block 500000 -Ns_percent -p 11111111 -sensitivety 3 fle 18 -l -cpu 12 -flc 1 -fid 90 -phred64 -b -sam -seq_gff > result.gff

But it does not work, and I do not know how to adjust the parameters.

The experiment was: extract RNAs of 18-36 bp, sequence them, then identify the piRNAs (24-32 bp).

Software: http://pass.cribi.unipd.it/cgi-bin/pass.pl

I don't know much about the pass aligner, but there was a recent report suggesting BWA as the best small RNA aligner, which is more or less consistent with what I've found. That paper also provides the recommended parameters for miR alignment using BWA and Bowtie1/2.

Thank you very much. I will now try aligning with BWA.
