Hi there,
I am new to bioinformatics. I am trying to prepare fasta.gz files for upload to CPSS, a web server for miRNA-seq datasets. My data is from the Gene Expression Omnibus (GEO) database. The sample FASTA file looks like this:
>SRR1658346.1 HISEQ1:187:D0NWFACXX:3:1101:2565:2050 length=51
ATCATACAAGGACAATTTCTTTTAACGTCGTATGCCGTCTTCTGCTTGNAA
>SRR1658346.2 HISEQ1:187:D0NWFACXX:3:1101:2654:2232 length=51
TCGAGGAGCTCACAGTCTAGTATAACGTCGTATGCCGTCTTCTGCTTGAAA
>SRR1658346.3 HISEQ1:187:D0NWFACXX:3:1101:2870:2103 length=51
TTCAAGTAATCCAGGATAGGCTAACGTCGTATGCCGTCTTCTGCTTGAAAA
>SRR1658346.4 HISEQ1:187:D0NWFACXX:3:1101:3001:2147 length=51
TAGCACCATCCGAAATCAGTTTAACGTCGTATGCCGGCTTCTGCTTGAAAA
And my clean file should be like this (an example from CPSS):
>t0000001_823508
TGAGGTAGTAGATTGTATAGTT
>t0000002_757054
TGAGGTAGTAGGTTGTATAGTT
>t0000003_252586
ACAGTAGTCTGCACATTGGTT
With my limited knowledge, I can guess that the reads contain adaptors in addition to the miRNA sequence, which is typically about 21 nt long. But I am not sure how to trim them, as the terminal sequences vary in composition.
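For illustration, here is a minimal Python sketch of the two steps the target format implies: cut each read at the 3' adaptor, then collapse identical inserts into `>t<rank>_<count>` records. Everything named here is an assumption rather than a CPSS requirement: `ADAPTER_SEED` is simply a stretch shared by the four sample reads above (the true adaptor and its exact start should be confirmed from the library-prep protocol), the 18-30 nt length window is a guess, and the header layout (7-digit rank, underscore, read count) is inferred from the CPSS example. Matching is exact, so reads with a sequencing error inside the seed are dropped.

```python
from collections import Counter

# Assumption: a seed taken from the stretch shared by the four sample reads;
# confirm the real 3' adaptor from the library-prep protocol before use.
ADAPTER_SEED = "TAACGTCGTATGCC"


def read_fasta(lines):
    """Yield sequences from FASTA lines; '>' and ';' lines are discarded."""
    seq = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith((">", ";")):
            if seq:
                yield "".join(seq)
                seq = []
        else:
            seq.append(line.upper())
    if seq:
        yield "".join(seq)


def trim_adapter(seq, seed=ADAPTER_SEED):
    """Return the insert before the first exact seed match, or None if absent."""
    pos = seq.find(seed)
    return seq[:pos] if pos != -1 else None


def collapse(reads, seed=ADAPTER_SEED, min_len=18, max_len=30):
    """Trim, length-filter, and count identical inserts, most abundant first."""
    counts = Counter()
    for read in reads:
        insert = trim_adapter(read, seed)
        if insert is None or "N" in insert:
            continue
        if min_len <= len(insert) <= max_len:
            counts[insert] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def to_collapsed_fasta(collapsed):
    """Format as '>t<rank>_<count>' records (assumed CPSS naming)."""
    out = []
    for rank, (seq, count) in enumerate(collapsed, start=1):
        out.append(f">t{rank:07d}_{count}")
        out.append(seq)
    return "\n".join(out) + "\n"
```

Gzipped input and output can be handled with the standard `gzip` module, e.g. `gzip.open(path, "rt")` for reading. A dedicated trimmer that tolerates mismatches in the adaptor is likely a better choice for real data; this sketch only shows the shape of the transformation.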
(edited) I am trying to re-analyse an miRNA dataset to discover some desirable miRNAs which are not reported in the relevant publication. Here's a link to the webtool.
The question needs some clarification:
I have replied to your queries in the main post. And no, I have not checked miRDeep2 yet.
Hi Michael,
I went with Galaxy for now rather than a proper miRDeep2 installation. The installation file is pretty large, and temporary internet issues are making the download take a very long time.
I ran QC on Galaxy and, to my surprise, it could not detect any adaptor. I am not sure what can be done. I am writing another post; any pointers will be appreciated.
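As a rough cross-check when a QC tool reports no adaptor, one can count which k-mers occur in the largest number of reads; a stretch present in nearly every read is a likely adaptor candidate. The sketch below is only an assumption about how such a check might look: the function name `most_shared_kmers` and the default `k=12` are hypothetical, and it is not what Galaxy or any QC tool does internally.

```python
from collections import Counter


def most_shared_kmers(reads, k=12, top=3):
    """Return the k-mers found in the most reads, each read counted once."""
    counts = Counter()
    for read in reads:
        # A set ensures a k-mer repeated within one read is counted once.
        counts.update({read[i:i + k] for i in range(len(read) - k + 1)})
    return counts.most_common(top)
```

Run on a few thousand reads from the file, a k-mer reported in almost all of them would point to the adaptor region and could then be passed to a trimming tool.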
sRNAtoolbox is also an option.