Hello all,
I have a directory of CSV files (tab-delimited). Each file has four sections, each with its own header:
miRDeep2 score novel miRNAs reported by miRDeep2 novel miRNAs, estimated false positives novel miRNAs, estimated true positives known miRNAs in species known miRNAs in data known miRNAs detected by miRDeep2 estimated signal-to-noise excision gearing
novel miRNAs predicted by miRDeep2
provisional id miRDeep2 score estimated probability that the miRNA candidate is a true positive rfam alert total read count mature read count loop read count star read count significant randfold p-value miRBase miRNA example miRBase miRNA with the same seed UCSC browser NCBI blastn consensus mature sequence consensus star sequence consensus precursor sequence precursor coordinate
mature miRBase miRNAs detected by miRDeep2
tag id miRDeep2 score estimated probability that the miRNA is a true positive rfam alert total read count mature read count loop read count star read count significant randfold p-value mature miRBase miRNA example miRBase miRNA with the same seed UCSC browser NCBI blastn consensus mature sequence consensus star sequence consensus precursor sequence precursor coordinate
#miRBase miRNAs not detected by miRDeep2
miRBase precursor id total read count mature read count(s) star read count remaining reads UCSC browser NCBI blastn miRBase mature sequence(s) miRBase star sequence(s) miRBase precursor sequence
Each section has its own data rows. I am interested in two of the sections (novel miRNAs predicted by miRDeep2 and mature miRBase miRNAs detected by miRDeep2).
First, for the mature miRBase miRNAs detected by miRDeep2 section, I would like to generate a file that counts in how many files each value of the mature miRBase miRNA column appears, together with the names of those files:
mature miRBase miRNA Freq Filenames
hsa-miR-486-5p 154 file1,file2,file3...
hsa-miR-93-5p 135 file1,file4,file5...
hsa-let-7i-5p 210 file2,file4,file5...
Note: if the same mature miRBase miRNA is repeated within a file, it should count only once.
Second, I would like the same thing for the novel miRNAs predicted by miRDeep2 section, using its provisional id column.
An example of the data:
File1:
miRDeep2 score novel miRNAs reported by miRDeep2 novel miRNAs, estimated false positives novel miRNAs, estimated true positives known miRNAs in species known miRNAs in data known miRNAs detected by miRDeep2 estimated signal-to-noise excision gearing
10 1 0 +/- 1 1 +/- 0 (66 +/- 48%) 2656 683 85 (12%) 20.8 1
9 1 0 +/- 1 1 +/- 0 (66 +/- 48%) 2656 683 87 (13%) 21 1
8 2 0 +/- 1 2 +/- 1 (79 +/- 31%) 2656 683 87 (13%) 21.2 1
novel miRNAs predicted by miRDeep2
provisional id miRDeep2 score estimated probability that the miRNA candidate is a true positive rfam alert total read count mature read count loop read count star read count significant randfold p-value miRBase miRNA example miRBase miRNA with the same seed UCSC browser NCBI blastn consensus mature sequence consensus star sequence consensus precursor sequence precursor coordinate
22_4891 40.2 66 +/- 48% - 75 72 0 3 yes - - - - cacugcaaccucugccuccggu ugagaggcagagguugcagugg ugagaggcagagguugcaguggcacgaucucaggucacugcaaccucugccuccggu 22:20012651..20012708:-
2_775 8.6 79 +/- 31% - 14 10 0 4 yes - - - - gaagacagucgaacuugacu uuuagugaggcccucggaucagc uuuagugaggcccucggaucagcccgcugggucagcccacugcccuggcggaacgcugagaagacagucgaacuugacu 2:133011980..133012059:-
22_4800 4.8 78 +/- 26% - 7 4 0 3 yes - - - - cccuccucuccuguggccacaga ugguccaacgacaggaguagg ugguccaacgacaggaguaggcuuguauuuaaaagcggccccuccucuccuguggccacaga 22:20052702..20052764:+
mature miRBase miRNAs detected by miRDeep2
tag id miRDeep2 score estimated probability that the miRNA is a true positive rfam alert total read count mature read count loop read count star read count significant randfold p-value mature miRBase miRNA example miRBase miRNA with the same seed UCSC browser NCBI blastn consensus mature sequence consensus star sequence consensus precursor sequence precursor coordinate
8_2073 4167777.2 66 +/- 48% - 8174911 8174878 0 33 yes hsa-miR-486-5p - - - uccuguacugagcugccccgag cggggcagcucaguacaggau uccuguacugagcugccccgaggcccuucaugcugcccagcucggggcagcucaguacaggau 8:41517960..41518023:-
8_1990 4167744.1 66 +/- 48% - 8174846 8174816 0 30 yes hsa-miR-486-5p - - - uccuguacugagcugccccgag cggggcagcucaguacaggau uccuguacugagcugccccgagcugggcagcaugaagggccucggggcagcucaguacaggau 8:41517961..41518024:+
12_2896 11840.9 66 +/- 48% - 23222 23209 0 13 yes hsa-let-7i-5p - - - ugagguaguaguuugugcuguu cugcgcaagcuacugccuug ugagguaguaguuugugcuguuggucggguugugacauugcccgcuguggagauaacugcgcaagcuacugccuug 12:62997470..62997546:+
#miRBase miRNAs not detected by miRDeep2
miRBase precursor id total read count mature read count(s) star read count remaining reads UCSC browser NCBI blastn miRBase mature sequence(s) miRBase star sequence(s) miRBase precursor sequence
hsa-mir-451a 3801 3800 0 1 - - aaaccguuaccauuacugaguu - cuugggaauggcaaggaaaccguuaccauuacugaguuuaguaaugguaaugguucucuugcuauacccaga
hsa-mir-941-5 914 914 0 0 - - cacccggcugugugcacaugugc - ugugcacaugugcccagggcccgggacagcgccacggaagaggacgcacccggcugugugcacaugugccca
hsa-mir-574 239 239 0 0 - - cacgcucaugcacacacccaca ugagugugugugugugagugugu - gggaccugcgugggugcgggcgugugagugugugugugugagugugugucgcuccggguccacgcucaugcacacacccacacgcccacacucagg
File2:
miRDeep2 score novel miRNAs reported by miRDeep2 novel miRNAs, estimated false positives novel miRNAs, estimated true positives known miRNAs in species known miRNAs in data known miRNAs detected by miRDeep2 estimated signal-to-noise excision gearing
10 0 0 +/- 0 0 +/- 0 (0 +/- 0%) 2656 686 89 (13%) 25.5 1
9 0 0 +/- 0 0 +/- 0 (0 +/- 0%) 2656 686 89 (13%) 25.3 1
8 2 0 +/- 1 2 +/- 1 (78 +/- 32%) 2656 686 91 (13%) 25.8 1
novel miRNAs predicted by miRDeep2
provisional id miRDeep2 score estimated probability that the miRNA candidate is a true positive rfam alert total read count mature read count loop read count star read count significant randfold p-value miRBase miRNA example miRBase miRNA with the same seed UCSC browser NCBI blastn consensus mature sequence
15_3320 8.9 78 +/- 32% - 22 21 0 1 yes - - - - agguagauagaacaggucuugu
18_4183 8.8 78 +/- 32% - 16 15 0 1 yes - - - - cuucgaaagcggcuucggcu
9_2330 4.5 77 +/- 26% - 13 12 0 1 no - - - - ccagcccuguuccccaccccgc
mature miRBase miRNAs detected by miRDeep2
tag id miRDeep2 score estimated probability that the miRNA is a true positive rfam alert total read count mature read count loop read count star read count significant randfold p-value mature miRBase miRNA example miRBase miRNA with the same seed UCSC browser NCBI blastn consensus mature sequence
8_1970 3713910.6 0 +/- 0% - 7284671 7284476 0 195 yes hsa-miR-486-5p - - - uccuguacugagcugccccgag
8_2047 3712258.7 0 +/- 0% - 7281431 7281218 2 211 yes hsa-miR-486-5p - - - uccuguacugagcugccccgag
13_3063 196960.2 0 +/- 0% - 386326 386325 0 1 yes hsa-miR-92a-1-5p - - - uauugcacuugucccggccugu
#miRBase miRNAs not detected by miRDeep2
miRBase precursor id total read count mature read count(s) star read count remaining reads UCSC browser NCBI blastn miRBase mature sequence(s) miRBase star sequence(s) miRBase precursor sequence
hsa-mir-451a 4971 4971 0 0 - - aaaccguuaccauuacugaguu - cuugggaauggcaaggaaaccguuaccauuacugaguuuaguaaugguaaugguucucuugcuauacccaga
hsa-mir-941-5 464 464 0 0 - - cacccggcugugugcacaugugc - ugugcacaugugcccagggcccgggacagcgccacggaagaggacgcacccggcugugugcacaugugccca
hsa-mir-1260b 70 70 0 0 - - aucccaccacugccaccau - ucuccguuuaucccaccacugccaccauuauugcuacuguucagcaggugcugcugguggugauggugauagucuggugggggcggugg
Any help is more than welcome! Thanks!
I'd use a Perl one-liner to extract the chunk of interest (starting at the constant header line and ending at a blank line). In R, I'd apply it to each file through the cmd option of data.table::fread, then combine the per-file results with data.table::rbindlist, and finally build a matrix of counts with data.table::dcast.
Thanks, but I'm not familiar with Perl. Could you help me with the code? Thanks.
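If Perl is the obstacle, the same idea (pull out one section per file, then count in how many files each value occurs) can be sketched in plain Python. This is only a minimal sketch under some assumptions: the files really are tab-delimited, each section is a title line followed by a header line, and a section ends at a blank line or at the next single-field title line. The function names and the output file names are made up for illustration, and a "-" in the column is treated as missing.

```python
import csv
import glob
import os
from collections import defaultdict

# (section title, column to count, output file); output names are hypothetical.
TARGETS = [
    ("novel miRNAs predicted by miRDeep2", "provisional id", "novel_freq.tsv"),
    ("mature miRBase miRNAs detected by miRDeep2", "mature miRBase miRNA", "mature_freq.tsv"),
]


def section_values(lines, title, column):
    """Return the set of distinct values of `column` inside the section `title`."""
    values = set()
    in_section = False
    col = None  # index of `column`, known once the header line has been read
    for raw in lines:
        line = raw.rstrip("\r\n")
        fields = [f.strip() for f in line.split("\t")]
        if not in_section:
            in_section = line.strip() == title
            continue
        if col is None:
            if not line.strip():
                continue  # tolerate blank lines between title and header
            col = fields.index(column)  # raises ValueError if the column is absent
            continue
        # Assumption: a blank line or a single-field line starts the next section.
        if not line.strip() or len(fields) == 1:
            break
        if col < len(fields) and fields[col] not in ("", "-"):
            values.add(fields[col])
    return values


def count_across_files(paths, title, column):
    """Map each value to the list of file names whose section contains it."""
    files_by_value = defaultdict(list)
    for path in sorted(paths):
        with open(path) as handle:
            # A set per file, so repeats within one file count only once.
            for value in section_values(handle, title, column):
                files_by_value[value].append(os.path.basename(path))
    return files_by_value


def write_table(files_by_value, column, out_path):
    """Write a tab-delimited table: value, number of files, comma-joined file names."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow([column, "Freq", "Filenames"])
        for value, files in sorted(files_by_value.items()):
            writer.writerow([value, len(files), ",".join(files)])


def summarise(directory, out_dir="."):
    """Build both frequency tables for every *.csv file in `directory`."""
    paths = glob.glob(os.path.join(directory, "*.csv"))
    for title, column, out_name in TARGETS:
        table = count_across_files(paths, title, column)
        write_table(table, column, os.path.join(out_dir, out_name))
```

Calling summarise on the directory that holds your result files should write one table per section, with the three columns you described (the value, Freq and Filenames). If your files turn out to be space-separated rather than tab-separated, the split on "\t" will not work and the parsing would need to be adapted.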