Calculating percentage of A C U G content
4
2
Entering edit mode
9.9 years ago

Hey Everyone

I have data matrix of miRNAs mature sequences and I want to calculate the percentage of each nucleotide from the sequence. For example for below sequence

%A           %C    %G         %U             Sequence
13.63636     0     40.90909   45.45455       UGAGGUAGUAGGUUGUGUGGUU

I would really appreciate your help.. :)

RNA-Seq R • 4.8k views
ADD COMMENT
0
Entering edit mode

No I don't have these sequences in fasta format, I identified differentially expressed miRNAs and now what I am trying to do is to present them in xlsx format with detail like

 A        B         C          D         E       CV       Sdv   Median         miRNA2       energy 
13.866378 16.56889 14.930370  67.510306 31.798359 78.66506 22.761627 28.93486  hsa-miR-96-5p -2.1

long       Percent.A        Percent.C             Percent.U               Percent.G              
23                 34.7826086956522       17.3913043478261                  8.69565217391304                52.1739130434783
sequence                                            luciferase.assay.targets
UUUGGCACUAGCACAUUUUUGCU                          FOXO1; CDKN1A; FOXO3; SCARB1</pre>

Like this

ADD REPLY
0
Entering edit mode

Dear NicoBxl

I found what is problem in your code. Your code works perfectly fine for the the sequences in which 4 nucleotides are present as you mentioned in the above example...

seq <- c("AAATTTTGGGGGCCCCCCC","ATGCATGC","ATGC")

t(sapply(sapply(seq,strsplit,""),table))

                    A C G T
AAATTTTGGGGGCCCCCCC 3 7 5 4
ATGCATGC            2 2 2 2
ATGC                1 1 1 1


t(sapply(sapply(seq,strsplit,""),table))*100/rowSums(t(sapply(sapply(seq,strsplit,""),table)))
                         A        C        G        T
AAATTTTGGGGGCCCCCCC 15.78947 36.84211 26.31579 21.05263
ATGCATGC            25.00000 25.00000 25.00000 25.00000
ATGC                25.00000 25.00000 25.00000 25.00000

Now try this in which the frequency of one nucleotide is 0 for example

seq <- c("AAATTTTGGGGGCCCCCCC","ATGCATGC","ATGGG")
t(sapply(sapply(seq,strsplit,""),table))

     AAATTTTGGGGGCCCCCCC       ATGCATGC      ATGGG    
[1,] Integer,4                 Integer,4      Integer,3

and you will get this error if you will take percentage :)

Error in t(sapply(sapply(seq, strsplit, ""), table)) * 100 :  non-numeric argument to binary operator.

And I really don't know how to get rid of this problem :)

ADD REPLY
1
Entering edit mode

This is the benefit to using Biostrings.

ADD REPLY
0
Entering edit mode

Hey NicoBxl

Can you give any suggestion how to get rid of this problem?

ADD REPLY
1
Entering edit mode

Indeed there's a problem with this approach. I'll think about it. As Devon suggests, maybe use Biostrings instead ;)

ADD REPLY
2
Entering edit mode
9.9 years ago

If you use the Biostrings library, you can then just use the alphabetFrequency() function.

ADD COMMENT
2
Entering edit mode
9.9 years ago

In R:

If seq is an array containing your sequences:

seq <- c("AAATTTTGGGGGCCCCCCC","ATGCATGC","ATGC")

t(sapply(sapply(seq,strsplit,""),table))

                    A C G T
AAATTTTGGGGGCCCCCCC 3 7 5 4
ATGCATGC            2 2 2 2
ATGC                1 1 1 1

and for the %

t(sapply(sapply(seq,strsplit,""),table))*100/rowSums(t(sapply(sapply(seq,strsplit,""),table)))

                           A        C        G        T
AAATTTTGGGGGCCCCCCC 15.78947 36.84211 26.31579 21.05263
ATGCATGC            25.00000 25.00000 25.00000 25.00000
ATGC                25.00000 25.00000 25.00000 25.00000
ADD COMMENT
0
Entering edit mode

Dear NicoBxl Thanks for your help ..

If you have data frame or matrix then the result are completely different can you tell me why? because I have data matrix and even I am converting it into array

head(data)
     sequence                
[1,] "UAUACAAGGGCAGACUCUCUCU"
[2,] "UUCAGAUCCCAGCGGUGCCUCU"
[3,] "UAUACAAGGGCAAGCUCUCUGU"
[4,] "GGCUCCUUGGUCUAGGGGUA"
[5,] "AGGUGCUCCAGGCUGGCUCACA"
[6,] "GGAUCCGAGUCACGGCACCA"

and the output is

UGUCUACUACUGGAGACACUGG AUCCGCGCUCUGACUCUCUGCC UGCCCUUAAAGGUGAACCCAGU
  [1,] Integer,4              Integer,4              Integer,4
CACCCGGCUGUGUGCACAUGUGC UCUUCUCUGUUUUGGCCAUGUG AAAUUAUUGUACAUCGGAUGAG
           Integer,4               Integer,4              Integer,4
AAUCAUGUGCAGUGCCAAUAUG CUAUACAACUUACUACUUUCCC CAAGCUCGCUUCUAUGGGUCUG
[1,] Integer,4              Integer,3              Integer,4
ADD REPLY
0
Entering edit mode

Transform your matrix in an array. try:

seq <- as.character(data$sequence)

and then use the code I post before

ADD REPLY
0
Entering edit mode

For small sequence it works but for large sequence It give same result like above.

I have 2019 number of miRNAs sequences :-/

ADD REPLY
0
Entering edit mode

It should work even with 2019 sequences. Could you print seq[1:10] and put it here?

ADD REPLY
0
Entering edit mode

Yes I did that till 100 I am getting result like above you mentioned if I will try seq[1:150]

The result is changed.

> t(sapply(sapply(seq[1:100],strsplit,""),table))
                           A  C  G  U
UAUACAAGGGCAGACUCUCUCU     6  6  4  6
UUCAGAUCCCAGCGGUGCCUCU     3  8  5  6
UAUACAAGGGCAAGCUCUCUGU     6  5  5  6
GGCUCCUUGGUCUAGGGGUA       2  4  8  6
AGGUGCUCCAGGCUGGCUCACA     4  7  7  4
GGAUCCGAGUCACGGCACCA       5  7  6  2
AAUCGUACAGGGUCAUCCACUU     6  6  4  6
UCUUGAAGUCAGAACCCGCAA      7  6  4  4
ACCCCACUCCUGGUACC          3  9  2  3
UGGGGGAGCCAUGAGAUAAGAGCA   8  3 10  3

> t(sapply(sapply(seq[1:150],strsplit,""),table))
     UAUACAAGGGCAGACUCUCUCU UUCAGAUCCCAGCGGUGCCUCU UAUACAAGGGCAAGCUCUCUGU GGCUCCUUGGUCUAGGGGUA AGGUGCUCCAGGCUGGCUCACA GGAUCCGAGUCACGGCACCA
[1,] Integer,4              Integer,4              Integer,4              Integer,4            Integer,4              Integer,4
ADD REPLY
1
Entering edit mode
9.9 years ago

I figure out simple Solution :-)

seq <- c("AAATTTTGGGGGCCCCCCUUUC","ATGCATGUUUC","ATGGG")

#For count...
count<-sapply(c("A", "G", "C", "U"), function(nuc) str_count(seq, fixed(nuc)))

#You can choose any pattern for nucleotides like

c("A", "G", "C", "T") or c("G", "C")

#for `%`

percentage<-function(x) (x/sum(x)*100)
content.percentage<-t(apply(count,1,percentage))

Devon and NicoBxl thanks for your help :)

ADD COMMENT
0
Entering edit mode
9.9 years ago

If you have these in fasta format, you can do this with BBTools:

stats.sh in=sequences.fasta gc=gc.txt

The default format is

name    length    A    C    G    T    N    GC

"U" will be listed in the column for "T".

ADD COMMENT

Login before adding your answer.

Traffic: 2172 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6