Question

Calculating percentage of A C U G content

2

Entering edit mode

9.9 years ago

adnanjaved1988 ▴ 80

Hey Everyone

I have data matrix of miRNAs mature sequences and I want to calculate the percentage of each nucleotide from the sequence. For example for below sequence

%A           %C    %G         %U             Sequence
13.63636     0     40.90909   45.45455       UGAGGUAGUAGGUUGUGUGGUU

I would really appreciate your help.. :)

RNA-Seq R • 4.8k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by adnanjaved1988 ▴ 80

0

Entering edit mode

No I don't have these sequences in fasta format, I identified differentially expressed miRNAs and now what I am trying to do is to present them in xlsx format with detail like

 A        B         C          D         E       CV       Sdv   Median         miRNA2       energy 
13.866378 16.56889 14.930370  67.510306 31.798359 78.66506 22.761627 28.93486  hsa-miR-96-5p -2.1

long       Percent.A        Percent.C             Percent.U               Percent.G              
23                 34.7826086956522       17.3913043478261                  8.69565217391304                52.1739130434783

sequence                                            luciferase.assay.targets
UUUGGCACUAGCACAUUUUUGCU                          FOXO1; CDKN1A; FOXO3; SCARB1</pre>

Like this

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by adnanjaved1988 ▴ 80

0

Entering edit mode

Dear NicoBxl

I found what is problem in your code. Your code works perfectly fine for the the sequences in which 4 nucleotides are present as you mentioned in the above example...

seq <- c("AAATTTTGGGGGCCCCCCC","ATGCATGC","ATGC")

t(sapply(sapply(seq,strsplit,""),table))

                    A C G T
AAATTTTGGGGGCCCCCCC 3 7 5 4
ATGCATGC            2 2 2 2
ATGC                1 1 1 1


t(sapply(sapply(seq,strsplit,""),table))*100/rowSums(t(sapply(sapply(seq,strsplit,""),table)))
                         A        C        G        T
AAATTTTGGGGGCCCCCCC 15.78947 36.84211 26.31579 21.05263
ATGCATGC            25.00000 25.00000 25.00000 25.00000
ATGC                25.00000 25.00000 25.00000 25.00000

Now try this in which the frequency of one nucleotide is 0 for example

seq <- c("AAATTTTGGGGGCCCCCCC","ATGCATGC","ATGGG")
t(sapply(sapply(seq,strsplit,""),table))

     AAATTTTGGGGGCCCCCCC       ATGCATGC      ATGGG    
[1,] Integer,4                 Integer,4      Integer,3

and you will get this error if you will take percentage :)

Error in t(sapply(sapply(seq, strsplit, ""), table)) * 100 :  non-numeric argument to binary operator.

And I really don't know how to get rid of this problem :)

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by adnanjaved1988 ▴ 80

1

Entering edit mode

This is the benefit to using Biostrings.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Devon Ryan 104k

0

Entering edit mode

Hey NicoBxl

Can you give any suggestion how to get rid of this problem?

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by adnanjaved1988 ▴ 80

1

Entering edit mode

Indeed there's a problem with this approach. I'll think about it. As Devon suggests, maybe use Biostrings instead ;)

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Nicolas Rosewick 11k

Ram · Answer 1 · 2015-01-16

2

Entering edit mode

9.9 years ago

Devon Ryan 104k

If you use the Biostrings library, you can then just use the alphabetFrequency() function.

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Devon Ryan 104k

Ram · Answer 2 · 2015-01-16

2

Entering edit mode

9.9 years ago

Nicolas Rosewick 11k

In R:

If seq is an array containing your sequences:

seq <- c("AAATTTTGGGGGCCCCCCC","ATGCATGC","ATGC")

t(sapply(sapply(seq,strsplit,""),table))

                    A C G T
AAATTTTGGGGGCCCCCCC 3 7 5 4
ATGCATGC            2 2 2 2
ATGC                1 1 1 1

and for the %

t(sapply(sapply(seq,strsplit,""),table))*100/rowSums(t(sapply(sapply(seq,strsplit,""),table)))

                           A        C        G        T
AAATTTTGGGGGCCCCCCC 15.78947 36.84211 26.31579 21.05263
ATGCATGC            25.00000 25.00000 25.00000 25.00000
ATGC                25.00000 25.00000 25.00000 25.00000

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Nicolas Rosewick 11k

0

Entering edit mode

Dear NicoBxl Thanks for your help ..

If you have data frame or matrix then the result are completely different can you tell me why? because I have data matrix and even I am converting it into array

head(data)
     sequence                
[1,] "UAUACAAGGGCAGACUCUCUCU"
[2,] "UUCAGAUCCCAGCGGUGCCUCU"
[3,] "UAUACAAGGGCAAGCUCUCUGU"
[4,] "GGCUCCUUGGUCUAGGGGUA"
[5,] "AGGUGCUCCAGGCUGGCUCACA"
[6,] "GGAUCCGAGUCACGGCACCA"

and the output is

UGUCUACUACUGGAGACACUGG AUCCGCGCUCUGACUCUCUGCC UGCCCUUAAAGGUGAACCCAGU
  [1,] Integer,4              Integer,4              Integer,4
CACCCGGCUGUGUGCACAUGUGC UCUUCUCUGUUUUGGCCAUGUG AAAUUAUUGUACAUCGGAUGAG
           Integer,4               Integer,4              Integer,4
AAUCAUGUGCAGUGCCAAUAUG CUAUACAACUUACUACUUUCCC CAAGCUCGCUUCUAUGGGUCUG
[1,] Integer,4              Integer,3              Integer,4

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by adnanjaved1988 ▴ 80

0

Entering edit mode

Transform your matrix in an array. try:

seq <- as.character(data$sequence)

and then use the code I post before

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Nicolas Rosewick 11k

0

Entering edit mode

For small sequence it works but for large sequence It give same result like above.

I have 2019 number of miRNAs sequences :-/

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by adnanjaved1988 ▴ 80

0

Entering edit mode

It should work even with 2019 sequences. Could you print seq[1:10] and put it here?

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Nicolas Rosewick 11k

0

Entering edit mode

Yes I did that till 100 I am getting result like above you mentioned if I will try seq[1:150]

The result is changed.

> t(sapply(sapply(seq[1:100],strsplit,""),table))
                           A  C  G  U
UAUACAAGGGCAGACUCUCUCU     6  6  4  6
UUCAGAUCCCAGCGGUGCCUCU     3  8  5  6
UAUACAAGGGCAAGCUCUCUGU     6  5  5  6
GGCUCCUUGGUCUAGGGGUA       2  4  8  6
AGGUGCUCCAGGCUGGCUCACA     4  7  7  4
GGAUCCGAGUCACGGCACCA       5  7  6  2
AAUCGUACAGGGUCAUCCACUU     6  6  4  6
UCUUGAAGUCAGAACCCGCAA      7  6  4  4
ACCCCACUCCUGGUACC          3  9  2  3
UGGGGGAGCCAUGAGAUAAGAGCA   8  3 10  3

> t(sapply(sapply(seq[1:150],strsplit,""),table))
     UAUACAAGGGCAGACUCUCUCU UUCAGAUCCCAGCGGUGCCUCU UAUACAAGGGCAAGCUCUCUGU GGCUCCUUGGUCUAGGGGUA AGGUGCUCCAGGCUGGCUCACA GGAUCCGAGUCACGGCACCA
[1,] Integer,4              Integer,4              Integer,4              Integer,4            Integer,4              Integer,4

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by adnanjaved1988 ▴ 80

Ram · Answer 3 · 2015-01-19

I figure out simple Solution :-)

seq <- c("AAATTTTGGGGGCCCCCCUUUC","ATGCATGUUUC","ATGGG")

#For count...
count<-sapply(c("A", "G", "C", "U"), function(nuc) str_count(seq, fixed(nuc)))

#You can choose any pattern for nucleotides like

c("A", "G", "C", "T") or c("G", "C")

#for `%`

percentage<-function(x) (x/sum(x)*100)
content.percentage<-t(apply(count,1,percentage))

Devon and NicoBxl thanks for your help :)

Ram · Answer 4 · 2015-01-16

0

Entering edit mode

9.9 years ago

Brian Bushnell 20k

If you have these in fasta format, you can do this with BBTools:

stats.sh in=sequences.fasta gc=gc.txt

The default format is

name    length    A    C    G    T    N    GC

"U" will be listed in the column for "T".

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Brian Bushnell 20k