Question

Percentage of bases in sequence in R

1

7.4 years ago

marija ▴ 80

Hello everyone, I need a plot that visualizes the percentage of each nucleotide per base position in R. I have computed the base percentages, but I don't know how to visualize them.

fastq <- readDNAStringSet("https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/ERR037900_1.first1000.fastq","fastq")
freq <- alphabetFrequency(fastq)
perc <- freq/width(fastq)*100
perc <- as.data.frame(perc)
perc

Any help? Thanks.

R • 3.3k views

ADD COMMENT • link updated 7.4 years ago by linus ▴ 360 • written 7.4 years ago by marija ▴ 80

0

Duplicate of the question "Visualize nucleotides for every position in R".

ADD REPLY • link 7.4 years ago by linus ▴ 360

0

7.4 years ago

linus ▴ 360

So I coded the solution. However, I strongly recommend using FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), which produces this plot (per-base sequence content) and also reports several other FASTQ quality metrics.

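# biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("Biostrings")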
library("Biostrings")
library("tidyr")
library("ggplot2")

fastq <- readDNAStringSet("https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/ERR037900_1.first1000.fastq","fastq")
# sequence matrix col = position, row = sequence
sequence_matrix <- do.call(rbind, lapply(fastq, function(seq){return(strsplit(as.character(seq),split = '')[[1]])}))
# calculate frequency by position
freq <- apply(sequence_matrix, 2, function(col){
  stat <- table(col)
  return(c(stat['A'], stat['T'], stat['G'], stat['C'], stat['N']))
})
row.names(freq) <- c('A', 'T', 'G', 'C', 'N')
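# divide by the number of reads (1000 in this file) to get per-position fractions
freq <- freq / nrow(sequence_matrix)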

# replace NA with 0 (indexing table() by a base that is absent at a position gives NA)
freq[is.na(freq)] <- 0
freq <- t(freq)

freq <- cbind(1:nrow(freq), freq)
colnames(freq)[1] <- 'Position' 
# wide to long format transformation
freq_to_plot <- gather(as.data.frame(freq), 'Type', 'Value', A:N)
# plotting
ggplot(data=freq_to_plot, aes(x=Position, y=Value, group = Type, colour = Type ))+
  geom_line()+
  theme_classic()+
  ylab('Frequency')+ 
  guides(colour=guide_legend("Nucleotide"))
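
The counting step above can also be sketched in base R alone. This is a minimal illustration under stated assumptions, not a replacement for the code above: the four reads are made up, all reads are assumed to have the same length, and the names `reads`, `bases`, `seq_matrix`, `counts` and `perc` are hypothetical. Giving the factor fixed levels makes `table()` report 0 for a base that is absent at a position, so no NA replacement is needed.

```r
# Toy reads of equal length (made up for illustration, not taken from the FASTQ file)
reads <- c("ACGTN", "ACGTA", "TCGAA", "ACTTA")
bases <- c("A", "C", "G", "T", "N")

# One row per read, one column per position
seq_matrix <- do.call(rbind, strsplit(reads, split = ""))

# Count each base per column; fixed factor levels make absent bases count as 0
counts <- apply(seq_matrix, 2, function(col) table(factor(col, levels = bases)))
rownames(counts) <- bases

# Convert counts to percentages using the actual number of reads
perc <- counts / nrow(seq_matrix) * 100
colnames(perc) <- seq_len(ncol(perc))
print(perc)
```

The resulting matrix has one row per base and one column per position, so transposing it gives the same layout that the reshaping and plotting code above expects.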

ADD COMMENT • link 7.4 years ago by linus ▴ 360

score 5 · Accepted Answer · 2017-11-30

5

7.4 years ago

cpad0112 21k

library(Biostrings)
fastq <- readDNAStringSet("test.fa","fasta")
fastq
af <- alphabetFrequency(fastq, as.prob = TRUE, baseOnly = TRUE)
barplot(af)

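There is no need to compute the percentages by hand: with as.prob set to TRUE, alphabetFrequency returns proportions (fractions between 0 and 1; multiply by 100 for percentages), and you can pass the result to barplot directly without converting it to a data frame.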
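
To make the shape of these results concrete, here is a small sketch on made-up reads (the reads and the variable names are assumptions, not part of the original answer). It shows the per-read proportions from `alphabetFrequency` and, for comparison, per-position values from `consensusMatrix`, another Biostrings function that tabulates letters column by column.

```r
library(Biostrings)

# Toy reads (made up for illustration)
reads <- DNAStringSet(c("ACGTN", "ACGTA", "TCGAA", "ACTTA"))

# Per-read composition: one row per read, columns A, C, G, T, other;
# values are proportions in [0, 1]
af <- alphabetFrequency(reads, as.prob = TRUE, baseOnly = TRUE)
af_percent <- af * 100  # multiply by 100 to get percentages

# Per-position composition: one row per base, one column per position
pos_percent <- consensusMatrix(reads, as.prob = TRUE, baseOnly = TRUE) * 100
print(pos_percent)
```

With `baseOnly = TRUE`, anything that is not A, C, G or T (such as N) is pooled into the `other` row or column.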

ADD COMMENT • link 7.4 years ago by cpad0112 21k

0

Thank you. But sorry, I made a mistake: I meant the percentage of nucleotides at every position. Something like this (sorry, drawn in Paint): https://ibb.co/n47ifG

ADD REPLY • link 7.4 years ago by marija ▴ 80