Percentage of bases in sequence in R
2
1
Entering edit mode
7.0 years ago
marija ▴ 80

Hello everyone, I need plot, which visualize percentage of nucletides per base in R. I have a percentage of bases but now I don't know how to visualize them.

fastq <- readDNAStringSet("https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/ERR037900_1.first1000.fastq","fastq")
freq <- alphabetFrequency(fastq)
perc <- freq/width(fastq)*100
perc <- as.data.frame(perc)
perc

Any help? Thanks.

R • 3.2k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
5
Entering edit mode
7.0 years ago
library(Biostrings)
fastq <- readDNAStringSet("test.fa","fasta")
fastq
af=alphabetFrequency(fastq, as.prob = T,baseOnly=T)
barplot(af)

No need to convert per and as.prob=T will convert it percentages and you can bar plot directly without converting it data frame.

ADD COMMENT
0
Entering edit mode

Thank you. But sorry, I made a mistake, I thought percentage of nucelotides for very position. Something like this (sorry - paint): https://ibb.co/n47ifG

ADD REPLY
0
Entering edit mode
7.0 years ago
linus ▴ 360

So I coded the solution. However, I strongly recommend using FASTQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), which does create your plot and even generates more interesting FASTQ parameters.

source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library("Biostrings")
library("tidyr")
library("ggplot2")

fastq <- readDNAStringSet("https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/ERR037900_1.first1000.fastq","fastq")
# sequence matrix col = position, row = sequence
sequence_matrix <- do.call(rbind, lapply(fastq, function(seq){return(strsplit(as.character(seq),split = '')[[1]])}))
# calculate frequency by position
freq <- apply(sequence_matrix, 2, function(col){
  stat <- table(col)
  return(c(stat['A'], stat['T'], stat['G'], stat['C'], stat['N']))
})
row.names(freq) <- c('A', 'T', 'G', 'C', 'N')
freq <-freq/1000

# replace NA with 0 
freq[5,] <- sapply(freq[5,], function(x){
  ifis.na(x)){
    return(0)
  }else{
    return(x)
  }
})
freq <- t(freq)

freq <- cbind(1:nrow(freq), freq)
colnames(freq)[1] <- 'Position' 
# width to long format transformation
freq_to_plot <- gather(as.data.frame(freq), 'Type', 'Value', A:N)
#pltting
ggplot(data=freq_to_plot, aes(x=Position, y=Value, group = Type, colour = Type ))+
  geom_line()+
  theme_classic()+
  ylab('Frequency')+ 
  guides(colour=guide_legend("Nucleotide"))
ADD COMMENT

Login before adding your answer.

Traffic: 2956 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6