3 months ago
prithvi.mastermind
I want to extract the N- and C-terminal nucleotide bases from FASTA sequences using R or bash. Can anybody suggest corrections to my R code below? It always extracts a fixed number of bases from each end, which does not match my actual case. Please help me modify the code, or suggest bash commands that do this.
This is my input file:
>MN200573.1_India
AACCCAAAAAATCCACGTGAAAATAAGCTGAAACAACCAGGAGACAGAGCAGATGGACAGCCAGCAGGAG
ACAGAGCAGATGGACAGCCAGCAGGTGATAGAGCAGCTGGACAACCAGCAGGTGATAGAGCAGATGGACA
GCCAGCAGGCGATAGAGCAGCTGGACAGCCAGCAGGCGATAGAGCAGATGGACAGCCAGCAGGAGATAGA
GCAGCTGGACAGCCAGCAGGCGATAGAGCAGATGGACAGCCAGCAGGAGATAGAGCAGCTGGACAGCCAG
CAGGCGATAGAGCAGATGGACAGCCAGCAGGAGATAGAGCAGCTGGACAACCAGCAGGTGATAGAGCAGC
TGGACAACCAGCAGGAGATAGAGCAGATGGACAACCAGCAGGAGATAGAGCAGCTGGACAGCCAGCAGGA
GATAGAGCAGCTGGACAGCCAGCAGGAGATAGAGCAGCTGGACAGCCAGCAGGAGATAGAGCAGCTGGAC
AGCCAGCAGGAAATGGTGCAGGTGGACAGGCAGCAGGAGGAAATGCGGCAAACAAGAAGGCAGAAGACGC
AGGAGGAAACGCAGGAGGACAGGGACAAAATAATGAAGGTGCGAATGCCCCAAATGAAAAGTCTGTGAAA
GAATACCTAGATAAAGTTAGA
# Load required library
if (!requireNamespace("Biostrings", quietly = TRUE)) {
  install.packages("BiocManager")
  BiocManager::install("Biostrings")
}
library(Biostrings)

# Read the FASTA file
fasta_file <- "India.txt"
sequences <- readDNAStringSet(fasta_file, format = "fasta")

# Initialize a data frame to store the results
results <- data.frame(
  Sequence_Name = names(sequences),
  N_Terminal = character(length(sequences)),
  C_Terminal = character(length(sequences)),
  stringsAsFactors = FALSE
)

# Extract the N-terminal (first 10 residues) and C-terminal (last 10 residues)
for (i in seq_along(sequences)) {
  seq <- as.character(sequences[[i]])
  n_terminal <- substr(seq, 1, 10)
  c_terminal <- substr(seq, nchar(seq) - 9, nchar(seq))
  results[i, "N_Terminal"] <- n_terminal
  results[i, "C_Terminal"] <- c_terminal
}

# Write the results to a CSV file
output_file <- "N_C_Terminals.csv"
write.csv(results, output_file, row.names = FALSE)
cat("N-terminal and C-terminal sequences extracted and saved to", output_file, "\n")
N- and C-terminal are protein terms. Nucleotide sequences have 5' and 3' ends.
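Since the question also asks for bash commands, here is a minimal sketch using awk. It joins the wrapped sequence lines of each record and prints the first and last n bases (the 5' and 3' ends) together with the full length. The function name fasta_ends, the output column names, and the default of 10 bases are assumptions made for this example and are not taken from the original post. It is not clear from the question how the number of bases should vary between sequences, so n is simply a parameter here.

```shell
#!/usr/bin/env bash
# Sketch: print the first and last n bases of every record in a (multi-line) FASTA file.
# Assumptions: fasta_ends is a made-up helper name; n defaults to 10;
# sequence names contain no commas (otherwise the CSV output would need quoting).

fasta_ends() {
    local n="${1:-10}"   # number of bases to take from each end
    local file="$2"      # path to the FASTA file
    awk -v n="$n" '
        # Print one CSV row for the record collected so far.
        function flush_record() {
            if (name == "") return
            len = length(seq)
            k = (n < len) ? n : len      # never ask for more bases than the sequence has
            printf "%s,%s,%s,%d\n", name, substr(seq, 1, k), substr(seq, len - k + 1), len
        }
        BEGIN { print "Sequence_Name,Five_Prime,Three_Prime,Length" }
        /^>/  { flush_record(); name = substr($0, 2); seq = ""; next }
              { gsub(/[ \t\r]/, ""); seq = seq $0 }   # join wrapped lines, drop whitespace and CR
        END   { flush_record() }
    ' "$file"
}
```

For example, fasta_ends 10 India.txt > ends.csv would write one row per sequence. Changing the first argument changes how many bases are taken from each end, and sequences shorter than n are returned whole.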