How to determine N- and C-terminal nucleotide bases from FASTA sequences via R or bash scripting?
3 months ago

I want to determine the N- and C-terminal nucleotide bases of FASTA sequences using R or bash scripting. Can anybody suggest corrections to the R code I have prepared? It extracts a fixed number of bases from each end, which is not what I actually need. Please help me modify the current code, or suggest some bash commands.

This is my input file:

>MN200573.1_India
AACCCAAAAAATCCACGTGAAAATAAGCTGAAACAACCAGGAGACAGAGCAGATGGACAGCCAGCAGGAG
ACAGAGCAGATGGACAGCCAGCAGGTGATAGAGCAGCTGGACAACCAGCAGGTGATAGAGCAGATGGACA
GCCAGCAGGCGATAGAGCAGCTGGACAGCCAGCAGGCGATAGAGCAGATGGACAGCCAGCAGGAGATAGA
GCAGCTGGACAGCCAGCAGGCGATAGAGCAGATGGACAGCCAGCAGGAGATAGAGCAGCTGGACAGCCAG
CAGGCGATAGAGCAGATGGACAGCCAGCAGGAGATAGAGCAGCTGGACAACCAGCAGGTGATAGAGCAGC
TGGACAACCAGCAGGAGATAGAGCAGATGGACAACCAGCAGGAGATAGAGCAGCTGGACAGCCAGCAGGA
GATAGAGCAGCTGGACAGCCAGCAGGAGATAGAGCAGCTGGACAGCCAGCAGGAGATAGAGCAGCTGGAC
AGCCAGCAGGAAATGGTGCAGGTGGACAGGCAGCAGGAGGAAATGCGGCAAACAAGAAGGCAGAAGACGC
AGGAGGAAACGCAGGAGGACAGGGACAAAATAATGAAGGTGCGAATGCCCCAAATGAAAAGTCTGTGAAA
GAATACCTAGATAAAGTTAGA

This is my current R code:

# Load required library

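if (!requireNamespace("Biostrings", quietly = TRUE)) {
  # Install BiocManager only if it is missing, then use it to install Biostrings
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("Biostrings")
}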
library(Biostrings)

# Read the FASTA file
fasta_file <- "India.txt"
sequences <- readDNAStringSet(fasta_file, format = "fasta")

# Initialize a data frame to store the results
results <- data.frame(
  Sequence_Name = names(sequences),
  N_Terminal = character(length(sequences)),
  C_Terminal = character(length(sequences)),
  stringsAsFactors = FALSE
)

# Extract the first and last n_bases bases of each sequence
n_bases <- 10  # number of bases to take from each end
for (i in seq_along(sequences)) {
  dna <- as.character(sequences[[i]])  # avoid shadowing base::seq
  len <- nchar(dna)

  results[i, "N_Terminal"] <- substr(dna, 1, n_bases)
  results[i, "C_Terminal"] <- substr(dna, max(1, len - n_bases + 1), len)
}

# Write the results to a CSV file
output_file <- "N_C_Terminals.csv"
write.csv(results, output_file, row.names = FALSE)

cat("N-terminal and C-terminal sequences extracted and saved to", output_file, "\n")

N- and C-terminal are protein terms. Nucleotide sequences have 5' and 3' ends.

3 months ago

Linearize the FASTA (one record per line), then extract the first and last 10 bases:

 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  input.fasta  |\
awk -F '\t' '{N=length($2);printf("%s\t%s\t%s\n",$1,substr($2,1,10),substr($2,N-9));}'
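
The one-liner above hard-codes 10 bases per end. If the number of bases needs to vary, one option is to pass it in as an awk variable. This is only a minimal sketch: the function name `fasta_ends` and its `K` argument are made up for illustration and do not come from this thread. It joins wrapped sequence lines per record and clamps the start of the last substring, so sequences shorter than K are returned whole.

```shell
# fasta_ends K [FILE...]
# Hypothetical helper: prints "name<TAB>first K bases<TAB>last K bases"
# for every record. Reads standard input when no FILE is given.
fasta_ends() {
  k=$1
  shift
  awk -v k="$k" '
    # Print the record collected so far, if there is one
    function emit() {
      if (name != "") {
        n = length(seq)
        start = (n > k) ? n - k + 1 : 1   # clamp for short sequences
        printf("%s\t%s\t%s\n", name, substr(seq, 1, k), substr(seq, start))
      }
    }
    /^>/ { emit(); name = substr($0, 2); seq = ""; next }
    { gsub(/[ \t\r]/, ""); seq = seq $0 }   # join wrapped sequence lines
    END { emit() }
  ' "$@"
}

# Small inline demo with two toy records
printf '>seq1\nAACCGG\nTTAAGT\n>seq2\nACGT\n' | fasta_ends 4
```

On a real file this would be called as `fasta_ends 10 input.fasta`, or with any other length in place of 10.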
3 months ago
bk11 ★ 3.1k

In bash, I'm not sure whether you wanted something like the script below, which reports the first and last 10 bases of each sequence line:

#!/bin/bash

input_file=$1
n_length=10
c_length=10

while read -r line; do
  if [[ $line == ">"* ]]; then
    echo "$line"
  else
    n_term=$(echo "$line" | cut -c1-$n_length)
    c_term=$(echo "$line" | rev | cut -c1-$c_length | rev)
    echo "N-terminal: $n_term"
    echo "C-terminal: $c_term"
    echo
  fi
done < "$input_file"

Running it on your example gives:

./bash.sh india.fasta
>MN200573.1_India
N-terminal: AACCCAAAAA
C-terminal: CCAGCAGGAG

N-terminal: ACAGAGCAGA
C-terminal: CAGATGGACA

N-terminal: GCCAGCAGGC
C-terminal: AGGAGATAGA

N-terminal: GCAGCTGGAC
C-terminal: GGACAGCCAG

N-terminal: CAGGCGATAG
C-terminal: ATAGAGCAGC

N-terminal: TGGACAACCA
C-terminal: GCCAGCAGGA

N-terminal: GATAGAGCAG
C-terminal: GCAGCTGGAC

N-terminal: AGCCAGCAGG
C-terminal: CAGAAGACGC

N-terminal: AGGAGGAAAC
C-terminal: GTCTGTGAAA

N-terminal: GAATACCTAG
C-terminal: TAAAGTTAGA
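
Because the script reads line by line, a wrapped record like this one yields one pair of ends per 70-base line rather than one pair for the whole sequence. If the ends of the full sequence are what is wanted, a possible variation is to concatenate the lines of each record first. This is a sketch under assumptions: the function names `print_ends` and `fasta_whole_ends` are hypothetical, the input is assumed to have Unix line endings and plain ASCII bases, and the labels use 5'/3' as suggested in the comment on the question.

```shell
n_length=10   # bases to report from the 5' end
c_length=10   # bases to report from the 3' end

# print_ends HEADER SEQUENCE: report both ends of one complete record
print_ends() {
  [ -n "$1" ] || return 0   # nothing collected yet
  printf '%s\n' "$1"
  printf "5' end: %s\n" "$(printf '%s\n' "$2" | cut -c1-"$n_length")"
  printf "3' end: %s\n" "$(printf '%s' "$2" | tail -c "$c_length")"
}

# fasta_whole_ends: read FASTA on standard input, one report per record
fasta_whole_ends() {
  header=""
  seq=""
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      ">"*)
        print_ends "$header" "$seq"   # flush the previous record
        header=$line
        seq=""
        ;;
      *)
        seq="$seq$line"               # join wrapped sequence lines
        ;;
    esac
  done
  print_ends "$header" "$seq"         # flush the last record
}

# Usage: fasta_whole_ends < india.fasta
```

For the single record in the question this should report `AACCCAAAAA` and `TAAAGTTAGA`, which are the first N-terminal and last C-terminal values in the per-line output above.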