extract subset of 21 nt. from csv file
1
0
Entering edit mode
5.2 years ago

Hi, I am trying to get 1-21nt, 2-22nt, 3-23nt etc till the end of sequences. For example, if I have sequences of 51 nt in length then I will get the 30 fragments with 21 nt in each fragment. I tried this script whcih looks to work fine but only for first 2 lines only.

fasta <- read.csv("trial.csv", header = TRUE, sep =",") 
dd1 <- as.character(fasta$Seq)
IDS <- as.character(fasta$Ids)
b <- numeric() 
final <- NULL
for (i in 1:length(dd1))
{
  for ( j in 1:nchar(dd1[i])) 
  {
  a <-dd1[i]
  b[j] <- substr(a, j, j+20)
  #print (nchar(b[j]))
  #print (nchar(dd1[i]))
  #print (length(dd1[i]))
  ifelse ((nchar(b[j]))==21 && (j<=(nchar(dd1[i])-21)),{
  print(b[j]) 
  next}, break())
  } 
final <- paste(IDS[i],b, sep = ",") 
write.table(final, paste(i,"final.csv")) 
}



  Example : trial.csv will look like this 
    Ids,Seq
    hsa_circ_0000013,GGCTCCAGGGAGCTTGGCTTCTGTAGAAGTTCTAAGGAAGCGGTACGAACTCCACGGCGGTGGGGCGCTAACTAGCAGGGACCCCTGCAAGTGTTGGTCGGGGGCCTCGAGCTGCCTGAGCTGACACGAGGGGAGGGGTCTGTGTAGCCAACAG    hsa_circ_0000026,CAATCCCACAGAGTATTGATGAGGAAACTGAAGTTTGGAGCGATCACATCATTTTCCCAAGAATTCCTAGAGGACCTGTGCAACAACCTCTTGAGGATCGAATCTTCACTCCCGCTGTCTCAGCAGTCTACAGCACGGTAACACAAGTGGCAAGACAGCCGGGAACCCCTACCCCATCCCCTTATTCAGCACATGAAATAAACAAGGGGCATCCAAATCTTGCGGCAACGCCCCCGGGACATGCATCGTCCCCTGGACTCTCTCAA
hsa_circ_0000037,ATACATTTGGGCCTGTCTACCTGCCTTTGGGGCAATTTGCAGGTTTTGAGAAGTAGAAATGAGGGTCTGGAGAGGGCATCTGTGAGCCTCTTCTGGGAACCCCTCCCTTGTAGGT
hsa_circ_0000050,GATGAATTCAAAAGACTATTTGCTAAATATGGAGAACCAGGAGAAGTTTTTATCAACAAAGGCAAAGGATTCGGATTTATTAAGCTTGTGAGTGTAATTGTTTGATTTTACGTAGAATTAAAAAGGGTGGGGGATTTTTTTGTCACTACAAACGCTGAAGGCTTGGTTTTTAAACTGGGGAGGATAAATTGATCTTTTAGATTTTTCACCATTCTTACAGGAAAAATGCTTGCGGTATAATGCATAATTGTTGCTACCTAAGAGAAAAGAGGGGTGGGGTGGGTAAACTAAGTGGTGTTAGTGGGTGCTGCCTAAAGGTAATGGTCGAACTGAGCTGAAGGAAGAAAGGGA

Any help is much appreiciated.

Thanks

R • 1.3k views
ADD COMMENT
1
Entering edit mode

If you are doing an assignment then by all means continue with your code.

If you just need to print all 21-mers from a file use kmercountexact.sh from BBMap suite like this.

kmercountexact.sh in=test.fa k=21 out=stdout.fa (to print to screen)
ADD REPLY
0
Entering edit mode

Thanks. I am going to try this as well.

ADD REPLY
1
Entering edit mode
5.2 years ago
JC 13k

I am not sure about the output you want, but here is how I would do in perl:

#!/usr/bin/perl

use strict;
use warnings;

my $len = 21;
while (<>) {
    next if (/^Ids/); # skip header
    chomp;
    my ($id, $seq) = split(/,/, $_);
    print "$id";
    for (my $i = 0; $i<=(length $seq)-$len; $i++) {
        my $ss = substr($seq, $i, $len);
        print ",$ss";
    }
    print "\n";
}

Example:

$ perl split21.pl < trial.csv > final.csv
$ head final.csv 
hsa_circ_0000013,GGCTCCAGGGAGCTTGGCTTC,GCTCCAGGGAGCTTGGCTTCT,CTCCAGGGAGCTTGGCTTCTG,
ADD COMMENT
0
Entering edit mode

Thanks for your time and effort. I really appreciate it.

ADD REPLY
0
Entering edit mode

Hi,

I am getting extra space when gereratig the outfile from the perl script. I tried to mofiy it but space included in between in my sequences.

As, you know previously I was trying from 1st position to last-21 nt but now I am looking for 1st to last+21nt (from 1-21 start position)

Any help is much appreciated.

I tried this code but getting error`

 #!/usr/bin/perl

    use strict;
use warnings;
#while(my $row = <>)
#{
#   $row =~ 's/\s//g';
#}
my $len = 21;
while (<>) {
    next if (/^Ids/); # skip header
    chomp;
    my ($id, $seq) = split(/,/, $_);
 my $ss = substr($seq, 0, 21);   
#print "$ss";
my $merge= ($seq.$ss);
#print "$merge";
$merge=~ 's/\s+$//g';
    for (my $i = 0; $i<=(length $merge)-$len; $i++) {
        my $st = substr($merge, $i, $len);
# $st= ~ 's/\s+$//g';
#if(length $st < $len) {next};
       print "$id,$st";
print "\n";
    }
  }
ADD REPLY

Login before adding your answer.

Traffic: 1703 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6