I have small RNA-seq data that were produced by polyadenylating the 3' end prior to PCR amplification, so many of the reads contain poly-A stretches. Because the sequenced fragments were so short, the tail is sometimes sequenced through, leaving nucleotides on the 3' side of the poly-A tail. Like this:
TACTGAATGGCAGTGATGATAAAAAAAAAAAAAAAAAAAATTCCGCC
I have used PRINSEQ to remove the poly-A tail, but it seems that it only removes tails at the very end of the sequence. Does anyone know how I can remove tails together with the nucleotides that follow them, as in the read above?
Thanks,
Jon
I think this is a bit aggressive: now you can't have five A's in a row anywhere inside the reads.
Thanks! It seems to work fine. I don't fully understand this, but reads like this one are not trimmed: ACGAGTAGGGGAAAAAAAAAAC. Can you explain why? By the way, do you think 10 A's is too short for a poly-A stretch?
echo -e "1\nACGAGTAGGGGAAAAAAAAAAC\n3\n4" | awk 'NR%4==2 {gsub(/AAAAA.*$/,"");} {print}'
returns ACGAGTAGGGG
This line gives me the same result as yours, but not when I run the command on my file of reads. Maybe I should have mentioned that I have a FASTA file. Could it be something to do with the FASTA headers? They look like this: >HWI-ST486:386:D1UMHACXX:3:1101:2632:2144 1:N:0:TGACCA
But the command works fine for the read I showed in the question...
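A possible explanation, offered as a guess: the `NR%4==2` condition in the command above is tuned to four-line FASTQ records, so on a FASTA file with two lines per record it would only touch every other sequence. Below is a minimal sketch for FASTA input, assuming each sequence sits on a single line. The function name `trim_polya_fasta` and the `min_a` threshold are made up for illustration and are not part of the original answer.

```shell
# Hypothetical helper: trim a poly-A run and everything after it from
# FASTA sequence lines. Reads stdin, writes stdout.
# $1 = minimum number of consecutive A characters to treat as a tail (default 10).
trim_polya_fasta() {
    awk -v min_a="${1:-10}" '
        BEGIN {
            # Build a string of min_a literal A characters.
            for (i = 0; i < min_a; i++) pat = pat "A"
        }
        /^>/ { print; next }       # header line: pass through unchanged
        {
            pos = index($0, pat)   # first position of the A run, or 0 if none
            if (pos > 0) $0 = substr($0, 1, pos - 1)
            print
        }
    '
}
```

Usage would look like `trim_polya_fasta 10 < reads.fa > reads.trimmed.fa` (file names are placeholders). A read that starts with the poly-A run ends up as an empty line, so a length filter afterwards may be needed.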
Can you tell me how to delete the corresponding quality values of the trimmed poly-A from the FASTQ?
THANKS
Thanks for suggesting this awk command. It works perfectly. However, only the reads are trimmed and not the quality scores, so the trimmed FASTQ file cannot be used with TopHat, for instance. Can you please suggest a command that trims the quality scores to match the reads trimmed by your awk command? I appreciate your help.
I've updated the answer; can you please try the new command?
Thanks so much! It works!
Thanks again for adding the new code that also trims the quality scores. There is a slight issue with the code: after trimming, the read and the quality string differ in length, with the quality string having one extra character at the end. It can be fixed by modifying the code slightly. Thanks.
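The updated command is not shown in this thread, so the following is only a sketch of one way to keep the two lines in sync, assuming standard four-line FASTQ records. It records how many bases are kept on the sequence line and cuts the quality line to the same count, which avoids an off-by-one like the one described above. The function name `trim_polya_fastq` and the `min_a` threshold are made up for illustration.

```shell
# Hypothetical helper: trim a poly-A run and everything after it from
# FASTQ reads, cutting the quality string to the same length.
# Reads stdin, writes stdout.
# $1 = minimum number of consecutive A characters to treat as a tail (default 10).
trim_polya_fastq() {
    awk -v min_a="${1:-10}" '
        BEGIN {
            # Build a string of min_a literal A characters.
            for (i = 0; i < min_a; i++) pat = pat "A"
        }
        NR % 4 == 2 {
            # Sequence line: keep everything before the first A run.
            pos = index($0, pat)
            keep = (pos > 0) ? pos - 1 : length($0)
            $0 = substr($0, 1, keep)
        }
        NR % 4 == 0 {
            # Quality line: keep the same number of characters as the read.
            $0 = substr($0, 1, keep)
        }
        { print }
    '
}
```

Usage would look like `trim_polya_fastq 10 < reads.fastq > reads.trimmed.fastq` (file names are placeholders). Reads that begin with the poly-A run come out with zero length, which some aligners may reject, so filtering those records out afterwards is probably wise.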