Say I have a fasta containing 3 sequences...
>J00138:90:HG7YJBBXX:5:1101:19593:1719_2
CTTTACGCCCAGTAAT
>J00138:90:HG7YJBBXX:5:1101:20334:2299_1
TAAGGCGAAGGCTTCTTTC
>J00138:90:HG7YJBBXX:5:1101:20334:2299_2
GCTTTAATCTACATTTATAGACTGATTAATAACC
>J00138:90:HG7YJBBXX:5:1101:20730:2316_1
TAACTAGTTGCAAGGGTTTT
I want my sequence data to look like this:
>J00138:90:HG7YJBBXX:5:1101:19593:1719_2
CTTTACGCCCAGTAATNNNNNNNNNNNNNN
>J00138:90:HG7YJBBXX:5:1101:20334:2299_1
TAAGGCGAAGGCTTCTTTCNNNNNNNNNNN
>J00138:90:HG7YJBBXX:5:1101:20334:2299_2
GCTTTAATCTACATTTATAGACTGATTAATAACC
>J00138:90:HG7YJBBXX:5:1101:20730:2316_1
TAACTAGTTGCAAGGGTTTTNNNNNNNNNNN
Are there any programs or scripts that could accomplish this in a reasonable timeframe? I have thousands of sequences. Thanks!
Someone mentioned something like n = 150 ; seq = seq + ("N" * n) ; seq = seq[:n]
but I have no idea how I would put this into a python script. Thanks for all your help!
That doesn't look like a FASTA file to me. Can you post your input data correctly?
Should be rather easy to write a few lines of python to get this done.
Thanks! I saw a similar thread about padding fastq files but I couldn't figure out how to translate the script for fasta files. Essentially all I want to do is homogenize all the reads within a fasta file so they are all the same length by padding anything less than 150 with Ns.
This doesn't sound like a good idea to me. Why do you want to pad the reads with Ns?
I am trying to use a script called REAGO to find 16S fragments in metagenomic data, and it requires all sequences in a read pair to be the same length. Most of my read-length distribution is skewed between 100 and 150 bp, but very few read pairs after adapter trimming are 150 bp...usually the read 2s are slightly shorter. So I want to pad the sequences so I don't end up having to throw data away.
You might be using the wrong tool for the job. Good tools don't have strange and arbitrary requirements to mutate your data before they work.
Also, it makes absolutely no sense for read 2 to be shorter than read 1 after adapter-trimming. They should be the same length, since the adapter sequence will start at the same position in each read. Also, the majority of reads should have zero adapter sequence, for 150 bp reads - if that's not the case, it sounds like library prep screwed up, unfortunately.
It is likely that you are doing adapter-trimming incorrectly. I recommend this approach; some other methods will unnecessarily trim actual genomic sequence from the end of reads. There is no justification for adapter-trimming paired reads to different lengths - if you are using software that does that, it is flawed, or you are using it incorrectly. When the reads are properly trimmed, R1 and R2 will be the same length. Perhaps you are conflating adapter-trimming with quality-trimming? R2 almost always has lower quality than R1. Unlike adapter-trimming, quality-trimming is not strictly necessary, and I don't recommend it in any situation that is sensitive to read lengths.
Just like Brian, I also wonder why you would want to do this. For the sake of the script: do you know a priori that everything should have a length of 150 nt? Or should the script first find the longest read, then pad all those which are shorter?
These are HiSeq paired end reads. They won't be any bigger than 150bp. Everything is <=150bp.
Since you said you are learning python, what have you tried? For working with sequences, have a look at SeqIO from Biopython, for example in the very good tutorial and cookbook.
You want to loop over the fasta records, and append to every seq object
"N" * (150 - len(seq))
Cross-posted:
http://seqanswers.com/forums/showthread.php?p=204602
Please link other sites when you cross-post, so people don't waste their time answering a question that has already been answered.
I apologize Brian. I'll make sure I do that next time!