Hi, I am trying to remove library-added sequences from libraries generated with SMART-ChIP-seq. My ChIP DNA fragments varied in length, from 15 nt to 200 nt, with some sizes overrepresented (25 nt, 35 nt, 130 nt, 150 nt, 200 nt). The SMART-ChIP library prep adds a T tail to the 3' end of each DNA fragment, and the only sequence I know is the index built into the primer used for the PCR reaction. I did paired-end sequencing with 150 bp reads (it cost the same as shorter reads or single-end sequencing), which means that many of my reads contain sequence from the adapter region added during library generation. To align my reads to the genome, I need to figure out how to trim these sequences. The problem is that, because of poor sequencing quality, I don't even have an exact sequence to cut (also because the T tail is random in length). Here I have uploaded examples of what my read 1 and read 2 sequences look like in the FASTQ files; I highlighted the sequences starting from the T tail, and the index sequences are in green and blue:
https://docs.google.com/document/d/1c71jnYoynF8PYEXlfaowTyVkWvzML6VxROysD9djreA/edit
Could you please help me work out which software would be suitable, and how I can set up a method to trim everything starting from the T tail at the 3' end? The main issue is that the sequences after the T tail are not identical, mainly because of poor sequencing quality.
I am also wondering how to identify and trim the sequences added to the 5' end of each read.
How about simply trimming TTTTTTTT? If I understand you correctly, any read must contain that once the sequencing extends beyond the "biologically meaningful" insert that you want to map. Of course, such a generic sequence might also trim some genuine, non-technical T stretches, but you probably need to accept that. Tools like cutadapt or fastp can do that.
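To make the idea concrete, here is a minimal Python sketch of "cut everything from the first long T run onwards". It is only an illustration of the logic, not how cutadapt or fastp implement it; the function name `trim_from_poly_t` and the `min_run=8` threshold are assumptions chosen to match the TTTTTTTT suggestion above.

```python
import re


def trim_from_poly_t(seq, qual=None, min_run=8):
    """Cut a read at the first run of at least `min_run` consecutive T bases.

    Everything from the start of that run to the 3' end is discarded.
    Reads without such a run are returned unchanged.
    `qual` is the optional FASTQ quality string, trimmed to the same length.
    """
    match = re.search("T{%d,}" % min_run, seq)
    if match is None:
        return seq, qual
    cut = match.start()
    trimmed_qual = qual[:cut] if qual is not None else None
    return seq[:cut], trimmed_qual


if __name__ == "__main__":
    # Hypothetical read: insert, then a T tail, then adapter-like sequence.
    read = "ACGTACGGA" + "T" * 10 + "GGCCA"
    print(trim_from_poly_t(read, "I" * len(read)))
```

Because this matches exact T runs only, a sequencing error inside the tail can split it into shorter runs; lowering `min_run` trades that off against trimming more genuine genomic T stretches.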
Because you did paired-end sequencing, you should be able to trim adapters precisely without even knowing the adapter sequence. If your sequencing "goes off the end" of an insert, part of each read will be the reverse complement of its paired read. The point where that shared sequence ends is where the clipping should start. Trimmomatic can do this: see the Trimmomatic documentation and look up "Palindrome Mode".
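The overlap idea can be sketched in a few lines of Python. This is a simplified illustration only, not Trimmomatic's actual palindrome algorithm: it ignores base qualities, and the names `find_insert_length` and `trim_pair` as well as the `min_overlap` and `max_mismatches` defaults are assumptions made for the example.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_insert_length(read1, read2, min_overlap=10, max_mismatches=0):
    """Return the insert length k if both reads ran past the insert, else None.

    When the insert is shorter than the reads, the first k bases of read 1
    are the reverse complement of the first k bases of read 2. Candidate
    lengths are tried from longest to shortest, and the first one with at
    most `max_mismatches` differences is returned.
    """
    longest = min(len(read1), len(read2))
    for k in range(longest, min_overlap - 1, -1):
        candidate = revcomp(read2[:k])
        mismatches = sum(a != b for a, b in zip(read1[:k], candidate))
        if mismatches <= max_mismatches:
            return k
    return None


def trim_pair(read1, read2, **kwargs):
    """Clip both reads to the detected insert; leave them alone otherwise."""
    k = find_insert_length(read1, read2, **kwargs)
    if k is None:
        return read1, read2
    return read1[:k], read2[:k]


if __name__ == "__main__":
    # Hypothetical 20 nt insert with made-up read-through sequence on each read.
    insert = "ACGTTGCAAGGCTTAACCGG"
    r1 = insert + "TTTTTTTTGGCC"
    r2 = revcomp(insert) + "AGATCGGAAGAG"
    print(trim_pair(r1, r2))
```

Note that nothing here depends on knowing the read-through sequence, which is the point of the approach: the two reads of a pair define the insert boundary between them.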
Also, I've written a lightweight tool, TrimViz, to visualise what your trimmer has done by subsampling the FASTQ files before and after trimming. I'd highly recommend it, because many trimming tools can be a bit opaque about whether they are doing what you think they are doing.
Another thing: many aligners are pretty good at simply soft-clipping adapter sequences, so you could just throw untrimmed reads at your aligner. In that case you can QC the soft-clipping in the .bam file by viewing it in TrimViz using soft-clipping mode.
Edit: You can use TrimViz's clustering of aligner soft-clipped sequences to guess the wayward adapter sequences appearing at the 5' ends, because the aligner will mostly soft-clip them. The last page of the report, "Individual read trimming, grouped by trimming class", may contain some examples of 5'-clipped reads. You should be able to copy and paste the clipped parts from the .pdf file, and then potentially clip them from the original FASTQ file using e.g. cutadapt before re-aligning.
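If you would rather tally the 5' soft-clipped sequences directly instead of copying them from the report, something along these lines could work on SAM text (e.g. the output of `samtools view`). It is a rough sketch and not what TrimViz does internally; the function names and the `min_len` cutoff are assumptions. For reverse-strand alignments the read's original 5' end sits at the right-hand end of the SAM sequence, so that clip is reverse-complemented.

```python
import re
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def five_prime_softclip(flag, cigar, seq):
    """Return the soft-clipped bases at the read's original 5' end ('' if none)."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    ops = [(n, op) for n, op in ops if op != "H"]  # hard clips are absent from SEQ
    if not ops:
        return ""
    if flag & 0x10:
        # Reverse strand: SEQ is stored reverse-complemented, so the read's
        # 5' end is the last CIGAR operation.
        n, op = ops[-1]
        if op != "S" or n == 0:
            return ""
        return seq[-n:].translate(_COMPLEMENT)[::-1]
    n, op = ops[0]
    if op != "S" or n == 0:
        return ""
    return seq[:n]


def count_five_prime_clips(sam_lines, min_len=5):
    """Count 5' soft-clipped sequences of at least `min_len` bases."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:
            continue  # unmapped read
        clip = five_prime_softclip(flag, fields[5], fields[9])
        if len(clip) >= min_len:
            counts[clip] += 1
    return counts
```

Calling `count_five_prime_clips(lines).most_common(10)` on a subsample of alignments gives candidate 5' sequences, which you would still want to inspect by eye before trimming them.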