Please help me with adapter-trimming
7.4 years ago

I received FASTQ files from our sequencing core. They said the files have been demultiplexed, but when I ran FastQC I could still see some adapters (figure attached):

Adapter seq fastqc

The read files have names like this:

_TAGTCTTG_S7_L001_R1_001.fastq.gz
_TAGTCTTG_S7_L001_R2_001.fastq.gz

I also received some files whose contents I am not sure about (I guess they are the index reads):

TAGTCTTG_S7_L001_I1_001.fastq.gz
TAGTCTTG_S7_L001_I2_001.fastq.gz

zcat TAGTCTTG_S7_L001_I1_001.fastq.gz | head

@someinfo:1:1101:15235:1340 1:N:0:TAGTCTTGAT+TCTTTCCC
TAGTCTTGAT
+
CCDDDFFFFF
@someinfo:1:1101:15815:1395 1:N:0:TAGTCTTGAT+TCTTTCCC
TAGTCTTGAT
+
CCCCCFFFFF
@someinfo:1:1101:15719:1398 1:N:0:TAGTCTTGAT+TCTTTCCC
TAGTCTTGAT

When I look at the actual FASTQ file, I am not sure whether it contains both the index and the adapter (the core said they have demultiplexed it):

zcat _TAGTCTTG_S7_L001_R1_001.fastq.gz | head

@someinfo:1:1101:15235:1340 1:N:0:TAGTCTTGAT+TCTTTCCC
TGGGGCCTTAGTAAATGTGCCTGTGTGTGGGTCTCGGTCCAACACAGTTGATGTACATCTGTTTACCTGTTATAGTTGCAAGTTGTTCAGGCTGACATTGCTGTCGTTCACCCGACAAACACTGACTTCTACACCGGTGGTGAAGTAGGTAATGCGAGCTGGGTGCTGCCGAGTGTGTGTGTGCATGCTCAGCCGGCCGCGCAGACAGCTTGATCCTCTGACAGCTACGCAGATCGGAAGAGCACACGTC
+
DDCDDDCDFFFFGGGGGGGGGGHHHHHHHGGGHHHHGGGGHHHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHGGGHHHHHGGGGGHHHHHHHHHHHHHHHHHGGFGGGGHHHHHHHHHHHHHGGGGGHHGHGGHHHHGGGGHHHGHHGHHHHGHHHHHHHHGGGGGGAGGGGGGGGGGGFFFFFFFFEFFFFFFFFFFFFFFFFFFFFFFFFFFFFE
@someinfo:1:1101:15815:1395 1:N:0:TAGTCTTGAT+TCTTTCCC
TGGGGCCTTAGTAAATGTGCCTGTGTGTGGGTCTCGGTCCAACACAGTTGATGTACATCTGTTTACCTGTTATAGTTGCAAGTTGTTCAGGCTGACATTGCCTCGACAGTGATGCTGTCGTTCACCCGACAAACACTGACTTCTACACCGGTGGTGAAGTAGGTAATGCGAGCTGGGTGCTGCCGAGTGTGTGTGTGCATGCTCAGCCGGCCGCGCAGACAGCTTGATCCTCTGACAGCTACGCAGATCG
+
CCCCCCCCFFFFGGGGGGGGGGHHHHHHHGGGGHHHGGGGHHHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGHHHHHHHHHHGGGHHHHHGGGGGHHHHHGGHHHHHGHHHHGGGGGGGFFGFHHGHHHHGHGGGGGHGGFEGHHHHG-CCGHHGHHHHHHHHGHGHHHGGGGGGGGGGFFFFFFFFFFFFFFFFEFFFFFFFFFFF?DFFFF
@someinfo:1:1101:15719:1398 1:N:0:TAGTCTTGAT+TCTTTCCC
TGGGGCCTTAGTAAATGTGCCTGTGTGTGGGTCTCGGTCCAACACAGTTGATGTACATCTGTTTACCTGTTATAGTTGCAAGTTGTTCAGGCTGACATTGCCTCGATCGACAGTGATGCTGTCGTTCACCCGACAAACACTGACTTCTACACCGGTGGTGAAGTAGGTAATGCGAGCTGGGTGCTGCCGAGTGTGTGTATGCATGCTCAGCCGGCCGCGCAGACAGCTTGATCCTCTGACAGCTACGCAG

I did not know about this and went ahead and aligned the reads. Here is a snapshot of how the alignments look in IGV (4 samples, paired-end on a MiSeq, 2x250), sorted by base, with "show soft clips" enabled in the preferences (as suggested by someone from the core):

igv snapshot

How can I remove them without losing any information from the actual reads?

dna trimming • 4.2k views

Clearly, your DNA library prep was not optimal. I am not sure what's going on in your IGV images, but it's very obvious from your first (% adapter) graph that the insert size was too short compared to read length.

Your IGV images look like amplicon data. Can you describe this in more detail? Did you authorize the sequencing center to PCR-amplify your DNA sample? There's no way such a high proportion of reads would have the exact same start site without amplification. Considering that none of the reads you posted agree with the reference, it looks bad. How did you align the reads?

Also, the specific reference would be helpful here... and, what you are trying to do is also always useful information.

I encourage you to post an insert-size histogram and detail the platform and read length used. I'm guessing you ran 2x250bp on a MiSeq, but it's not really possible to tell from what you posted.

Also:

bbmap.sh ref=ref.fasta in1=r1.fq in2=r2.fq mhist=mhist.txt qhist=qhist.txt qahist=qahist.txt ihist=ihist.txt bhist=bhist.txt covhist=covhist.txt lhist=lhist.txt

Posting those results would be useful, along with the screen output.


Apologies for the incomplete information. Yes, these were PCR amplicons that were sequenced. I aligned the reads using bwa mem. We were trying to induce a deletion and check whether it worked by sequencing exon 6 of a particular gene.


Oh... if you're looking for a somewhat long deletion, I suggest you try aligning with BBMap; it's very good at capturing those within the alignment of a read.


Sure, some additional info about the experiment: we are attempting to detect indels in a panel of clones resulting from a CRISPR-targeted deletion. Regions around the target were PCR-amplified to produce a roughly 150 bp amplicon, which was then sequenced as a PE250 run.
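To make the read-through explicit, here is a small arithmetic sketch. It assumes the figures quoted in this thread (a roughly 150 bp amplicon and 250 bp reads); the variable names are only illustrative, and the real number depends on the actual amplicon length.

```shell
# Assumed values taken from the thread: ~150 bp amplicon, 250 bp reads.
read_len=250
insert_len=150

# Any cycles past the end of the insert read into the adapter
# (and whatever the sequencer emits after the adapter ends).
readthrough=$(( read_len > insert_len ? read_len - insert_len : 0 ))

echo "Expected non-insert bases per read: ${readthrough}"
# prints: Expected non-insert bases per read: 100
```

With an insert shorter than the read length, essentially every read is expected to carry adapter sequence at its 3' end, which is consistent with a high adapter percentage in FastQC.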


You can detect the adapter sequences and trim them like this:

bbmerge.sh in1=r1.fastq.gz in2=r2.fastq.gz outa=adapters.fa
bbduk.sh in1=r1.fastq.gz in2=r2.fastq.gz out=trimmed.fq.gz ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe

Then map the trimmed (interleaved) reads and you'll get better results.
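Before mapping, it may be worth checking that trimming did what was expected. The following is a minimal sketch using only awk and sort; `length_hist` is a hypothetical helper name, and it assumes a plain 4-line-per-record FASTQ. The inline sample records are made up for illustration.

```shell
# Print a read-length histogram (length <TAB> count) for FASTQ on stdin.
# Assumes 4 lines per record, so the sequence is line 2 of each record.
length_hist() {
  awk 'NR % 4 == 2 { n[length($0)]++ }
       END { for (l in n) print l "\t" n[l] }' | sort -n
}

# Tiny made-up sample: two 8 bp reads and one 4 bp read.
printf '@r1\nACGTACGT\n+\nFFFFFFFF\n@r2\nACGT\n+\nFFFF\n@r3\nACGTACGT\n+\nFFFFFFFF\n' \
  | length_hist

# Real use, with the output file name from the bbduk.sh command above:
#   zcat trimmed.fq.gz | length_hist
```

For an amplicon, most trimmed reads should cluster near the amplicon length; a peak at the full read length would suggest the adapters were not found.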

7.4 years ago
lshepard ▴ 480

Hi, I would recommend using a program such as trimmomatic to remove your adapter sequences. It handles paired-end reads quite well.


Thanks, I will look into it. Where can I find the adapter sequences that have been highlighted in FastQC? And what about the second set of files, which I am guessing contain the index reads?


trim_galore is a wrapper around Cutadapt and FastQC that will automatically detect and remove common (including Illumina) adapter sequences: https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/


I certainly recommend removing adapters in all cases, but when >30% of reads have adapter sequence, that indicates a major problem in sequencing. I'd reject the data and have it sequenced correctly.

7.4 years ago
GenoMax 147k

You will not see adapter sequences that easily in the actual reads; you will need to use a scan/trim program to look for them. I recommend bbduk.sh from the BBMap suite, which comes with a comprehensive set of sequences for many commonly used commercial adapters (the adapters.fa file in the resources directory of the BBMap distribution).
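As a rough first look before running a proper trimmer, one can count reads that contain the first 13 bases of the common Illumina adapter (AGATCGGAAGAGC, the sequence FastQC reports as "Illumina Universal Adapter"). This is only a sketch: `count_adapter_reads` is a hypothetical helper, it does exact matching, and it will miss adapters with sequencing errors or partial adapters at the very end of a read, which is why a k-mer based trimmer is still needed. The two sample sequences are the last bases of the first two R1 reads posted in the question.

```shell
# Count reads (FASTQ on stdin) whose sequence contains the given adapter prefix.
# Exact substring match only; assumes 4 lines per FASTQ record.
count_adapter_reads() {
  awk -v a="$1" 'NR % 4 == 2 { total++; if (index($0, a) > 0) hit++ }
                 END { printf "%d of %d reads contain %s\n", hit, total, a }'
}

# Sample built from the 3-prime ends of the first two R1 reads in the question.
printf '@r1\nCTACGCAGATCGGAAGAGCACACGTC\n+\nFFFFFFFFFFFFFFFFFFFFFFFFFF\n@r2\nCTACGCAGATCG\n+\nFFFFFFFFFFFF\n' \
  | count_adapter_reads AGATCGGAAGAGC
# prints: 1 of 2 reads contain AGATCGGAAGAGC

# Real use:
#   zcat _TAGTCTTG_S7_L001_R1_001.fastq.gz | count_adapter_reads AGATCGGAAGAGC
```

The second sample read ends partway into the adapter, so the exact match misses it; a trimmer such as bbduk.sh (with mink set) is designed to catch such partial matches.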

TAGTCTTGAT+TCTTTCCC are the index/tag read sequences. IndexRead1+IndexRead2 is how they are represented in the fastq read headers. You also have separate files with the index read sequences (I1 and I2 files).
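To confirm that a demultiplexed file really contains a single index pair, the header field can be tallied directly. This is a sketch that assumes the header layout shown in the question, where the last colon-separated field is IndexRead1+IndexRead2; `index_pairs` is a hypothetical helper name.

```shell
# Tally index pairs from FASTQ headers on stdin.
# Output columns: index read 1, index read 2, number of reads.
index_pairs() {
  awk 'NR % 4 == 1 {
         n = split($0, f, ":")   # last colon-separated field holds both indexes
         c[f[n]]++
       }
       END {
         for (k in c) { split(k, idx, "+"); print idx[1] "\t" idx[2] "\t" c[k] }
       }'
}

# Sample using two headers copied from the question (sequences shortened).
printf '@someinfo:1:1101:15235:1340 1:N:0:TAGTCTTGAT+TCTTTCCC\nACGT\n+\nFFFF\n@someinfo:1:1101:15815:1395 1:N:0:TAGTCTTGAT+TCTTTCCC\nACGT\n+\nFFFF\n' \
  | index_pairs

# Real use:
#   zcat _TAGTCTTG_S7_L001_R1_001.fastq.gz | index_pairs
```

A correctly demultiplexed file should be dominated by one pair, possibly with a few near-identical variants if the demultiplexer allowed mismatches.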
