Questions about how to read a FASTQ file and trim primers

3.4 years ago

valentinavan ▴ 50

Hi,

I have sent some samples to a company for Illumina whole genome sequencing. I have two questions:

1) They told me that adaptors and barcodes have been trimmed from the raw data and that only the primers are left. The information they gave is: "To trim the primers use the trimLeft argument in the filterAndTrim function of dada2. The size of the V3-V4 primer used for the project are 16 for forward and 24 for reverse." But it seems to me that the dada2 filterAndTrim function also needs the original untrimmed files, which I do not have. They only sent me the trimmed files. This is the call I would use:

    out <- filterAndTrim(fwd, filt.fwd, rev, filt.rev, trimLeft=c(16,24))

Am I wrong? Can you recommend another tool to remove primers?

2) I am also wondering whether the primers have actually been removed (the person from the company could not answer this question). If the primers were still there, I would have expected to see the same 16 bp or 24 bp sequence at the left end of each read, but I cannot see it. Below is an example of three of the forward reads I got from them:

@A00197:374:HH5YWDSX2:3:1101:2067:1000 1:N:0:CTTGTACACC+AAGCGCGCTT
GTTTTCAACCAACACTGGTTCGGGCCTCCATACGGTGTTACCCGTACTTCACCCTGGCCAAGGGTAGATCACCTCGCTTCGCGTCTATTCCCAGCGACTTGTCGCCCGTTTCGCACTCGCTAACGCTACGGCTCGCTAACGCTTAACCTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFF:,FF:F,,,,FFFFF:::,FF,:F,FFF:,,FFF,:F:,F,F,F,,:F,:F,FFFF,F
@A00197:374:HH5YWDSX2:3:1101:7925:1000 1:N:0:CTTGTACACC+TAGCGCGCTT
AGACAAACCTGTCGAGTATGCGGTCCACATGCGGCGCCTACCTGCCGATCGAATGATGGACCGTCTGCTCGCCCGCGGACAGGTCACTGCGCCCATGGTCCGTCGGCTGGCGGAGAAGATGGCTCGCTTCCATGAGACGGCTGAGACGAG
+
FF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00197:374:HH5YWDSX2:3:1101:9281:1000 1:N:0:CTTGTACACC+AAGCGCGCTT
ATGCAGGCTGATTGTCTGCTTACGGCGATCAAATCCGCCCACAGACGATACGCCATACTTGGGATGACGCACCAGCGTCCCCCGTTTGAGACCCAGCGACCGCGTACCACCCTGCGGTCGTCTGATGCCGCCGTGTGCGAACTGCAGCAT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFF

Thanks in advance

primers illumina fastq trimming • 4.1k views

ADD COMMENT • link updated 3.4 years ago by Istvan Albert 102k • written 3.4 years ago by valentinavan ▴ 50

score 1 · Answer 1 · 2022-02-26

3.4 years ago

Istvan Albert 102k

Run your data through FastQC; it will detect common adapters in the data. Then look at the overrepresented sequences.

You can also run fastp. It will generate an HTML report where you can investigate k-mers for hints.

Finally, you can count k-mers in the data yourself, though that is usually a last resort.
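
As a rough illustration of the k-mer counting idea, here is a minimal Python sketch that tallies the first k bases of every read. The function names, the default k=16, and the demo records are all made up for illustration; it assumes plain four-line FASTQ records. If primers were still attached, a handful of prefixes would be expected to dominate the counts.

```python
from collections import Counter
from itertools import islice


def read_sequences(lines):
    """Yield the sequence line of each four-line FASTQ record."""
    # Line 2 of every four-line record holds the bases.
    for seq in islice(lines, 1, None, 4):
        yield seq.strip()


def count_prefixes(lines, k=16, top=5):
    """Count the first k bases of every read; return the most common prefixes."""
    counts = Counter(seq[:k] for seq in read_sequences(lines) if len(seq) >= k)
    return counts.most_common(top)


# Tiny made-up records, only to show the call.
demo = [
    "@r1", "ACGTACGTTTGA", "+", "FFFFFFFFFFFF",
    "@r2", "ACGTACGTCCAT", "+", "FFFFFFFFFFFF",
    "@r3", "GGGTACGTCCAT", "+", "FFFFFFFFFFFF",
]
print(count_prefixes(demo, k=8))
```

For real data the same function can take an open file handle instead of the demo list (use `gzip.open(path, "rt")` for compressed files). Counting only read starts keeps the table small; counting every k-mer in every read is the more general, and slower, version.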

ADD COMMENT • link 3.4 years ago by Istvan Albert 102k

Thank you Istvan for replying.

I forgot to mention that I did run a FastQC analysis, and all my FASTQ files fail the "adapter content" check because of Nextera Transposase sequences. In addition, some files (but not all) also fail the "overrepresented sequences" check. I also ran a fastp analysis searching only for overrepresented sequences; I get a very long list for each file and I am not sure which ones are my primers.

1) I guess that I can remove the overrepresented sequences from the FASTQ files that failed the "overrepresented sequences" check in the FastQC analysis.

2) But I am not sure how to get rid of the Nextera Transposase sequences that have been found in all my files.

Thanks again

ADD REPLY • link 3.4 years ago by valentinavan ▴ 50

If you run fastp, it will find, report, and trim common adapters on its own. The Nextera adapter will likely be:

>nextera
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC

but it might be different with other sample preps
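
To make clear what 3' adapter trimming does to a read, here is a minimal Python sketch. It is not how fastp works internally: it uses exact matching only, whereas real trimmers tolerate mismatches. The function name and the `min_overlap` parameter are assumptions for illustration; the adapter is the sequence quoted above.

```python
NEXTERA = "CTGTCTCTTATACACATCTCCGAGCCCACGAGAC"


def trim_adapter(seq, qual, adapter=NEXTERA, min_overlap=8):
    """Cut a read at an exact adapter match, or at an adapter prefix at the 3' end."""
    pos = seq.find(adapter)
    if pos == -1:
        # No full copy found: look for the start of the adapter running
        # off the end of the read, longest candidate overlap first.
        first = max(0, len(seq) - len(adapter) + 1)
        last = len(seq) - min_overlap
        for start in range(first, last + 1):
            if adapter.startswith(seq[start:]):
                pos = start
                break
    if pos == -1:
        return seq, qual  # nothing to trim
    # Bases and qualities must be cut at the same position.
    return seq[:pos], qual[:pos]


# Made-up read: 10 insert bases followed by the first 12 adapter bases.
read = "ACGTACGTAC" + NEXTERA[:12]
print(trim_adapter(read, "F" * len(read)))
```

Adapter read-through like this shows up when the sequenced fragment is shorter than the read length, which is why the adapter appears at the right-hand end of reads rather than the left.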

ADD REPLY • link 3.4 years ago by Istvan Albert 102k

score 0 · Answer 2 · 2022-02-27

3.4 years ago

AfinaM ▴ 30

Did they give you the primers that they used in their sequencing? Once you have them, you can add them to the list of adapters before you run FastQC, so that you can check whether the primers are still in your sequence data. This is how I check after pre-processing, so I hope it helps.
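
The same check can be sketched directly in Python once a primer sequence is known. This is only an illustration: the function names are made up, the primer in the demo is a placeholder rather than a real V3-V4 primer, and matching is exact apart from IUPAC degenerate bases.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a primer, possibly containing degenerate bases, into a regex."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def fraction_with_primer(seqs, primer):
    """Return the fraction of reads that start with the primer."""
    pattern = primer_regex(primer)
    seqs = list(seqs)
    if not seqs:
        return 0.0
    # re.match anchors at the start of the read, i.e. its 5' end.
    hits = sum(1 for s in seqs if pattern.match(s))
    return hits / len(seqs)


# Placeholder primer and made-up reads, only to show the call.
reads = ["ACGTGATTT", "ACGTCTGGG", "TTTTACGTA", "ACGTACCCC"]
print(fraction_with_primer(reads, "ACGTNW"))
```

A fraction close to 1 would suggest the primer is still attached to the reads; a fraction near 0 would suggest it has already been removed or was never there.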

ADD COMMENT • link 3.4 years ago by AfinaM ▴ 30

I wish! They did not want to tell me the primer sequences; they said they cannot give them out!

ADD REPLY • link 3.4 years ago by valentinavan ▴ 50

Huh, that is weird. They should provide you with the sequences so that you can rerun the analysis/processing on your side for validation. By the way, for your second question, you can use BBDuk to remove any other adapters. Check out this link: bbduk guide

ADD REPLY • link 3.4 years ago by AfinaM ▴ 30

Some of the sequencing primers may technically be considered trade secrets, but these are usually not present in the data; they get recognized and cut off before the data is reported.

Standard adapters are not secret and are listed in many sources, for example:

https://github.com/usadellab/Trimmomatic/tree/main/adapters

ADD REPLY • link 3.4 years ago by Istvan Albert 102k