Dear Biostars community,
I am a Master's Student trying to process my first RNA-Seq Data.
After inspecting my Fastq Files, I noticed that almost 90% of the Reads had a complete read-through ending in a long Poly-G stretch, which is apparently specific to the NextSeq Sequencer.
I first did:
cutadapt --nextseq-trim=30 --minimum-length=20
which got rid of most of the Poly-Gs but not the Adapters at the Ends. It also significantly improved the Quality Score.
Then I did cutadapt -b ADAPTER_REV_REVCOM -B ADAPTER_FWD_REVCOM --minimum-length=20
I double checked the Adapter Sequences by doing grep on the raw Fastq-Data and they are correctly added for Read1 (fwd) and Read2 (rev) in the command.
The thing is, this only got rid of the Adapters on Read2; the Adapters on Read1 were all still there. I then checked the fastq files that were created using cat <file.fastq> | awk 'NR%4==2' | grep <(Partial) Adapter Sequence I supplied in the cmd line> and I got more than 100k Matches.
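The check described above can be wrapped in a small shell function so that it returns a single count per file. This is only a sketch: the function name count_adapter_reads is made up for illustration, and it assumes an uncompressed FASTQ file with exactly four lines per record.

```shell
# Hypothetical helper: count reads whose sequence line contains a given
# (partial) adapter sequence.
# Usage: count_adapter_reads <file.fastq> <adapter_sequence>
count_adapter_reads() {
    local fastq="$1"
    local adapter="$2"
    # Line 2 of every 4-line FASTQ record is the sequence.
    # grep -c prints the number of matching lines; -F treats the adapter as a
    # fixed string. "|| true" keeps the exit status zero when the count is 0.
    awk 'NR % 4 == 2' "$fastq" | grep -c -F -- "$adapter" || true
}
```

Running it on the Read1 file before and after trimming (for example, count_adapter_reads <file.fastq> <partial adapter>) gives two numbers that can be compared directly.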
I also subsequently did trimmomatic ILLUMINACLIP:contam_file.fa:2:30:10
and the Adapter Sequences still remained. I created the contam_file.fa myself; it looks like this:
>Prefix/1
AATGATACGGCGACCACCGAGATCTACACGTAAGGAGACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Prefix/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTCTCCTTACGTGTAGATCTCGGTGGTCGCCGTATCATT
>Prefix/3
CAAGCAGAAGACGGCATACGAGATGTTCAACCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>Prefix/4
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGGTTGAACATCTCGTATGCCGTCTTCTGCTTG
To note, the Adapter Sequences I find in my Fastq Files are not full-length matches of the Sequences I supplied in the cutadapt cmd line, but they overlap by at least 70% and there are close to no mismatches in the alignments. I thought cutadapt uses a minimum overlap of 3 by default, so this shouldn't be the issue. The default maximum error rate of 0.1 (mismatches per aligned length) should not be an issue either.
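To make the error-rate reasoning concrete, here is a small arithmetic sketch. It assumes, as I understand the cutadapt documentation, that the number of tolerated errors is the matched length times the maximum error rate, rounded down; the helper name allowed_errors and the example lengths are made up for illustration.

```shell
# Hypothetical helper: number of errors tolerated for a partial adapter match.
# Usage: allowed_errors <overlap_length> [error_rate_percent]
allowed_errors() {
    local overlap="$1"
    local rate_pct="${2:-10}"   # 10 percent corresponds to an error rate of 0.1
    # Shell integer division rounds down.
    echo $(( overlap * rate_pct / 100 ))
}

allowed_errors 46   # a 46 nt partial match -> prints 4
allowed_errors 9    # a 9 nt partial match  -> prints 0
```

Under that assumption, a near-perfect match covering about 70% of a long adapter is well within the default tolerance, which supports the point that overlap and error rate are unlikely to be the cause.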
I hope somebody can help me! Thank you and have a nice day :)
Sorry for the weird syntax of the contents of the "contam_file.fa". The file itself is in the correct format; it somehow got mangled while copy-pasting.
Your file appears to be in correct multi-fasta format.
Please check out two other popular tools that have easy-to-understand options.
- fastp: https://github.com/OpenGene/fastp?tab=readme-ov-file#simple-usage
- bbduk.sh from the BBMap suite: guide here https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/