Hi,
I've received my sequences from Illumina NextSeq in fastq files (R1-15bp R2-77bp), I'm trimming off the end of R2 where Qscore is low, but I also suspect the low mapping-% could be contributed by all the N's in my sequences.
I'd like to get rid of sequences with excessive N's before mapping.
Sample fastq:
@NS500799:417:HHMWHBGX7:1:11101:10296:1148:UMI:GAGACT:
GCCCTGCCTTCAAATGAAANNAGTTCAGCATGCCC
+
AAAAAEEEEEE66<EE/66EE<EEA6/A//EAAEE
@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/
@NS500799:417:HHMWHBGX7:1:11101:22941:1238:UMI:CCGTTT:
GCGCAGAGTTTAAACGCGAATNNGCTCAACTGGTT
+
AA/AA//EEEEEEEEEEEEEEEEEEEAE/EEEEEE
Desired output:
@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/
Q1: I can't get the right command to do this. Something like
grep -v "$(grep -A 2 -B 1 "NN" in.fastq)" in.fastq
sort describes what I want to do, but undesirably gets rid of the +'s for me. Any ideas?
Q2: As R1 & R2 are paired reads coming as separate fastq files, if I remove the sequences with N's in R2, how does this affect the demultiplexing and bowtie mapping step? Should I be removing corresponding sequences in R1 too? Unfortunately I don't understand the py script (CEL-Seq2 by Yanai Lab) well enough to determine this.
Thank you!!
A shorter read 1 is normal in scRNAseq, since it's mostly UMI and cell barcodes followed by a stretch of AAAAAAAAAA...
Ok, I have never worked on scRNAseq so didn't know this.
Thank you!! Will look into prinseq.
Also still wondering if it's possible to achieve this with just bash commands? It seems do-able I'm just not proficient at it.
Yes I did CEL-Seq. R1 is for UMIs and barcodes only.