I am using BBDuk for quality/adapter trimming and filtering to a minimum length of 40 bp for my paired-end RNA-seq tumor/normal samples.
I am also removing poly-N reads (reads where a single base makes up at least 75% of the read), for fear of them mapping uniquely.
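For context, the BBDuk step is along these lines (a minimal sketch; the file names, adapter reference, and quality thresholds are placeholders, and only minlen=40 reflects what I described above):

# adapter + quality trimming with BBDuk; sample_1/2.fastq and adapters.fa are placeholder names
bbduk.sh in=sample_1.fastq in2=sample_2.fastq out=trimmed_1.fastq out2=trimmed_2.fastq \
    ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 \
    qtrim=rl trimq=10 minlen=40 tpe tbo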
My issue with PRINSEQ is that I have to sort my PE files first, which takes a few hours for a single file.
#paste - - - - < "${file3}" | sort -k1,1 -t " " | tr "\t" "\n" > "${file5}"
#paste - - - - < "${file4}" | sort -k1,1 -t " " | tr "\t" "\n" > "${file6}"
perl ${PRINSEQ} -fastq "${file5}" -fastq2 "${file6}" -no_qual_header -trim_right 1 \
-custom_params "A 75%;T 75%;G 75%;C 75%" -min_len 40 -out_format 3 \
-out_good "${file1%_1.fastq}_tst" -out_bad null -log
Are there any (faster) alternatives for this purpose?
Why do you sort the FASTQ files?
From the PRINSEQ FAQ:
PRINSEQ requires sorted input files for paired-end or mate-pair data processing. If your two FASTQ files of a paired-end (or mate-pair) dataset need to be sorted by their sequence identifiers, you can use the following one-liner in Linux/Unix/OSX:
This will first join the 4 lines (paste - - - -) of a FASTQ entry into a single line (with each of the 4 original lines separated by tabs), then sort them by their sequence identifier (-k1,1 -t " " specifies everything before the first space for the sorting, which is our sequence identifier), and write each entry again in 4 lines by replacing the tabs with line breaks. The sorted entries are then saved in a new file specified after the ">" sign.
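For reference, the one-liner the FAQ is describing is the same paste/sort/tr pipeline I have commented out in my script above; for a single file it is essentially:

# join each 4-line FASTQ record onto one tab-separated line, sort by read ID, restore the 4-line layout
# (input.fastq / sorted.fastq are placeholder names)
paste - - - - < input.fastq | sort -k1,1 -t " " | tr "\t" "\n" > sorted.fastq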
The files I am working with are from TCGA.
I see. To speed that up, you could use:
| LC_ALL=C sort -k1,1 -t ' ' |
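With LC_ALL=C, sort compares raw bytes instead of doing locale-aware collation, which is noticeably faster on large files. Plugged into the sorting step above (same variable names as in the question), that would look like:

# locale-independent, byte-wise sort of the joined FASTQ records
paste - - - - < "${file3}" | LC_ALL=C sort -k1,1 -t ' ' | tr '\t' '\n' > "${file5}"
paste - - - - < "${file4}" | LC_ALL=C sort -k1,1 -t ' ' | tr '\t' '\n' > "${file6}"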