Faster alternative to Prinseq PE trimming polyN
8.7 years ago
umn_bist ▴ 390

I am using BBDuk for quality/adapter trimming and for discarding reads shorter than 40 bp in my paired-end (PE) RNA-seq tumor/normal samples.

I am also removing homopolymer reads (where a single base, A, T, G, or C, makes up at least 75% of the read) for fear of them mapping uniquely.

My issue with PRINSEQ is that I have to sort my PE files first, which takes a few hours for a single file.

#paste - - - - < "${file3}" | sort -k1,1 -t " " | tr "\t" "\n" > "${file5}" 
#paste - - - - < "${file4}" | sort -k1,1 -t " " | tr "\t" "\n" > "${file6}" 

perl ${PRINSEQ} -fastq "${file5}" -fastq2 "${file6}" -no_qual_header -trim_right 1 \
-custom_params "A 75%;T 75%;G 75%;C 75%" -min_len 40 -out_format 3 \
-out_good "${file1%_1.fastq}_tst" -out_bad null -log

Are there any (faster) alternatives for this purpose?

RNA-Seq prinseq BBDuk • 2.1k views

Why do you sort the FASTQ files?


From the PRINSEQ FAQ:

PRINSEQ requires sorted input files for paired-end or mate-pair data processing. If your two FASTQ files of a paired-end (or mate-pair) dataset need to be sorted by their sequence identifiers, you can use the following one-liner in Linux/Unix/OSX:

paste - - - - < file_1.fastq | sort -k1,1 -t " " | tr "\t" "\n" > file_1_sorted.fastq
paste - - - - < file_2.fastq | sort -k1,1 -t " " | tr "\t" "\n" > file_2_sorted.fastq

This will first join the 4 lines (paste - - - -) of a FASTQ entry into a single line (with each of the 4 original lines separated by tabs), then sort them by their sequence identifier (-k1,1 -t " " specifies everything before the first space for the sorting, which is our sequence identifier), and write each entry again in 4 lines by replacing the tabs with line breaks. The sorted entries are then saved in a new file specified after the ">" sign.

The files I am working with are from TCGA.


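I see. To go faster, you could run the sort step in the C locale, which compares raw bytes instead of applying locale collation rules: | LC_ALL=C sort -k1,1 -t ' ' |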
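
Dropped into the FAQ one-liner, that could look like the sketch below. The function name sort_fastq and the file names are placeholders, not part of PRINSEQ, and it assumes plain 4-line FASTQ records. Sort both mate files with the same locale so their orderings agree.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sort a 4-line-per-record FASTQ file by read ID.
# LC_ALL=C makes sort compare raw bytes, which is usually much faster
# than locale-aware collation.
sort_fastq() {
    local in="$1" out="$2"
    paste - - - - < "$in" \
        | LC_ALL=C sort -k1,1 -t ' ' \
        | tr '\t' '\n' > "$out"
}

# Example usage (placeholder file names):
# sort_fastq file_1.fastq file_1_sorted.fastq
# sort_fastq file_2.fastq file_2_sorted.fastq
```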

8.7 years ago

From your URL, I don't think you need to sort your FASTQ files if you know they are already paired.


This will be a huge, HUGE time saver. For future reference, if the files are already paired, can I assume that they are sorted?


if the files are already paired, can I assume that they are sorted

Yes. Just check:

paste <(paste - - - - < file_1.fastq | cut -f 1) <(paste - - - - < file_2.fastq | cut -f 1)

The two columns should contain the same IDs, row by row (modulo the /1 and /2 suffixes).
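
For large files, eyeballing two columns is impractical, so here is a rough sketch that counts the mismatching rows instead. The function names fastq_ids and count_id_mismatches are made up for illustration. It assumes 4-line FASTQ records and takes the read ID to be the header up to the first whitespace, with a trailing /1 or /2 removed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one read ID per record.
# Takes the first whitespace-delimited field of each header line
# and strips a trailing /1 or /2.
fastq_ids() {
    awk 'NR % 4 == 1 { id = $1; sub(/\/[12]$/, "", id); print id }' "$1"
}

# Hypothetical helper: count rows where the IDs of the two files differ.
# A missing record in either file also counts as a mismatch.
count_id_mismatches() {
    paste <(fastq_ids "$1") <(fastq_ids "$2") \
        | awk -F '\t' '$1 != $2 { n++ } END { print n + 0 }'
}

# Example usage (placeholder file names):
# count_id_mismatches file_1.fastq file_2.fastq
```

A result of 0 means every record lines up with its mate in the same order, so no sorting should be needed.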


I accepted the answer. Once again, thank you, thank you for the help. The command will be very useful!
