I am using BBDuk for quality/adapter trimming and filtering to a minimum length of 40 bp for my paired-end RNA-seq tumor/normal samples.
I am also removing poly-N reads (reads where a single base makes up at least 75% of the read), for fear of them mapping uniquely.
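For context, the BBDuk step is along these lines (a minimal sketch; the file names, adapter reference, and quality thresholds are placeholders, and only minlen=40 reflects what I described above):

# adapter + quality trimming with BBDuk; sample_1/2.fastq and adapters.fa are placeholder names
bbduk.sh in=sample_1.fastq in2=sample_2.fastq out=trimmed_1.fastq out2=trimmed_2.fastq \
    ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 \
    qtrim=rl trimq=10 minlen=40 tpe tbo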
My issue with PRINSEQ is that I have to sort my PE files first, which takes a few hours for a single file.
#paste - - - - < "${file3}" | sort -k1,1 -t " " | tr "\t" "\n" > "${file5}"
#paste - - - - < "${file4}" | sort -k1,1 -t " " | tr "\t" "\n" > "${file6}"
perl ${PRINSEQ} -fastq "${file5}" -fastq2 "${file6}" -no_qual_header -trim_right 1 \
-custom_params "A 75%;T 75%;G 75%;C 75%" -min_len 40 -out_format 3 \
-out_good "${file1%_1.fastq}_tst" -out_bad null -log
Are there any (faster) alternatives for this purpose?
Why do you sort the FASTQ files?
From the PRINSEQ FAQ:
PRINSEQ requires sorted input files for paired-end or mate-pair data processing. If your two FASTQ files of a paired-end (or mate-pair) dataset need to be sorted by their sequence identifiers, you can use the following one-liner in Linux/Unix/OSX:
This will first join the 4 lines (paste - - - -) of a FASTQ entry into a single line (with each of the 4 original lines separated by tabs), then sort them by their sequence identifier (-k1,1 -t " " specifies everything before the first space for the sorting, which is our sequence identifier), and write each entry again in 4 lines by replacing the tabs with line breaks. The sorted entries are then saved in a new file specified after the ">" sign.
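For reference, the one-liner the FAQ is describing is the same paste/sort/tr pipeline I have commented out in my script above; for a single file it is essentially:

# join each 4-line FASTQ record onto one tab-separated line, sort by read ID, restore the 4-line layout
# (input.fastq / sorted.fastq are placeholder names)
paste - - - - < input.fastq | sort -k1,1 -t " " | tr "\t" "\n" > sorted.fastq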
The files I am working with are from TCGA.
I see. To speed that up, you could use:
| LC_ALL=C sort -k1,1 -t ' ' |
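With LC_ALL=C, sort compares raw bytes instead of doing locale-aware collation, which is noticeably faster on large files. Plugged into the sorting step above (same variable names as in the question), that would look like:

# locale-independent, byte-wise sort of the joined FASTQ records
paste - - - - < "${file3}" | LC_ALL=C sort -k1,1 -t ' ' | tr '\t' '\n' > "${file5}"
paste - - - - < "${file4}" | LC_ALL=C sort -k1,1 -t ' ' | tr '\t' '\n' > "${file6}"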