Question

Biostrings extract read

0

Entering edit mode

8 months ago

marco.barr ▴ 150

Hello everyone, I am working on extracting the longest possible reads (total length 5698bp) from a fastq file (long reads from nanopore) while discarding those with lengths equal to 20bp and 498bp. I need to use Biostrings. I have written these commands:

fastq <- readDNAStringSet("HEVA_long.fastq", format = "fastq")
reads_to_remove <- which(width(fastq) == 20 | width(fastq) == 498)
filtered_fastq <- fastq[-reads_to_remove]

Do you have any suggestions for me? Is it correct? Alternatively, for instance using seqkit in bash, how could I achieve this? Thank you very much.

R fastq Biostrings • 625 views

ADD COMMENT • link 8 months ago by marco.barr ▴ 150

0

Entering edit mode

What is the odd requirement with 20 and 498? Are those adapters or some such?

You can use reformat.sh from BBMap suite to filter reads based on size. You can use lhist=filename to plot a histogram of read distribution.

reformat.sh -Xmx4g in=read.fq out=filtered.fq minlength=499 lhist=file.lst

ADD REPLY • link 8 months ago by GenoMax 147k

0

Entering edit mode

The reason is that on IGV I have very unbalanced peaks of coverage at these base lengths. I just wanted to clean up the IGV visualization because I have to give a presentation. So, I wanted to try this. Does this filter out all reads shorter than 498 base pairs?

ADD REPLY • link 8 months ago by marco.barr ▴ 150

score 0 · Answer 1 · 2024-03-19

0

Entering edit mode

8 months ago

Pierre Lindenbaum 164k

cat HEVA_long.fastq | paste - - - - | awk -F '\t' '{L=length($2);if(L!=20 && L!=498) print;}' | tr "\t" "\n"