marco.barr · 8 months ago
Hello everyone, I am trying to extract the longest possible reads (total length 5698 bp) from a FASTQ file of Nanopore long reads, while discarding reads of exactly 20 bp and 498 bp. I need to use Biostrings. I have written these commands:
fastq <- readDNAStringSet("HEVA_long.fastq", format = "fastq")
reads_to_remove <- which(width(fastq) == 20 | width(fastq) == 498)
filtered_fastq <- fastq[-reads_to_remove]
Do you have any suggestions? Is this correct? Alternatively, how could I achieve the same thing in bash, for instance with seqkit? Thank you very much.
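For reference, a minimal sketch of the same filter, assuming the quality-aware Biostrings readers (readQualityScaledDNAStringSet() / writeQualityScaledXStringSet()) are acceptable so that the quality strings travel with the sequences; the output filename is just a placeholder:

library(Biostrings)

# Read sequences together with their per-base qualities
fastq <- readQualityScaledDNAStringSet("HEVA_long.fastq")

# Logical indexing avoids the corner case where which() returns integer(0):
# fastq[-integer(0)] would return zero reads rather than all of them
keep <- !(width(fastq) %in% c(20, 498))
filtered_fastq <- fastq[keep]

# Write the filtered reads back out (placeholder filename)
writeQualityScaledXStringSet(filtered_fastq, "HEVA_long_filtered.fastq")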
What is the odd requirement with 20 and 498? Are those adapters or some such?
You can use reformat.sh from the BBMap suite to filter reads based on size. You can use lhist=filename to plot a histogram of the read-length distribution.

marco.barr replied:
The reason is that in IGV I see very unbalanced coverage peaks at these read lengths. I just wanted to clean up the IGV visualization because I have to give a presentation, so I wanted to try this. Does this filter out all reads shorter than 498 base pairs?
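For what it's worth, reformat.sh applies whatever length cutoffs you pass it (minlength= / maxlength=), so minlength=499 would drop everything shorter than 499 bp, while lhist= only writes the length histogram; seqkit seq with its -m/--min-len option does much the same from the command line. A rough equivalent in R, assuming the fastq object from above and treating 499 bp as an example cutoff:

# Quick look at the read-length distribution (similar in spirit to lhist=)
hist(width(fastq), breaks = 100,
     xlab = "Read length (bp)", main = "Read length distribution")

# Keep only reads of at least min_len bp, i.e. drop everything shorter
min_len <- 499
filtered_min <- fastq[width(fastq) >= min_len]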