Entering edit mode
5.3 years ago
kerenyifarkas
•
0
Hi,
I want to filter my reads to have only 29-31 nt long ones. I tried these three methods:
## 1:
awk 'BEGIN {FS = "\t" ; OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= 29 && length(seq) <= 31) {print header, seq, qheader, qseq}}' < in.fq > out_filtered.fq
## 2:
reformat.sh in=in.sam out=out_filtered.sam minlength=29 maxlength=31
## 3:
seqkit seq in.fq -j 10 -m 29 -M 31 > out_filtered.fq
At the end, none of them had reads longer than 31 nt but all of them had some shorter than 29 nt. Around 5% of the filtered reads are 28 nt long and there are less than 1% of them with less than 28 nt length.
Why is this happening and how can I fix this?
Or is it possible that the scripts 'psite' and 'phase_by_size' from the plastid package is doing something wrong to give this result?
Here is the link for two pics of results of 'psite' and 'phase_by_size': https://photos.app.goo.gl/sGhT2dvoNsAQHF7d7
Thanks.
Farkas
To me, the
awk
seems to do the job (I didn't check the other methods).For control, you could do
{print length(seq), seq}
.And check manually the corner cases (e.g. len == 29).
Do you really convert the
out_filetered.fq
to*.bam
and runphase_by_size
on that?Agreed, you should check manually if this tool is not reporting something odd.
I did:
Yes, checking manually it gives only reads 29-31 nt.
out_filtered.fq
files were mapped and thepsite
andphase_by_size
were run on the generated.bam
files.So now the question is why these scripts show this weird results... Maybe I should just skip this
plastid
stuff and use something else instead.Thanks!
Can you try following two step procedure with BBtools?
I tried. Gives the same results.
Following works for me using @ATPoint's test file.
OR