Entering edit mode
5.7 years ago
max_19
▴
170
Hi all!
I have a script which goes through a bunch of fastq files and searches for specific sequences found in a text file called test_text.txt, then outputs a fastq file with only those reads. The reads in my fastq files are all between 15-30bp long, however when I look at the output fastq files, I am getting reads of 52 bp long.
Does anyone know why this would happen? code is below:
for fastq in `ls my_files`;do
for i in `cat test_text.txt`; do zgrep -B 1 -A 2 $i $fastq >>filtered_fastq/${fastq}.filtered_reads.fastq; done;
done
Thanks for any input.
What have you done so far to convince yourself/check the reads are all 15-30bp?
I can't see an obvious mistake at first glance, so are you sure there are no 52 bp reads in the input file?
I used cutadapt to set min/max length for all sequences (cutadapt -a $adapter3 -f fastq -m 15 -M 30). also when I do fastqc on the file, my sequence length in the basic stats section shows '15-30'
Make sure; do
awk '{print length}' <filename> | sort -n | tail
. The last number should be the longest sequence length in your file. Remove the headers before if you think they're longer than 30bp (i.e.grep -v ">" <filename> | awk '{print length}' | sort -n | tail
Thank you! Tried this and got 54. This is odd! does this mean something is wrong with my cutadapt step? and why does FASTQC report 15-30bp for sequence length
cutadapt should get rid of reads > 30 in length in that case. Are you sure that you have the trimmed file in
my_files
, and not the original file you had before running cutadapt?By the way, the
-f
switch in grep allows you to provide a file of patterns to search; e.g.grep -f <file_with_patterns> <file_to_search_in>
, so possibly your could get rid of your inner for loop.Good to know, thank you!