Entering edit mode
2.2 years ago
Ngrin
•
0
Hello,
I have some fasta files and I want to align them against human genome (GRCh38). Since I have RNA seq data I am trying to use STAR. In the basic command there is a flag named --sjdbOverhang. As I read the value of it depends on the read length. I found the following command:
awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}' ./data/D001.fastq.gz
and I get read lengths as an array. These numbers vary between 4370 to 1. So what should I consider for this flag?
You run awk on a gzipped file. There are probably no reads with length 1 or 4370. Usually it’s 150 these days for Illumina. For that command you habe to decompress first, for example zcat and pipe that into awk. Or just take the first few reads and look at it with head or less. Should be the same for all.
Thanks @ATpoint for your help, I have unzipped it and run the command again. The output is one line as:
So does it mean the read length is 60 (and the 33858751 shows the number of reads)? if yes the flag value of --sjdbOverhang should be set as 59? I am sorry if my question is naive. I am very new to this area.
Yes, that will probably work.