Star Create genome index files

0

Entering edit mode

2.3 years ago

Ngrin • 0

Hello,

I have some fasta files and I want to align them against human genome (GRCh38). Since I have RNA seq data I am trying to use STAR. In the basic command there is a flag named --sjdbOverhang. As I read the value of it depends on the read length. I found the following command:

awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}' ./data/D001.fastq.gz

and I get read lengths as an array. These numbers vary between 4370 to 1. So what should I consider for this flag?

genome Alignment STAR • 853 views

ADD COMMENT • link updated 2.3 years ago by ATpoint 86k • written 2.3 years ago by Ngrin • 0

1

Entering edit mode

You run awk on a gzipped file. There are probably no reads with length 1 or 4370. Usually it’s 150 these days for Illumina. For that command you habe to decompress first, for example zcat and pipe that into awk. Or just take the first few reads and look at it with head or less. Should be the same for all.

ADD REPLY • link 2.3 years ago by ATpoint 86k

0

Entering edit mode

Thanks @ATpoint for your help, I have unzipped it and run the command again. The output is one line as:

60 33858751

So does it mean the read length is 60 (and the 33858751 shows the number of reads)? if yes the flag value of --sjdbOverhang should be set as 59? I am sorry if my question is naive. I am very new to this area.