Hi All,
I am attempting to identify novel ncRNAs from a circadian RNA-seq dataset. Specifically, I have a ribo-depleted RNA-seq time course with 31 samples (one sample every 2 hours over 60 hours). I have run STAR on each sample (command below). I am trying to follow the pipeline below, which comes from this publication (https://pubmed.ncbi.nlm.nih.gov/25349387/), but I am confused about how to actually carry it out.
STAR --genomeDir star --readFilesCommand zcat --readFilesIn Rep2_Data/2_1.fq.gz Rep2_Data/2_2.fq.gz --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within --twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM --runThreadN 16 --outFileNamePrefix Rep2_star_output/STAR_TP2/
The pipeline I am trying to follow:
1. Collect all reads that map across splice junctions (i.e., reads with large gaps in their alignments).
STAR identifies these junctions during alignment and reports them, along with the number of reads supporting each one, in files named SJ.out.tab. To reduce the impact of spurious reads and noise, I will require that each splice junction be supported by a minimum of 62 reads across the entire dataset (this threshold corresponds to two reads per time point for 31 time points).
My question: I have the SJ.out.tab files (31 in total, one per sample) and can see the column that gives the number of reads mapping to each splice junction, but I have no idea how to filter the SJ.out.tab files so that I can proceed to the next step. I'm guessing all of the SJ.out.tab files need to be merged and the read counts summed per junction? I'm really not sure.
2. Next, I will use BEDTools to filter out any junction mapping within 1 kb of any Ensembl or RefSeq transcript, or overlapping with any NONCODE transcript.
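For step 1, one possible approach is sketched below in Python. It is a minimal sketch under stated assumptions, not the publication's actual method: it sums column 7 of SJ.out.tab (uniquely mapping reads crossing the junction, per the STAR manual) for each junction across all files, keeps junctions with a total of at least 62, and writes them as BED for BEDTools. The glob pattern `Rep2_star_output/STAR_TP*/SJ.out.tab`, the output name `filtered_junctions.bed`, and the function names are hypothetical; the pattern assumes the other time points follow the same naming as the STAR command above.

```python
import csv
import glob
from collections import defaultdict

MIN_TOTAL_READS = 62  # 2 reads x 31 time points, as in the pipeline description
STRAND = {"0": ".", "1": "+", "2": "-"}  # STAR strand codes -> BED strand


def merge_sj_counts(paths):
    """Sum uniquely mapped read counts (column 7) per junction across files."""
    totals = defaultdict(int)
    for path in paths:
        with open(path, newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                if not row:
                    continue
                # Columns 1-4: chromosome, intron start, intron end (1-based), strand
                key = (row[0], int(row[1]), int(row[2]), row[3])
                totals[key] += int(row[6])
    return totals


def filter_junctions(totals, min_reads=MIN_TOTAL_READS):
    """Keep junctions whose summed read count reaches the threshold."""
    return {key: n for key, n in totals.items() if n >= min_reads}


def write_bed(junctions, out_path):
    """Write junctions as 6-column BED (0-based start), read total in column 5."""
    with open(out_path, "w") as out:
        for (chrom, start, end, strand), n in sorted(junctions.items()):
            bed_strand = STRAND.get(strand, ".")
            out.write(f"{chrom}\t{start - 1}\t{end}\tSJ\t{n}\t{bed_strand}\n")


if __name__ == "__main__":
    # Hypothetical layout: one STAR output directory per time point.
    sj_files = sorted(glob.glob("Rep2_star_output/STAR_TP*/SJ.out.tab"))
    if sj_files:
        kept = filter_junctions(merge_sj_counts(sj_files))
        write_bed(kept, "filtered_junctions.bed")
        print(f"{len(kept)} junctions kept from {len(sj_files)} files")
```

The resulting BED file could then serve as the input for step 2. As far as I can tell from the BEDTools documentation, `bedtools window` with `-w 1000 -v` reports entries with no annotated feature within 1 kb, and `bedtools intersect -v` reports entries with no overlap, but I have not tested either on this data.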
I'm not sure if anyone is going to be able to help me with this but I thought I would give it a try! Any help is appreciated.