Hey everyone,
I'm pretty new to RNA-sequencing and was wondering if anyone could help me out. I am trying to run a variety of SV callers (STAR-Fusion, etc.) on data from the CCLE (https://portal.gdc.cancer.gov/legacy-archive).
Most SV callers require .fastq files, but all the data I have downloaded is in BAM format. Here are some more details:
First, the BAM files are coordinate sorted. After realizing that they need to be sorted by name for the paired .fastq files to be created correctly, I sorted all of them by name.
I am using Samtools 1.9:
samtools sort -n -o outfile_sorted.bam infile.bam
Then:
samtools fastq -1 outfile_sorted_1.fastq.gz -2 outfile_sorted_2.fastq.gz outfile_sorted.bam
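For what it's worth, here is a minimal sketch of the sanity check I have in mind for the two output files: count the reads in each mate file and confirm the counts match. It assumes standard 4-line FASTQ records (no wrapped sequence lines), and the function names count_fastq_reads and check_pairs_match are made up for illustration, not part of Samtools.

```shell
# Count reads in a FASTQ file (plain or gzipped).
# Assumption: every record is exactly 4 lines.
count_fastq_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print NR / 4 }'
}

# Print both counts and return 0 only if the mate files
# contain the same number of reads.
check_pairs_match() {
    r1=$(count_fastq_reads "$1")
    r2=$(count_fastq_reads "$2")
    echo "R1=$r1 R2=$r2"
    [ "$r1" = "$r2" ]
}
```

Usage would be something like: check_pairs_match outfile_sorted_1.fastq.gz outfile_sorted_2.fastq.gz. Matching counts only show that the files are the same length; they do not prove that the mates are in the same order.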
Is this process enough to feed the .fastq reads into the SV caller? I figured that if I filtered out non-primary reads, the reads corresponding to fusions would be filtered out as well. I'm seeing a LOT of duplicated sequences in my QC reports, but I figured that wasn't a problem. I just want to make sure that I'm not keeping a bunch of artifacts in my .fastq files and potentially making my whole project useless.