My job was not a regular sequence to genome alignment but I was trying to use STAR to accomplish it. Here is the stiuation: I have a reference "genome" comprised of 12415 barcodes with fixed length 18 bp each. And the sequences that I was trying to alignment to this "genome" are 900,000 sequences also with fixed length 18 bp. I created the STAR index of the 12415 barcodes and run the STAR alignment using the below command:
runDir=L18
mkdir $runDir
cd $runDir
indexDir=/mnt/data1/workspace/Alvarez/STAR/Index/Cellecta18
fastq='/mnt/data1/workspace/Alvarez/STAR/SEQ/BRC_1_1-2_TAGCTTA_L001_R1_001-L18.fastq'
STAR --genomeDir $indexDir --readFilesIn $fastq --runThreadN 6 \
The last few lines in Log.out showed as:
12414 16382 18 397216
12415 16383 18 397248
Started loading the genome: Mon Jun 13 14:35:40 2016
checking Genome sizefile size: 397280 bytes; state: good=1 eof=0 fail=0 bad=0
checking SA sizefile size: 1843631 bytes; state: good=1 eof=0 fail=0 bad=0
checking /SAindex sizefile size: 382371 bytes; state: good=1 eof=0 fail=0 bad=0
Read from SAindex: genomeSAindexNbases=8 nSAi=87380
nGenome=397280; nSAbyte=1843631
GstrandBit=32 SA number of indices=446940
Shared memory is not used for genomes. Allocated a private copy of the genome.
Genome file size: 397280 bytes; state: good=1 eof=0 fail=0 bad=0
Loading Genome ... done! state: good=1 eof=0 fail=0 bad=0; loaded 397280 bytes
SA file size: 1843631 bytes; state: good=1 eof=0 fail=0 bad=0
Loading SA ... done! state: good=1 eof=0 fail=0 bad=0; loaded 1843631 bytes
Loading SAindex ... done: 382371 bytes
Finished loading the genome: Mon Jun 13 14:35:40 2016
alignIntronMax=alignMatesGapMax=0, the max intron size will be approximately determined by (2^winBinNbits)*winAnchorDistNbins=589824
winBinNbits=16 > genomeChrBinNbits=4 redefining:
winBinNbits=4
Created thread # 1
Created thread # 2
Created thread # 3
Created thread # 4
Created thread # 5
But my question is why Alignment.out.sam is empty:
@HD VN:1.4
@SQ SN:1 LN:18
@SQ SN:2 LN:18
...
@SQ SN:16383 LN:18
@PG ID:STAR PN:STAR VN:STAR_2.5.2a CL:STAR --runThreadN 6 --genomeDir /mnt/data1/workspace/Alvarez/STAR/Index/Cellecta18 --readFilesIn /mnt/data1/workspace/Alvarez/STAR/SEQ/BRC_1_1-2_TAGCTTA_L001_R1_001-L18.fastq
@CO user command line: STAR --genomeDir /mnt/data1/workspace/Alvarez/STAR/Index/Cellecta18 --readFilesIn /mnt/data1/workspace/Alvarez/STAR/SEQ/BRC_1_1-2_TAGCTTA_L001_R1_001-L18.fastq --runThreadN 6
(END)
Any suggestions?
What are you actually trying to achieve with this? Is this some attempt at demultiplexing? Are the barcodes inline? I can't say I'm surprised that an aligner doesn't work well when you try to align reads to very short barcodes.
(gets the popcorn ready)
EDIT: Although too be fair to OP, demultiplexing non-standard barcodes is gross. BBMap seemed to do a good job for me, but honestly I don't feel like this is a well-solved problem in Bioinformatics just yet. Ideally i'd like to load a FASTA, have it detect my barcodes (hard..) and then split on them whilst making it easy to set the False +ve/-ve parameters for the data (also hard). Until then I guess people will think of ingenius ways to use what they've got.
Sounds like @OP is trying to count how many reads there are for each of those barcodes. Isn't this something easily done using a hash?
Yup, a bit or perl or python would likely suffice.
Maybe i've over thinking this, but if the OP has just taken the first 18bp of each read, then the first sequenced base is not always the first base of the barcode, right? (or maybe i'm wrong here)
I always thought that was the "barcode problem" in a nutshell, otherwise finding barcodes would be easy-peasy. But because barcodes don't always start on the first sequenced base, and because of sequencing errors, they have decided to use an aligner to get around these issues - which is ingenious but not the right way to do it.
More importantly, why would you hash DNA that's only 18bp long? ;-)
That's why I asked about inline barcodes. You're correct that standard barcodes aren't included in reads (they're literally a separate read, so unless your insert size is short you'll normally not see them).
The barcodes would get hashed, not the DNA sequences :)
Regarding aligning them, we typically get ~98% perfect matches of barcodes on the HiSeq (a bit lower on the NextSeq and I'd have to sort through my emails to find the last MiSeq run). So hashing turns out to be a pretty quick and easy way to go about thing with little information loss. For single-cell stuff with inline barcodes on one read I would imagine that a similar procedure is done, though I'd have to ask the single-cell folks about that.
My purpose was not only trying to align sequences to short barcodes but also comparing the STAR program with Bowtie2. I have done this process using Bowtie2 but got this issue from STAR program.