Using bowtie2 I want to get only reads that align to animal1 genome and not animal 2. I alligned to animal1, with high sensitivity. Then I took the aligned .sam file and converted it to fastq using gatk SamToFastq (then would align to animal2 and take unaligned reads -> which would give me only animal1 reads).
Problem is that bowtie2 throws an error when try to align to animal2 when using this fastq file:
Error reading RefRecord offset from FILE
Looking at https://github.com/BenLangmead/bowtie2/blob/master/ref_read.h
RefRecord(FILE *in, bool swap) {
assert(in != NULL);
if(!fread(&off, OFF_SIZE, 1, in)) {
cerr << "Error reading RefRecord offset from FILE" << endl;
throw 1;
}
if(swap) off = endianSwapU(off);
if(!fread(&len, OFF_SIZE, 1, in)) {
cerr << "Error reading RefRecord offset from FILE" << endl;
throw 1;
}
if(swap) len = endianSwapU(len);
first = fgetc(in) ? true : false;
}
The issue appears because there is just an id in the output .sam (shown converted to fastq) below:
@SRR10111187.1.1
GTTTATTAGTACGTTGAGGTTGTGATCCGGAGTTTTCGGGGTATGGGCAACCTAGCTTGCTTAGCTGACCTTATTATAGATTGTGGTGTAAGTT
But bowtie want additional information, not just id.
Can I just use the original raw fastq file, get the matching read id, then copy the rest of the read header information and replace the read header in the fastq file output from bowtie2?
Eg
@SRR10111187.1.1 1 length=150 GTTTATTAGTACGTTGAGGTTGTGATCCGGAGTTTTCGGGGTATGGGCAACCTAGCTTGCTTAGCTGACCTTATTATAGATTGTGGTGTAAGTT
I presume this is valid as reads as far as I know are not altered by bowtie2?
Use a proper tool like
bbsplit.sh
to align data to both genomes at the same time. See: BBSplit syntax for generating builds for the reference genome and how to call different builds.Brilliant, thanks
I have run BBSplit using ambiguous2=split and ambiguous2=toss > but am getting really anomalous results wherby one animal genome is dominating, which contradicts Krona analysis and mitochondrial analysis. Will run alignment methid as recommended below to QC
With human and mouse genomes (which are very similar) this is not unexpected. I am not sure about mitochondrial analysis you are talking about but changing aligners/parameters is not going to change this result. What kind of an experiment is this? Are you working with xenografts or is this simple contamination.
The dataset is contaminated with human reads, I want to have high certainty in identifying only human reads- want to be certain on contamination quatity and typing. If there is any similarity to mouse I do not want to include the read. (Mitochondrial analyses as a separate anaysis where I took human and mouse mitochondrial sequence and aligned entire dataset to each seperately using bwa mem). Krona/STAT taxonomic anaysis I did on NCBI.
I think easiest way may be is to run bowtie2 2x once for mouse, once for human. Then use a shell script to find all non matching ids in the sam files and write to new file
If a read aligns perfectly to mouse, and imperfectly to human, you really want to throw it away? I think most people would say a read that aligns better to mouse should be classified as mouse.
I want to count human and mouse seperately. The easiest way for me to work this out is to align twice
Since human and mouse genomes have areas of homology if you align the reads separately you would be double counting some reads.
I actually found that using bowtie2 with very hgh sensitivity gave very few matches. I ended up using bbsplit for final analysis
It might be easiest. Sometimes, the easiest way is rather inaccurate.