Tool to separate human and mouse rna seq reads
4
5
Entering edit mode
9.5 years ago
Ron ★ 1.2k

Hi,

I have PDX RNA-seq data.I want to separate the reads of mouse and human,before comparing with tumor RNA-seq data. Is there any tool to do that?

After removing the mouse reads I want to do differential expression between primary tumors and PDX tumors.

Thanks,
Ron

RNA-Seq next-gen-sequencing • 15k views
ADD COMMENT
1
Entering edit mode

Sorry, I have a problem with BBsplit.

I have xenograft mouse-human rna-seq samples and I had thought to using BBSplit to delete the mouse contamination.

So I used this command line:

bbsplit.sh in1=reads1.fq in2=reads2.fq ref=human.fa,mouse.fa ambiguous2=toss basename=out_%.fq refstats=Statistics_%.txt

Than, I have remapped the output fastq file for the human reference with STAR and then I would like to use FeatureCount to reconstruct the raw count of the genes, but it doesn't work well.

Can you recommend a pipeline to follow for RNA-seq data after using bbsplit?

Thanks so much for the reply.

ADD REPLY
0
Entering edit mode

STAR + featurecount/htseq-count is a very good pipeline to use.

ADD REPLY
0
Entering edit mode

I have a personal thought, if it's not correct then I'm sorry.

If you know which is the mouse and which is the human in the data, you can write shell split directly. If you don't know, you can compare it first and then split it after comparing it twice.

5
Entering edit mode
9.5 years ago
ethan.kaufman ▴ 380

Xenome works pretty well.

ADD COMMENT
0
Entering edit mode

the page seems to have moved.

ADD REPLY
0
Entering edit mode

Hmm, that's unfortunate. Maybe you can try contacting the corresponding author? Another approach would simply be to align first to the mouse reference, and then use the unmapped reads for your analysis, but as I recall Xenome uses a more sophisticated filtering strategy.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

This link too is now dead. Anyone have a new source?

ADD REPLY
0
Entering edit mode

appears to be bundled here https://github.com/data61/gossamer

if it does not work (recent build, so should be OK), I have the original binaries that I downloaded from the NICTA site.

ADD REPLY
0
Entering edit mode

Klugman, I downloaded the current version from Github. I'm facing the following issue - https://github.com/data61/gossamer/issues/9. Could you please share the original binaries you downloaded from NICTA?

ADD REPLY
3
Entering edit mode
9.5 years ago

BBSplit will map the reads simultaneously to multiple references, and output in multiple files, one per reference. Since you're looking at RNA-seq of tumors, you would want to increase the sensitivity over the default, using settings like (if mapping to the genomes):

bbsplit.sh in=reads.fq ref=hg19.fa,mm10.fa minratio=0.5 maxindel=100000 minhits=1 basename=out_%.fq ambig2=all local

If mapping to transcriptomes, maxindel=500 would probably be better than 100000.

ADD COMMENT
0
Entering edit mode

I have paired end reads in format of fastq.gz so how can I input them.

Also In this case I don't have to index hg19.fa and mm10.fa?

ADD REPLY
1
Entering edit mode

BBSplit will automatically index them when you give it that command, if they were not already indexed. Also, it will accept fastq.gz as input or output; you don't need to unzip them.

ADD REPLY
0
Entering edit mode

The command produced two folders namely genome and index.The genome folder has these files:

chr10.chrom.gz
chr11.chrom.gz
chr12.chrom.gz
chr1.chrom.gz
chr2.chrom.gz
chr3.chrom.gz
chr4.chrom.gz
chr5.chrom.gz
chr6.chrom.gz
chr7.chrom.gz
chr8.chrom.gz
chr9.chrom.gz
info.txt
merged_ref_6708417211327.fa.gz
namelist.txt
reflist.txt
scaffolds.txt.gz
summary.txt

The index folder has these files:

chr12_index_k13_c2_b1.block
chr12_index_k13_c2_b1.block2.gz
chr1-3_index_k13_c2_b1.block
chr1-3_index_k13_c2_b1.block2.gz
chr4-7_index_k13_c2_b1.block
chr4-7_index_k13_c2_b1.block2.gz
chr8-11_index_k13_c2_b1.block
chr8-11_index_k13_c2_b1.block2.gz

I was expecting paired end reads for mouse and paired end reads for human.

ADD REPLY
2
Entering edit mode

Hi Ron,

The contents of /ref/genome/ and /ref/index/ are just the index; don't worry about that. The input reads should have been specified as in=reads.fq, or for paired reads named read1.fq and read2.fq, in=read#.fq Then if you set basename=out_%.fq your output paired reads would be out_hg19.fq and out_mm10.fq.

Those will be paired and interleaved. You can deinterleave them with reformat.sh if you want them in two files. Are the files out_hg19.fq and out_mm10.fq present? They should be in the working directory you were in when you ran the comand (not in /ref/).

ADD REPLY
0
Entering edit mode

Yes.I got the two files out_hg19.fq and out_mm10.fq.

Now I have interleaved them using reformat.sh script and then I compressed the reads using gzip.Thanks Brian.

ADD REPLY
0
Entering edit mode

I am getting this error while running a parallel job.It worked well for a single sample.Do I have to use -Xmx3g option ?if yes what would be the appropriate memory for java?

Could not create the Java virtual machine.
Error occurred during initialization of VM
Could not reserve enough space for object heap
java -Djava.library.path=/rob2056/software/bbmap/jni/ -ea -Xmx52753m -cp /rob2056/software/bbmap/current/ align2.BBSplitter build=1 ow=t fastareadlen=500 minhits=2 minratio=0.9 maxindel=20 trim=rl untrim=t in1=/tmp/419629.1.standard.q/inpu
t/Sample_ZA30_cells_R1.fastq.gz in2=/tmp/419629.1.standard.q/input2/Sample_ZA30_cells_R2.fastq.gz ref=/pbtech_mounts/fdlab_store003/fdlab/genomes/human/hg19/indexes/star/hg19.fa,/pbtech_mounts/fdlab_store003/fdlab/genomes/mouse/mm10/genome/mm10.fa minratio=0.5 maxindel=100000 minhits=1 basename=/tmp/419629.1.standard.q/dest/Sample_ZA30_cells_R1.fastq_%.fq ambig
2=all local
ADD REPLY
0
Entering edit mode

If you want to run multiple copies of BBSplit simultaneously on the same machine, for mouse and human reference together, you can include the flag -Xmx48g. There is not much point in running it simultaneously, though, because BBSplit is multi-threaded and automatically uses all available cores (unless you set the number of threads with e.g. t=4). So, 2 instances at once on the same machine will not go any faster than 2 instances one after the other.

ADD REPLY
0
Entering edit mode

I am running on SGE cluster with gnu parallel as I have 20 samples.The only thing is I want to run all samples in parallel. I have given -Xmx20g as of now, if not then will give 48g.

ADD REPLY
0
Entering edit mode

Hi Brian,

I am running BBmap on 2 samples which have fastq of sizes 50G each (paired end). These have been running over since 4 days.Is there any way to make the process faster?

bbmap/bbsplit.sh in1=$TMPDIR/input/${input_file} in2=$TMPDIR/input/${input_file%R1*}R2.fastq.gz ref=/hg19.fa,/mm10.fa minratio=0.5 maxindel=500 minhits=1 basename=$TMPDIR/dest/${input_file%R1*}_%.fq ambig2=all local
ADD REPLY
0
Entering edit mode

Ron,

How many threads are you using? This should really not take very long unless you're restricting it to a single thread or something. Also, are you sure it has not crashed? Run "top" to verify that it is still consuming CPU resources, and check the output to stderr as well - if it crashes, it will print some kind of "Exception".

Adding the "fast" flag and increasing minratio will make it go faster, at the expense of reduced sensitivity. Alternately, you can now split datasets between references with Seal, which is WAY faster but requires more memory (it uses kmer-matching rather than alignment). The amount of memory it uses is tunable, though, again with a tradeoff of sensitivity.

ADD REPLY
0
Entering edit mode

Here is the log file:

java -Djava.library.path=/software/bbmap/jni/ -ea -Xmx154418m -cp /software/bbmap/current/ align2.BBSplitter build=1 ow=t fastareadlen=500 minhits=2 minratio=0.9 maxindel=20 trim=rl untrim=t in1=/tmp/487047.1.standard.q/inpu
t/Sample1_R1.fastq.gz in2=/tmp/487047.1.standard.q/input/Sample1_R2.fastq.gz ref=/pbtech_mounts/fdlab_store003/fdlab/genomes/human/hg19/indexes/star/hg19.fa,/pbtech_mounts/fdlab_store003/fdlab/genomes/mouse/mm10/genome/mm10.fa minratio=0.5 maxindel=500 minhits=1 basename=/tmp/487047.1.standard.q/dest/Sample1__%.fq ambig2=all local
Executing align2.BBSplitter [build=1, ow=t, fastareadlen=500, minhits=2, minratio=0.9, maxindel=20, trim=rl, untrim=t, in1=/tmp/487047.1.standard.q/input/Sample1_R1.fastq.gz, in2=/tmp/487047.1.standard.q/input/Sample1_R2.fastq.gz, ref=/pbtech_mounts/fdlab_store003/fdlab/genomes/human/hg19/indexes/star/hg19.fa,/pbtech_mounts/fdlab_store003/fdlab/genomes/mouse/mm1
0/genome/mm10.fa, minratio=0.5, maxindel=500, minhits=1, basename=/tmp/487047.1.standard.q/dest/Sample1__%.fq, ambig2=all, local]

Converted arguments to [build=1, ow=t, fastareadlen=500, minhits=2, minratio=0.9, maxindel=20, trim=rl, untrim=t, in1=/tmp/487047.1.standard.q/input/Sample1_R1.fastq.gz, in2=/tmp/487047.1.standard.q/input/Sample1_R2.fastq.gz, minratio=0.5, maxindel=500, minhits=1, basename=/tmp/487047.1.standard.q/dest/Sample1__%.fq, ambig2=all, local, ref_hg19=/pbtech_mounts/fd
lab_store003/fdlab/genomes/human/hg19/indexes/star/hg19.fa, ref_mm10=/pbtech_mounts/fdlab_store003/fdlab/genomes/mouse/mm10/genome/mm10.fa]
Creating merged reference file ref/genome/1/merged_ref_6708417211327.fa.gz
Ref merge time:    173.220 seconds.
Executing align2.BBMap [build=1, ow=t, fastareadlen=500, minhits=2, minratio=0.9, maxindel=20, trim=rl, untrim=t, in1=/tmp/487047.1.standard.q/input/Sample1_R1.fastq.gz, in2=/tmp/487047.1.standard.q/input/Sample1_R2.fastq.gz, minratio=0.5, maxindel=500, minhits=1, ambig2=all, local, ref=ref/genome/1/merged_ref_6708417211327.fa.gz, out_hg19=/tmp/487047.1.standard
.q/dest/Sample1__hg19.fq, out_mm10=/tmp/487047.1.standard.q/dest/Sample1__mm10.fq]

BBMap version 34.94
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.900
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.500
Retaining first best site only for ambiguous mappings.
NOTE: Deleting contents of ref/genome/1 because reference is specified and overwrite=true
Writing reference.
Executing dna.FastaToChromArrays2 [ref/genome/1/merged_ref_6708417211327.fa.gz, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

Set genScaffoldInfo=true
Writing chunk 1
Writing chunk 2
Writing chunk 3
Writing chunk 4
Writing chunk 5
Writing chunk 6
Writing chunk 7
Writing chunk 8
Writing chunk 9
Writing chunk 10
Writing chunk 11
Writing chunk 12
Set genome to 1

Loaded Reference: 0.004 seconds.
Loading index for chunk 1-12, build 1
No index available; generating from reference genome: /tmp/487047.1.standard.q/ref/index/1/chr1-3_index_k13_c2_b1.block
Indexing threads started for block 0-3
Indexing threads finished for block 0-3
No index available; generating from reference genome: /tmp/487047.1.standard.q/ref/index/1/chr4-7_index_k13_c2_b1.block
Indexing threads started for block 4-7
Indexing threads finished for block 4-7
No index available; generating from reference genome: /tmp/487047.1.standard.q/ref/index/1/chr8-11_index_k13_c2_b1.block
Indexing threads started for block 8-11
Indexing threads finished for block 8-11
No index available; generating from reference genome: /tmp/487047.1.standard.q/ref/index/1/chr12_index_k13_c2_b1.block
Indexing threads started for block 12-12
Indexing threads finished for block 12-12
Generated Index: 717.384 seconds.
Analyzed Index:   15.050 seconds.
Reads that map to multiple references will be written to all relevant output streams.
Cleared Memory:   9.979 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 4 mapping threads.

Now,I am running with this command after adding threads,fast parameters.I think earlier I was running with one thread only.

software/bbmap/bbsplit.sh \
  in1=$TMPDIR/input/${input_file} \
  in2=$TMPDIR/input/${input_file%R1*}R2.fastq.gz \
  ref=/pbtech_mounts/fdlab_store003/fdlab/genomes/human/hg19/indexes/star/hg19.fa,/pbtech_mounts/fdlab_store003/fdlab/genomes/mouse/mm10/genome/mm10.fa \
  minratio=0.56 \
  maxindel=500 \
  minhits=1 \
  basename=$TMPDIR/dest/${input_file%R1*}_%.fq \
  ambig2=all \
  local \
  fast=<f> \
  threads=<4>
ADD REPLY
0
Entering edit mode

Also is there any way to directly generate the compressed fastq files?

ADD REPLY
0
Entering edit mode

Output files will be written compressed if you add the suffix ".gz" to them:

basename=$TMPDIR/dest/${input_file%R1*}_%.fq

Also, are you sure you only have 4 cores? That's not very many for a machine with ~192 gigs of RAM. There's probably a better way but you can normally find out how many logical processors are present with cat /proc/cpuinfo

ADD REPLY
0
Entering edit mode

Since I am working on cluster.I am using -smp as well.

#!/bin/bash -l
#$ -j y
#$ -cwd -S /bin/sh
#$ -l h_vmem=45G
#$ -pe smp 4

Initially this was the reason that I did not added threads to parallelize.

ADD REPLY
0
Entering edit mode

Hi Brian,

So I increased number of threads to 16.The job has been running since yesterday almost 24 hours.The current file size is ~4 gb (human reads fastq) only and the input files are 50G(paired end) each.The parameters of my shell script on cluster are in the previous comment.I think this will take a lot of time.Let me know if you could assist in making this run faster.

ADD REPLY
0
Entering edit mode

I'm not sure why it's so slow... but, you can greatly speed it up like this:

seal.sh in=reads.fq pattern=out_#.fq ref=human.fa,mouse.fa ambig=all k=27 prealloc speed=8 outu=nonmatching.fq threads=16

This will require a little over 60GB RAM (total, not per-node) and run very fast, particularly if you give it 16 threads (Seal is usually around 100x faster than BBSplit, though it takes more memory). It works by counting kmer matches rather than alignment. You can then run BBSplit on just the unmapped reads if you want, or include them with human.

ADD REPLY
0
Entering edit mode
bbmap/seal.sh in1=$TMPDIR/input/${input_file} in2=$TMPDIR/input/${input_file%R1*}R2.fastq.gz ref=/pbtech_mounts/fdlab_store003/fdlab/genomes/human/hg19/indexes/star/hg19.fa,/pbtech_mounts/fdlab_store003/fdlab/genomes/mouse/mm10/genome/mm10.fa  pattern=$TMPDIR/dest/${input_file%R1*}_%.fq.gz ambig=all k=27 threads=32 prealloc speed=8

Using this I am getting these output files

Sample1__chr10.fq.gz
Sample1__chr11.fq.gz
Sample1__chr12.fq.gz
Sample1__chr13.fq.gz
Sample1__chr14.fq.gz
Sample1__chr15.fq.gz
Sample1__chr16.fq.gz
Sample1__chr17.fq.gz
Sample1__chr18.fq.gz
Sample1__chr19.fq.gz
Sample1__chr1.fq.gz
Sample1__chr20.fq.gz
Sample1__chr21.fq.gz
Sample1__chr22.fq.gz
Sample1__chr2.fq.gz
Sample1__chr3.fq.gz
Sample1__chr4.fq.gz
Sample1__chr5.fq.gz
Sample1__chr6.fq.gz
Sample1__chr7.fq.gz
Sample1__chr8.fq.gz
Sample1__chr9.fq.gz
Sample1__chrM.fq.gz
Sample1__chrX.fq.gz
Sample1__chrY.fq.gz

Am not sure if mouse reads are getting generated here(although the job is still running).

it seems pretty fast though.

ADD REPLY
0
Entering edit mode

Ah, I forgot about that. By default Seal splits on a per-reference-sequence basis. You can use the flag refnames=t to make one output file per reference file instead of one file per reference sequence.

ADD REPLY
0
Entering edit mode

Would I have to merge the fastq of all chromosomes in the end, or will generate a merge fastq for human and merged fastq for mouse?

ADD REPLY
0
Entering edit mode

With refnames=t you would get 2 fastq files, one for human and one for mouse. Without that flag it will produce one file per chromosome and you'd need to merge them. But, Seal makes an assumption that each sequence has a unique name when splitting them by sequence, so if your mouse and human references have sequences with the same name (e.g. chr1) then you need to to re-run it with the refnames=t flag.

ADD REPLY
0
Entering edit mode

Hi Brian,

Thanks much,this worked pretty fast.is there any way you would know of analyzing the mouse reads and getting read counts per gene?This is WGS data.I have done read counts from RNA-seq data using htseq-count.

ADD REPLY
0
Entering edit mode

Hi Brian, is there a publication or a description of this tool available? I will try it in few days. Thanks a lot

ADD REPLY
0
Entering edit mode

There's no publication yet, but there's a description of it here. Also, when you download the BBMap package, there's a guide for Seal (which can be used similarly to BBSplit, but uses kmer-matching instead of mapping) in /bbmap/docs/guides/SealGuide.txt which gives a description of Seal. I haven't written one for BBSplit yet; I should do that.

ADD REPLY
0
Entering edit mode

Hi Brian, How can we output the unmapped reads that don't map to either human or mouse, for further analysis?I am using seal script in bbmap

ADD REPLY
0
Entering edit mode

Use the outu flag, e.g.:

seal.sh in=reads.fq pattern=mapped_%.fq outu=unmapped.fq ref=[references]

ADD REPLY
0
Entering edit mode

Thanks a lot. I try to use it now and I've a new question. Is it possible to use splitted fastq? Or I must concatenate them before?

Here is my command line I have used to try:

../bbsplit.sh in=/raw/fastq/141_CACTTCGA_L003_R#_*.fastq.gz ref=human.fa,mouse.fa basename=out_%_R#.fastq.gz outu1=unmapped1.fastq.gz outu2=unmapped2.fastq.gz threads=20

Thanks a lot in advance

ADD REPLY
0
Entering edit mode

That should work fine. Note that there is a difference between "concatenated" and "interleaved" fastq - never concatenate read1 and read2 files; that would break pairing. If you wanted to turn them into a single file, you would do something like this:

reformat.sh in=r#.fq out=interleaved.fq
ADD REPLY
0
Entering edit mode

Of course, I don't want to concatenate R1 and R2. But, for one sample, I have several fastq splitted for R1, and the same number for R2 (splitted by illumina software casava). I wondering if it is possible to run only one bbplit, but I will launch it several times, in parrallel, and then concatenate results files. Thanks again

ADD REPLY
0
Entering edit mode

Hi Brian, Is there any way to split ribosomal rna reads using this? Let me know.Can we do by including the fast file for rRNA?If yes,How can we get this file?

Thanks, Ron

ADD REPLY
0
Entering edit mode

Hi Ron,

To use BBSplit you must have at least two references; in this case, a ribosomal reference an another for the main genome without ribosomal sequences. If you only have the ribosomal sequence, you can just map to it with BBMap using fairly strict parameters and collect the unmapped reads:

bbmap.sh in=reads.fq ref=ribo.fa maxindel=1 minid=0.95 outm=ribo_reads.fq outu=unmapped.fq

Depending on the situation, if you have sufficient data, and you don't have a ribosomal reference, you could try to assemble the ribosomes on the assumption that they have much higher coverage than the rest. What kind of experiment are you doing, on what kind of organism, and how much coverage do you have? How long are the reads?

ADD REPLY
0
Entering edit mode

Hi Brian,

Just want to look for contamination ,we think that there are lot of reads present as rRNA in our human samples RNA seq data. Although we have sufficient amount of human reads ~40 million mapped but there are still lot of unmapped reads.The reads are 100 bp paired end.I don't have a rRNA fasta file so would have to assemble it.

ADD REPLY
0
Entering edit mode

I found this rRNA fasta file here: http://genomespot.blogspot.com/2015/08/screen-for-rrna-contamination-in-rna.html

Do you think this should be enough?

ADD REPLY
5
Entering edit mode
5.4 years ago

We're testing out XenoSplit right now and it seems to work fairly well. I've also used NGS-disambiguate in the past with success.

ADD COMMENT

Login before adding your answer.

Traffic: 2032 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6