Combined paired-end reads from different species
7.4 years ago
Star70 • 0

I have a set of paired-end reads (RNA-seq data) from three unknown species. The data from the three species are pooled, and we do not know which read belongs to which species. How can I identify the species from these reads?

RNA-Seq • 2.5k views

If it's a microbial sample, run it through Kraken.

That is a really strange scenario, but if you want to find the species in the RNA-seq sample, you could run DIAMOND to classify each read. DIAMOND does not support paired-end data, but you can run each read file separately.

It is also advisable to write a title that reflects the actual question being asked; the title you used does not really summarise it.

Thanks for your guidelines. I edited my question and its title to show my main goal.

I used PEAR to merge the paired-end RNA-seq reads: https://cme.h-its.org/exelixis/web/software/pear/

pear -f 16_S3_L001_R1_001.fastq.gz -r 16_S3_L001_R2_001.fastq.gz -o 16_S3
7.4 years ago
h.mon 35k

First approach:

Use Kraken, CLARK or Centrifuge to classify your reads. Centrifuge provides a pre-indexed NCBI nt database on its site, so it may be the best choice here (indexing takes a long time and needs a good amount of memory). Then use KronaTools to explore the taxonomic distribution of the reads.

Second approach:

Use Tadpole to assemble a draft transcriptome, then run DIAMOND (blastx search) against the NCBI nr protein database, since DIAMOND aligns against protein references rather than nucleotide ones. You will have to download the FASTA file and build the index. Finally, use KronaTools to explore the taxonomic distribution of the transcripts.

Another possibility for taxonomic classification, which can be done on the raw reads, is BBSketch:

sendsketch.sh in=reads.fq reads=1m nt

That will compare the reads in the file to nt (alternatively, you can use the flag "refseq", which is a bigger database but less well curated). It will only take a few seconds.

7.4 years ago
GenoMax 147k

Why not take a few reads and blast them at NCBI? Wording of the original question seems to indicate that there is only one species involved.

Edit: New information says that three species are involved, though it is unclear whether their identities are known. If they are, this becomes a simple matter of using bbsplit.sh from the BBMap suite with the three references to bin the reads into three pools.

Thank you very much for your answer. However, these reads come from three different species, and we do not know which read comes from which species. I need to identify these three species.

Three is not too bad. I suggest that you take a random sample of reads and blast them at NCBI. All the other options suggested here will require you to jump through several hoops and can require significant local compute resources.

Note: Can you specify whether the species are related to each other or are completely different organisms?
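
If it helps, here is a minimal sketch of the random-sampling step. It assumes plain or gzipped FASTQ with exactly four lines per record (no wrapped sequences), and the function names `sample_reads` and `write_fasta` are made up for illustration. It draws a fixed-size random sample by reservoir sampling, so the whole file never has to sit in memory, and writes the sample as FASTA that can be pasted into the NCBI BLAST web form.

```python
import gzip
import random
from itertools import islice


def open_fastq(path):
    """Open a FASTQ file as text, handling .gz transparently."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def sample_reads(fastq_path, n=100, seed=1):
    """Reservoir-sample up to n reads; returns a list of (read_id, sequence).

    Assumption: every record is exactly four lines
    (header, sequence, '+', qualities).
    """
    rng = random.Random(seed)
    reservoir = []
    with open_fastq(fastq_path) as handle:
        i = 0
        while True:
            record = list(islice(handle, 4))
            if len(record) < 4:
                break
            read = (record[0][1:].split()[0], record[1].strip())
            if i < n:
                reservoir.append(read)
            else:
                # Keep this read with probability n / (i + 1).
                j = rng.randint(0, i)
                if j < n:
                    reservoir[j] = read
            i += 1
    return reservoir


def write_fasta(reads, out_path):
    """Write (read_id, sequence) pairs as FASTA."""
    with open(out_path, "w") as out:
        for read_id, seq in reads:
            out.write(f">{read_id}\n{seq}\n")
```

For example, `write_fasta(sample_reads("reads_R1.fastq.gz", n=100), "sample.fa")`, where both file names are placeholders. A hundred or so reads is usually enough to see which taxa dominate the hits.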

Is it all from the same sequencing run? Can you not separate out the 3 species by their FASTQ headers for each run perhaps?
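
A quick way to check this is to tally the reads by the first four colon-separated header fields. The sketch below assumes Illumina-style headers (Casava 1.8+ layout: instrument, run number, flowcell ID, lane, ...) and four-line records; `header_groups` is a hypothetical helper name. If every read falls into a single group, the headers cannot separate the species.

```python
from collections import Counter


def header_groups(fastq_lines):
    """Count reads per (instrument, run, flowcell, lane).

    Assumption: Illumina-style headers such as
    @INSTRUMENT:RUN:FLOWCELL:LANE:TILE:X:Y 1:N:0:INDEX
    and exactly four lines per FASTQ record.
    """
    counts = Counter()
    for line_no, line in enumerate(fastq_lines):
        if line_no % 4 != 0:  # only the header line of each record
            continue
        fields = line[1:].split()[0].split(":")
        key = tuple(fields[:4]) if len(fields) >= 4 else ("unparsed",)
        counts[key] += 1
    return counts
```

It takes any iterable of lines, so an open file handle (for example from `gzip.open(path, "rt")`) can be passed in directly.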
