I have RNA-seq data of a wheat translocation line which has its chromosome short arm replaced by another wild variety chromosome short arm. There is no genomic or transcriptomic information available for that wild variety. I want to see what kind of genes are expressing from the new short arm. For this I was thinking if I can do denovo assembly of the unmapped reads and then blast the assembled transcripts with nr database. I want to know if this strategy is correct or not or if there is a better way to do it.
update: number of unmapped read pairs is around 8 million, total reads were around 33 million and I used the following parameters in hisat2
hisat2 -x wheat_hisat_index -1 13R_R1.fastq -2 13R_R2.fastq -S 13.sam -p 12 --mp 8,4 --rdg 7,5 --rfg 7,5 --un-conc 13_unmapped_pairs
Provided you have an adequate amount of reads left over to do a reasonable assembly. What fraction of data has been left over? You are only going to capture genes that are unique for that short arm.
I also have the same doubt. Currently the job is running, Soon I will update the number of unmapped reads left and my aim is to capture only the unique genes.
I've updated the question.
Good point, I gave you my answer below but it holds only as long as @genomax's point is fulfilled.
To the updated question:
your mismatch penalty seems to be quite high compared to what the threshold for alignment scores is if you don't touch the scoring function.
Increasing penalties for gaps and mismatches obviously results in more unmapped reads.
I did not understand the --score-min parameter very well from the manual. Can you suggest me some better values?
There is no such thing as "better values". Parameters have to be chosen according to your data. From the manual:
What is your maximum read length after trimming? Let's say it's 100 nt for example. In this function, this would be multiplied by the third parameter, and after the multiplication the second parameter will be added. The final value is the threshold for the alignment score that is applied to keep or discard a mapping result of a read. That score is negative and every time you have a mismatch or a gap the algorithm will subtract the penalties that you specified for gaps and mismatches.
I think you should calculate the alignment score threshold in a way that you get what you want. For example, if you want to call variants between strains of the same species, you might allow very few mismtaches because you know that the genomes are very similar. If you're mapping reads from different species to a reference, you might allow more gaps/mismtaches because biologically speaking they're expected.
Thanks Macspider, I got it.