6.1 years ago
Jusnib
I need to extract the promoter and gene sequences of a few (5) genes from more than 100 soybean lines. However, I have only the raw reads of the genomes. Mapping the reads of all the lines to the soybean genome will take a very long time. Is there a quicker way to extract those sequences?
To extract the gene/promoter sequences from your raw reads, you have to map them to some reference. Here, the reference does not have to be the whole genome.
You can build your own custom reference database from the gene/promoter sequences of interest and then map your raw reads to this small database with a sequence alignment tool (since you have raw sequencing reads, I would suggest a short-read aligner such as BWA, Bowtie or Bowtie2). This will save you time.
At the end of the alignment, you will have the reads from each line that are similar to the sequences in the custom gene/promoter database.
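As a rough sketch of the approach above, the commands for one soybean line could be generated like this. The file names (`targets.fa`, `line001_R1.fastq.gz`, and so on), the paired-end layout and the thread count are assumptions for illustration, not details from the question. The script only builds the command strings; it assumes `bwa` and `samtools` are installed if you actually run them.

```python
import shlex


def index_command(reference):
    """Command to index the small gene/promoter FASTA (needed only once)."""
    return f"bwa index {shlex.quote(reference)}"


def sample_commands(reference, reads_1, reads_2, sample, threads=4):
    """Commands to map one line's paired reads and keep only mapped reads.

    `samtools view -F 4` drops records whose 'unmapped' flag is set, so the
    sorted BAM contains only reads that hit the gene/promoter reference.
    """
    ref = shlex.quote(reference)
    r1, r2 = shlex.quote(reads_1), shlex.quote(reads_2)
    bam = shlex.quote(f"{sample}.targets.bam")  # hypothetical output name
    return [
        f"bwa mem -t {threads} {ref} {r1} {r2}"
        f" | samtools view -b -F 4 -"
        f" | samtools sort -o {bam} -",
        f"samtools index {bam}",
    ]


if __name__ == "__main__":
    print(index_command("targets.fa"))
    for cmd in sample_commands(
        "targets.fa", "line001_R1.fastq.gz", "line001_R2.fastq.gz", "line001"
    ):
        print(cmd)
```

Looping `sample_commands` over all 100+ lines gives one small BAM per line. Note that with such a small reference, reads from similar regions elsewhere in the genome may also map to your targets, so check the alignments before calling sequences from them.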
Perhaps BLAST?
If I understand correctly, you have sequences for 5 genes and you want to extract all of the WGS reads that map to these genes. Am I correct? If so, BLAST is probably your best option. If the data are already in the SRA, it is even easier: you can use web BLAST with your gene sequence as the query and the WGS SRA project as the subject database. If the data are not in the SRA, you can run BLAST locally.
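If you run BLAST locally with the genes as the query and the reads as the database, and ask for tabular output (`-outfmt 6`), collecting the IDs of the matching reads could look roughly like this. The identity and length cutoffs are arbitrary assumptions to tune for your data, and the function name is made up for the example.

```python
def hit_read_ids(blast_tab_lines, min_identity=95.0, min_length=50):
    """Collect subject (read) IDs from BLAST tabular output (-outfmt 6).

    Default outfmt 6 columns: qseqid, sseqid, pident, length, mismatch,
    gapopen, qstart, qend, sstart, send, evalue, bitscore.
    The thresholds are illustrative assumptions, not recommended values.
    """
    read_ids = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        read_id = fields[1]            # sseqid: the read, since reads are the database
        identity = float(fields[2])    # pident: percent identity
        aln_length = int(fields[3])    # length: alignment length
        if identity >= min_identity and aln_length >= min_length:
            read_ids.add(read_id)
    return read_ids


if __name__ == "__main__":
    example = [
        "gene1\tread_17\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-45\t180",
        "gene1\tread_42\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-10\t60",
    ]
    print(sorted(hit_read_ids(example)))
```

The resulting ID list can then be used to pull those reads out of the FASTQ files for each line.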
Is there any particular reason why you don't want to assemble the reads first?
These are not my data; I got these raw reads from our collaborator. I need the sequences of only a few genes and, if possible, I would like to avoid spending time assembling the reads. If there is no other way, I will assemble them.
You could also pseudo-align to the FASTA mRNA sequences for the genes of interest using Kallisto or Salmon, produce a pseudobam from this pseudo-alignment, and then extract the reads that have aligned from the BAM. Be aware of the biases in these steps, though.
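To illustrate the last step, here is a minimal sketch that lists the reads that aligned. It assumes the pseudobam has first been converted to SAM text (for example with `samtools view -h`); the function name is hypothetical. It relies only on the SAM flag field, where bit 0x4 marks an unmapped read.

```python
def mapped_read_names(sam_lines):
    """Return names of reads with at least one mapped record, in input order."""
    names = []
    seen = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        name, flag = fields[0], int(fields[1])
        if flag & 0x4:
            continue  # 0x4 set: this record is unmapped
        if name not in seen:
            seen.add(name)
            names.append(name)
    return names


if __name__ == "__main__":
    example = [
        "@SQ\tSN:gene1\tLN:1500",
        "read_1\t0\tgene1\t10\t255\t5M\t*\t0\t0\tACGTA\tIIIII",
        "read_2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTT\tIIIII",
    ]
    print(mapped_read_names(example))
```

In practice `samtools view -F 4` does the same filtering directly on the BAM; the Python version is only meant to show what is being selected.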
Otherwise, assemble the genome and generally follow the steps described by Nitin Narwade.