I want to extract, from the reads of a whole run (ERR949847), only the katG-related reads. I downloaded the katG sequence from the NCBI Gene dataset. Here is the header of the FASTA file:
>NC_000962.3:c2156111-2153889 katG [organism=Mycobacterium tuberculosis H37Rv] [GeneID=885638] [chromosome=]
For completeness, the reads of the run I mentioned above belong to the same organism, Mycobacterium tuberculosis H37Rv. To achieve my task, I followed a few steps that seemed reasonable to me:
- Send the run to BLAST through the NCBI site.
- Copy the gene sequence into the "query sequence" input field.
- Start the alignment.
This is the output that I got:
>gnl|SRA|ERR949847.2068134.2 2068134
TGTCCCAGGCAGCGACGAAGTCCTGCACGAACTTCGGCTGCGCGTCATCGGCGCCATAGACCTCGACAAG
CGCCC
>gnl|SRA|ERR949847.2068134.1 2068134
TGGCAAGGTGAAGTGGACCGGCAGCCGCGTGGACCTGGTCTTCGGGTCCAACTCGGAGTTGCGGGCGCTT
GTCGA
>gnl|SRA|ERR949847.1529974.2 1529974
CTTGTACCAGGCCTTGGCGAACTCGTCGGCCAATTCCTCGGGGTGTTCCAGCCAGCGACGCGTGATCCGC
TCATA
>gnl|SRA|ERR949847.1529974.1 1529974
GGCCACTGACCTCTCGCTGCGGGTGGATCCGATCTATGAGCGGATCACGCGTCGCTGGCTGGAACACCCC
GAGGA
>gnl|SRA|ERR949847.1388549.2 1388549
TGTCCCAGGCAGCGACGAAGTCCTGCACGAACTTCGGCTGCGCGTCATCGGCGCCATAGACCTCGACAAG
CGCCC
>gnl|SRA|ERR949847.1388549.1 1388549
TGGCAAGGTGAAGTGGACCGGCAGCCGCGTGGACCTGGTCTTCGGGTCCAACTCGGAGTTGCGGGCGCTT
GTCGA
>gnl|SRA|ERR949847.1227510.2 1227510
CTTGTACCAGGCCTTGGCGAACTCGTCGGCCAATTCCTCGGGGTGTTCCAGCCAGCGACGCGTGATCCGC
TCATA
>gnl|SRA|ERR949847.1227510.1 1227510
GGCCACTGACCTCTCGCTGCGGGTGGATCCGATCTATGAGCGGATCACGCGTCGCTGGCTGGAACACCCC
GAGGA
>gnl|SRA|ERR949847.1100314.2 1100314
GACGCATCCGTGCGGCCCGGGGTGAAGGGCACCGTGATGTTGTGGCCAGCCGCCTTTGCTGCTTTCTCTA
TGGCG
>gnl|SRA|ERR949847.1100314.1 1100314
CAAGGTCATTCGCACCCTGGAAGAGATCCAGGAGTCATTCAACTCCGCGGCGCCGGGGAACATCAAAGTG
TCCTT
>gnl|SRA|ERR949847.316224.2 316224
TGCATGCCGCCCCCGGCGCCGCCGCGGCCGTCGTGGATGCGGTAGGTGCCGGCAGCGTGCCACGCCATCC
GGATA
>gnl|SRA|ERR949847.316224.1 316224
GACCATCGACGTTGACGCCCTGACGCGGGACATCGAGGAAGTGATGACCACCTCGCAGCCGTGGTGGCCC
GCCGA
>gnl|SRA|ERR949847.301240.1 301240
GGCCACTGACCTCTCGCTGCGGGTGGATCCGATCTATGAGCGGATCACGCGTCGCTGGCTGGAACACCCC
GAGGA
>gnl|SRA|ERR949847.301240.2 301240
CTTGTACCAGGCCTTGGCGAACTCGTCGGCCAATTCCTCGGGGAGCTCCAGCCAGCGACGCGTGATCCGC
TCATA
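To make hits like these easier to check, the headers can be grouped by spot, since each one appears to follow the pattern `run.spot.mate` (for example `ERR949847.2068134.2`). This is only a minimal sketch in Python, assuming the BLAST hits were saved as FASTA text; the function name `group_reads_by_spot` and the variable names are my own and not part of any NCBI tool.

```python
from collections import defaultdict


def group_reads_by_spot(fasta_text):
    """Map each spot ID to {mate number: sequence} for FASTA-formatted hits.

    Assumes headers of the form '>gnl|SRA|<run>.<spot>.<mate> <spot>'.
    """
    spots = defaultdict(dict)
    spot = mate = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Take the last '|'-separated field of the first token,
            # e.g. 'ERR949847.2068134.2', and split it into its parts.
            accession = line[1:].split()[0].split("|")[-1]
            _run, spot, mate = accession.split(".")
            spots[spot][mate] = ""
        elif spot is not None:
            # Sequence lines are wrapped, so append to the current record.
            spots[spot][mate] += line
    return dict(spots)


# The first two records from the output above.
example = """>gnl|SRA|ERR949847.2068134.2 2068134
TGTCCCAGGCAGCGACGAAGTCCTGCACGAACTTCGGCTGCGCGTCATCGGCGCCATAGACCTCGACAAG
CGCCC
>gnl|SRA|ERR949847.2068134.1 2068134
TGGCAAGGTGAAGTGGACCGGCAGCCGCGTGGACCTGGTCTTCGGGTCCAACTCGGAGTTGCGGGCGCTT
GTCGA
"""

reads = group_reads_by_spot(example)
for spot_id, mates in reads.items():
    print(spot_id, sorted(mates))
```

Grouping this way shows which spots have both mates among the hits, and makes it easy to spot identical sequences that come from different spots.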
Now I have some questions:
- Does this make sense?
- If it does, is there a better way to get the same result?
P.S.: I'm sorry if I made mistakes or didn't follow the correct conventions; I'm a newbie!
OK! I didn't know that. Thank you for the explanation; I will have to deal with that. No, not an assembly, at least not directly. I want to run a drug susceptibility test with a tool called Mykrobe, and my goal is to build a dataset to use as its input.