I've extracted the gene sequence of a particular gene from my single cell fastq data
I have seven bam files (one per each fastq file) that looks as follows:
A00126:151:H7JYWDSX2:4:2239:18222:26005 16 macaca-ref-custom_chr5 61645096 60 51S100M * 0 0 TACGACTACAGATTCATCCACGTGCTTGAGACTGTGGGCTTATAGTGGTGTGGCTTTGTAAACTGTTCTCTAGAAATATATTTATTCATGCAAACATGTATAAGTCACTCTGTATTTTTATACAACATAAGATTGTTTATGATCTTATGAC FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF,FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:1 MD:Z:27C72 AS:i:95 XS:i:21
A00126:187:HMKFWDSX2:4:2449:31213:26396 16 macaca-ref-custom_chr5 61645096 60 21S130M * 0 0 ACTGTGGGCTTATAGTGGTGTGGCTTTGTAAACTGTTCTCTAGAAATATATTTATTCATGCAAACATGTATAAGTCACTCTGTATTTTTATACAACATAAGATTGTTTATGATCTTATGACCATTTTTGTTTTTTTAAATTTTTTTTTGCC FFFFFFFFFFFFFFFF:FFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF NM:i:1 MD:Z:27C102 AS:i:125 XS:i:21
A00126:151:H7JYWDSX2:4:2676:24623:35305 16 macaca-ref-custom_chr5 61645127 60 149M2S * 0 0 TATTCATGCAAACATGTATAAGTCACTCTGTATTTTTATACAACATAAGATTGTTTATGATCTTATGACCATTTTTGTTTTTTTAAATTTTTTTTTGCCAGAGAGGGCTTTCAGTAAATGTTTTATTTAGAACATTCTTTGAAAAAATTTG FFFFF:FFFFFFFFFFFFFF:FFFFFFFFF:FFFFFFF:FFFFFFFF:FF:FFFFFFFFFFFFFFFFFFF:F,FF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:149 AS:i:149 XS:i:25
A00126:187:HMKFWDSX2:4:1378:32298:12242 16 macaca-ref-custom_chr5 61645127 60 149M2S * 0 0 TATTCATGCAAACATGTATAAGTCACTCTGTATTTTTATACAACATAAGATTGTTTATGATCTTATGACCATTTTTGTTTTTTTAAATTTTTTTTTGCCAGAGAGGGCTTTCAGTAAATGTTTTATTTAGAACATTCTTTGAAAAAATTTG FFFFFFFFFFFFFF:FFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:149 AS:i:149 XS:i:25
Additionally I have a reference sequence fa file that is specific to the gene of interest:
macaca-ref-custom_chr5:61463887-61471702 AAAGGAAAGACTCATGTTTATTGAAAACAACCATCGACCCTTACAATCTTCCTTTCTGGAGAGCTACTAAAATGTT (etc)
macaca-ref-custom_chr5:61482518-61496429 ATGGTTACTCACAGTTTAAGCTTTGCCAGCACCAGTCCCATGGACTCTTGATCCCTACTTGGTAAC (etc)
I would like to take the reads from my data and highlight the location they map to in my reference sequence. Is there a straight forward way to do this?
Ideal output (doesn't have to be bolded, but some way to draw attention to the location of the aligned sequences in my reference sequence for the gene of interest):
macaca-ref-custom_chr5:61463887-61471702 AAAGGAAAGACTCATGTTTATTGAAAACAACCATCGACCCTTACAATCTTCCTTTCTGGAGAGCTACTAAAATGTT (etc)'
macaca-ref-custom_chr5:61482518-61496429 ATGGTTACTCACAGTTTAAGCTTTGCCAGCACCAGTCCCATGGACTCTTGATCCCTACTTGGTAAC (etc)
If this "reference" sequence is already in the genome that the original data has been aligned against you could simply extract the regions from you BAM (ref: Extract Reads From A Bam File That Fall Within A Given Region, if you wanted them isolated in a subset BAM file) otherwise you could simply look at the regions in a browser like IGV.
If I am misinterpreting the query then please clarify.
I already did what you suggested. I had single cell data that I aligned to the genome. I then extracted the reads for my gene of interest. This is example one.
I also pulled out the whole sequence of just my gene of interest from the genome. This is example two (its much longer, but I cut it off for example purposes.)
I would like to highlight where the exonic reads (the reads that aligned to the mrna in my single cell data) are in the genome sequence.
Right now I have multiple 150bp sequences that I pulled from my single cell data. However, I need to pull a single 1000bp sequence for my gene of interest. I am looking for a sequence that looks something like 150bp sequence from data-likely intronic region-150bp-intronic-150bp etc
I would like to create something that looks like example three, so I can accomplish this (the sequences are much longer then they are in the example)
Have you verified that you have coverage across the entire region? In general sc data is mostly from 3'-end of transcripts so you may not have entire gene.
If you have the reads pulled out in the region you could convert them to fasta format and then do a multiple sequence alignment using your gene as reference.
You could also try: https://github.com/orangeSi/bam2msa