I am looking for a pipeline for mapping BAC-end sequences to a reference genome.
I used to work with Illumina MiSeq 16S rRNA paired-end sequences, and I used QIIME 2 to process them. Now I have a Sanger BAC-end library covering the whole genome, and I need to map the ends to a genome sequence so that I can choose a BAC containing a gene of interest.
I suppose that what I have to do is:
1. Filter out sequences that are 95% or more similar to each other.
2. Remove sequences that contain repeats.
3. Map the remaining paired sequences that have unique hits.
This way I could choose a BAC that contains the gene of interest for further research. In theory.
Can anyone suggest tools for this work and maybe correct my "pipeline"?
I assume this is Sanger sequence data based on the length? You could just use blat from Jim Kent (download if you are a non-commercial user here). Using BLAST+ is also an option but since your sequences should be very homologous to the reference I would start with blat. I don't think there is a need to do steps 1 and 2.
I have done this with Xenopus tropicalis BAC clones in order to identify BACs containing my gene of interest. First of all, keep in mind that some of these clones may have been fully sequenced rather than just end sequenced. In the case of Xenopus tropicalis, there were quite a few BAC clones that had full sequence information.
In my case, there was already a sequenced genome, so I was able to map BAC ends against my loci of interest. If this is not the case for your species, I would take a sufficiently close species with a sequenced genome, take a few hundred thousand base pairs flanking your gene (making sure to include quite a few, ideally large enough, genes) and map the BAC ends against this "pseudo-reference". You can check some synteny databases to find a suitable species.
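As a rough illustration of cutting out such a flanking region, here is a minimal Python sketch. The function names and the toy coordinates are made up for the example, and it assumes the chromosome sequence is already loaded as a plain string with 1-based inclusive gene coordinates.

```python
def flanking_window(gene_start, gene_end, flank, chrom_length):
    """Return 1-based inclusive (start, end) of the gene plus `flank` bp on
    each side, clamped to the chromosome boundaries."""
    if gene_start > gene_end:
        gene_start, gene_end = gene_end, gene_start
    start = max(1, gene_start - flank)
    end = min(chrom_length, gene_end + flank)
    return start, end


def extract_region(sequence, start, end):
    """Slice a 1-based inclusive interval out of a sequence string."""
    return sequence[start - 1:end]


# Hypothetical gene at 500,000-530,000 on a 600,000 bp chromosome,
# with 150 kb requested on each side (the right side gets clamped).
window = flanking_window(500_000, 530_000, 150_000, 600_000)
print(window)
print(extract_region("ACGTACGT", 3, 6))
```

The window is clamped rather than rejected so that genes near a chromosome end still yield a usable region.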
For mapping, I remember trying GMAP, but I ended up using BLAST, because of speed and memory issues as far as I can remember. BLAST would do better anyway if you did not have a sequenced genome.
For filtering, I would remove BACs that do not have both ends sequenced, but you might want to keep as many BACs as possible. In my case I had to order some of these BACs, prep them and sequence them, so I ordered multiple overlapping/tiling BACs to make sure I would have at least one clone that "worked". Also, you would not want to be too stringent about the distance between the "two ends", as there can be quite some variation in that.
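Picking several overlapping clones for one gene could look roughly like the sketch below. It assumes each BAC has already been placed on the genome as (chromosome, start, end) from its two mapped ends. The clone names, the dictionary layout and the coordinates are all invented for the example.

```python
def bacs_covering(placed, chrom, gene_start, gene_end, flank=0):
    """Return names of clones whose interval contains the gene plus `flank`
    bp on each side, shortest insert first."""
    picks = []
    for clone, (c, start, end) in placed.items():
        if c == chrom and start <= gene_start - flank and end >= gene_end + flank:
            picks.append((end - start, clone))
    return [clone for _length, clone in sorted(picks)]


# Hypothetical placements: clone -> (chromosome, start, end)
placed = {
    "bacA": ("chr9", 21_900_001, 22_030_800),
    "bacB": ("chr9", 21_960_000, 22_100_000),
    "bacC": ("chr9", 22_000_000, 22_140_000),  # starts inside the gene
    "bacD": ("chr2", 21_900_000, 22_040_000),  # wrong chromosome
}

print(bacs_covering(placed, "chr9", 21_967_000, 21_995_000))
print(bacs_covering(placed, "chr9", 21_967_000, 21_995_000, flank=10_000))
```

Keeping every covering clone, rather than only the best one, matches the idea of ordering a few candidates in case one does not work.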
I have a sequenced genome. I used BLAST for mapping, but I don't know how to remove BACs with repeats or how to remove identical BAC clones. I tried to find the p16 gene as an example. To make it a bit easier, I mapped not to the whole genome but only to a 300 kb region with the p16 gene in the middle (the insert in each BAC is about 150 kb long). When I finally chose the pair of BAC ends with the best scores, I mapped it to the whole genome, and this pair was found on almost every chromosome. As far as I understand, this indicates that these BAC ends contain a repetitive part of the genome. So I thought that I need to remove BACs with repeats first. I also found a couple of articles in which sequences with 95% or higher similarity were removed, but I don't know what software to use for either of these tasks.
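One way to spot such repeat-containing ends from a whole-genome search is to compare the best and second-best hit of each end. The sketch below assumes BLAST+ tabular output (`-outfmt 6`, default 12 columns). The 0.9 ratio, the read names and the example rows are arbitrary illustrations, not recommended values.

```python
from collections import defaultdict


def parse_blast6(lines):
    """Yield (query, subject, sstart, send, bitscore) from BLAST tabular lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        yield f[0], f[1], int(f[8]), int(f[9]), float(f[11])


def repetitive_ends(lines, ratio=0.9):
    """Return queries whose second-best bitscore is >= ratio * best bitscore."""
    scores = defaultdict(list)
    for query, _subject, _sstart, _send, bits in parse_blast6(lines):
        scores[query].append(bits)
    flagged = set()
    for query, bits in scores.items():
        bits.sort(reverse=True)
        if len(bits) > 1 and bits[1] >= ratio * bits[0]:
            flagged.add(query)
    return flagged


# Made-up rows: bac1.F has one clear best hit, bac2.F has two near-equal hits.
example = [
    "bac1.F\tchr1\t99.5\t800\t4\t0\t1\t800\t1000\t1799\t0.0\t1450",
    "bac1.F\tchr7\t82.0\t300\t54\t0\t1\t300\t500\t799\t1e-50\t200",
    "bac2.F\tchr1\t98.0\t800\t16\t0\t1\t800\t5000\t5799\t0.0\t1380",
    "bac2.F\tchr3\t97.8\t800\t18\t0\t1\t800\t9000\t9799\t0.0\t1370",
]
print(repetitive_ends(example))
```

Note that BLAST can report several HSPs for one locus, so a stricter version might first merge hits that lie close together on the same chromosome.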
I assume you have sequenced from the end of the BAC vector into inserts so what you have is actually 800 bp of real genome sequence? Is that not the case?
If you do then you should be able to take those 800 bp and map them on the reference using blat or blast+. Be sure to trim out any vector sequence before you try to align.
If you have 800 bp of random sequence from a collection of BACs then this will be more complex. You would need to identify a few 800 bp fragments that are near your region of interest and then identify the specific BACs from your collection of clones that contain those fragments.
"So I thought that I need to remove BACs with repeats first"
I guess this is rather tricky, as at least one representative BAC (of the genomic region) should remain. Instead I would start with a filtering criterion based on the two ends of a given clone: you can keep (rather than remove) BACs whose two ends map to the same chromosome, are separated by roughly the "insert size", and have high complexity. High complexity here means that, when an end is mapped against the whole genome, the difference between the BLAST scores of the first and second hits is higher than a threshold (that you will pick). This latter criterion addresses your point on "BACs with repeats".
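The same-chromosome and insert-size part of this criterion could be sketched in Python as follows. It assumes BLAST+ tabular output (`-outfmt 6`) and a hypothetical naming scheme in which the two ends of a clone are called `<clone>.F` and `<clone>.R`. The 80-220 kb window is just a loose guess around the insert sizes mentioned in this thread, and strand orientation is not checked.

```python
def best_hits(lines):
    """Map each query to its top-scoring hit as (chrom, start, end)."""
    best = {}
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue
        query, chrom = f[0], f[1]
        lo, hi = sorted((int(f[8]), int(f[9])))  # sstart > send on minus strand
        bits = float(f[11])
        if query not in best or bits > best[query][0]:
            best[query] = (bits, chrom, lo, hi)
    return {q: v[1:] for q, v in best.items()}


def concordant_bacs(lines, min_span=80_000, max_span=220_000):
    """Return clone -> (chrom, start, end) for clones whose two ends hit the
    same chromosome at a plausible distance from each other."""
    hits = best_hits(lines)
    placed = {}
    for query, (chrom, lo, hi) in hits.items():
        if not query.endswith(".F"):
            continue
        clone = query[:-2]
        mate = hits.get(clone + ".R")
        if mate is None or mate[0] != chrom:
            continue
        start, end = min(lo, mate[1]), max(hi, mate[2])
        if min_span <= end - start <= max_span:
            placed[clone] = (chrom, start, end)
    return placed


# Made-up rows: bacA is concordant, bacB spans two chromosomes, bacC is too short.
example = [
    "bacA.F\tchr9\t99.0\t800\t8\t0\t1\t800\t21900001\t21900800\t0.0\t1400",
    "bacA.R\tchr9\t99.0\t800\t8\t0\t1\t800\t22030800\t22030001\t0.0\t1400",
    "bacB.F\tchr9\t99.0\t800\t8\t0\t1\t800\t21950001\t21950800\t0.0\t1400",
    "bacB.R\tchr2\t99.0\t800\t8\t0\t1\t800\t500800\t500001\t0.0\t1400",
    "bacC.F\tchr9\t99.0\t800\t8\t0\t1\t800\t21000001\t21000800\t0.0\t1400",
    "bacC.R\tchr9\t99.0\t800\t8\t0\t1\t800\t21005800\t21005001\t0.0\t1400",
]
print(concordant_bacs(example))
```

A wide window is used on purpose, since the distance between the two ends can vary quite a bit between clones.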
As far as I understand, what they were doing in that article was filtering out the shorter sequence from each pair of sequences with identity >=95%, but they didn't have a reference genome, so maybe I don't need to do that.
There is one point that I don't quite understand: shouldn't I keep those high-complexity BACs instead of removing them? I thought that such BACs are the representative ones. I also wonder if I can set an approximate insert size using BLAST+?
It would help if you can specify what kind of data you have from the GSS library you mentioned above. Do you have full sequences of BAC or just ends?
Just ends. About 800 bp each. Paired-ends with ~130kb distance between each other.
Thank you, I will try it!
Yes, those sequences are from the ends of the BAC. I will try it!