I have managed to get the sequences for about 1,300 genes (out of ~20,000) with BioMart, using DBASS5 Gene Name as the filter.
I have also used the Table Browser from UCSC. Some of the IDs (~3,000, not sure which) are not compatible with the RefSeq gene IDs from their repository. The IDs returned are in the following format:
If I get it right, then the supplemental data comes from the Nature article "Transfection of small RNAs globally perturbs gene regulation by endogenous microRNAs" where the authors carried out experiments on human cell lines, so I assume that the gene names are HGNC symbols (cannot access the full-text right now). Below I describe what I did to get 18,457 3'UTRs from Ensembl's BioMart installation:
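For anyone who prefers to script this rather than click through the web interface, here is a minimal sketch of how such a query could be expressed as BioMart XML. It is not necessarily the exact query I ran. The dataset name `hsapiens_gene_ensembl`, the filter `hgnc_symbol`, and the attribute names (`ensembl_gene_id`, `ensembl_transcript_id`, `3utr`) are assumptions based on Ensembl's usual naming, and `build_utr_query` is just an illustrative helper; please check the names against the "XML" button of your own BioMart session before relying on them.

```python
import xml.etree.ElementTree as ET


def build_utr_query(gene_symbols, dataset="hsapiens_gene_ensembl"):
    """Build a BioMart XML query asking for 3'UTR sequences in FASTA format.

    Dataset, filter and attribute names are assumptions (Ensembl-style
    naming); verify them in the BioMart web interface.
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "FASTA",
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # One filter carrying the whole gene list as comma-separated values.
    ET.SubElement(ds, "Filter", {
        "name": "hgnc_symbol",
        "value": ",".join(gene_symbols),
    })
    # The identifier attributes end up in each FASTA header line.
    for attr in ("ensembl_gene_id", "ensembl_transcript_id", "3utr"):
        ET.SubElement(ds, "Attribute", {"name": attr})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


if __name__ == "__main__":
    # Print the XML; it can be pasted into or posted to a BioMart service.
    print(build_utr_query(["TP53", "BRCA1"]))
```

The sketch only builds the query text and does not contact any server. For ~20,000 gene names it is probably wise to split the list into smaller batches and build one query per batch.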
Unfortunately, the sequence download in BioMart 0.7 is not very reliable, and I suggest you download the same information several times until you have a couple of files of the same size. The file I got was 55,288,206 bytes in size and contained 18,457 entries, whereas the gene name list from the Excel files contains 20,401 gene names.
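To compare repeated downloads, a small script can count the FASTA entries in each file alongside the file size. This is only a sketch: `summarize_fasta` and `describe_download` are illustrative names, and I am assuming that entries without an annotated 3'UTR carry a "Sequence unavailable" line in place of sequence, which you should confirm against your own file.

```python
import os


def summarize_fasta(text):
    """Return (total_entries, entries_without_sequence) for FASTA text.

    Assumes records lacking a 3'UTR are marked with a line starting with
    'Sequence unavailable' (check this in your own download).
    """
    total = 0
    unavailable = 0
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            total += 1
        elif line.lower().startswith("sequence unavailable"):
            unavailable += 1
    return total, unavailable


def describe_download(path):
    """Return (size_in_bytes, total_entries, unavailable_entries) for a file."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        total, unavailable = summarize_fasta(handle.read())
    return os.path.getsize(path), total, unavailable
```

Running `describe_download` on each downloaded file and keeping only those whose sizes and entry counts agree is one way to spot a truncated download.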
Thank you for your detailed post. I could see that BioMart 0.7 is not very reliable, because I have tried different types of filters and I did get sequences every time, anywhere from 20,000 to 60,000 of them. This time the number looks about right :-), so thank you.
I am a little confused as to how you retrieved UTRs for 1,300 genes using only one gene name. Could you describe more exactly what you did in BioMart?
I uploaded a file containing the gene names. That is the filter type I selected for the genes in the drop-down list of Filters/ID_list_limit. Unfortunately, the data contains some gene IDs that are not recognized by that filter. I have also uploaded them in the Genome Browser, and apparently around 3,000 of them are not recognized as RefSeq gene IDs.
Are you interested in pulling out the sequences of the actual 3'UTRs, or would it suffice to retrieve a specific length of sequence after every STOP codon in the coding region? If it's the latter, I can suggest a way to do it in Galaxy.
I need the sequences of the actual UTRs, for motif finding.
I am interested in retrieving the 3'UTR regions of all the reported buffalo genes. How can I do this?