3.7 years ago
luffy
Dear All,
I am trying to get long non-coding RNA coordinates in GTF format. I have downloaded the file from here, but I want to filter it based on a couple of conditions:
- Remove records with a length of less than 200 bp
- Keep the records that intersect a coding region, including 100 bp upstream and downstream
The first one was easily achievable using Python:
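import pandas as pd
# Read the tab-separated GTF; the nine names label the nine GTF columns
df_nc = pd.read_csv('gencode.v37.hg38.long_noncoding_RNAs.gtf', sep='\t', names=['CHROM', 'HAVANA', 'TYPE', 'START', 'END', 'ID', 'STRAND', 'ID1', 'DETAILS'])
# Keep records whose END - START span exceeds 200
df_nc_len = df_nc[df_nc['END'] - df_nc['START'] > 200]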
How can I go about the next condition?
Also, why do I find exons in the non-coding GTF?
df_nc_len['TYPE'].value_counts()
The 3rd column gives me:
exon 69042
transcript 48673
gene 17882
Any help would be much appreciated
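On the exon question: a GTF file is hierarchical, so each gene record is followed by records for its transcripts and their exons. "Exon" here describes transcript structure, not protein-coding sequence, so lncRNAs have exons too. The sketch below is only an illustration of that layout: the GTF-style lines, the IDs `G1`/`G2`, and the helper `parse` are invented, and it uses plain Python so that it runs without the GENCODE file. It counts feature types and then applies the length filter at gene level only, computing length as END - START + 1 because GTF coordinates are 1-based and inclusive.

```python
from collections import Counter

# Invented GTF-style lines (tab-separated, 9 columns); not real annotations.
TOY_GTF = [
    'chr1\tHAVANA\tgene\t1000\t1499\t.\t+\t.\tgene_id "G1";',
    'chr1\tHAVANA\ttranscript\t1000\t1499\t.\t+\t.\tgene_id "G1";',
    'chr1\tHAVANA\texon\t1000\t1199\t.\t+\t.\tgene_id "G1";',
    'chr1\tHAVANA\texon\t1300\t1499\t.\t+\t.\tgene_id "G1";',
    'chr1\tHAVANA\tgene\t5000\t5149\t.\t-\t.\tgene_id "G2";',
    'chr1\tHAVANA\ttranscript\t5000\t5149\t.\t-\t.\tgene_id "G2";',
    'chr1\tHAVANA\texon\t5000\t5149\t.\t-\t.\tgene_id "G2";',
]


def parse(line):
    """Split one GTF line and keep the fields needed for filtering."""
    chrom, _source, feature, start, end, *_rest = line.split("\t")
    return {"chrom": chrom, "feature": feature,
            "start": int(start), "end": int(end)}


records = [parse(line) for line in TOY_GTF if not line.startswith("#")]

# Equivalent of value_counts() on the third column.
feature_counts = Counter(r["feature"] for r in records)

# Keep one row per locus by selecting a single feature level.
genes = [r for r in records if r["feature"] == "gene"]

# GTF coordinates are 1-based and inclusive, hence the + 1.
long_genes = [r for r in genes if r["end"] - r["start"] + 1 >= 200]
```

Filtering on one feature level first avoids the duplicate hits that arise when the gene, transcript, and exon rows of the same locus all pass through a later intersection step.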
I would tackle this by getting the file in BED format from the UCSC Table Browser, and then using BEDtools intersect. Manually coding genome arithmetic functions in Python is like reinventing the wheel at this point.
@heskett, thank you for your input. Can you please let me know which tracks to choose from the UCSC Table Browser to arrive at only the lncRNA coordinates of the hg38 assembly, and those which intersect a coding region including 100 bp upstream and downstream?
Thank you
I won't do the work for you, but I can point you in the right direction. The GENCODE track will have coding and noncoding genes. It looks like there is a transcriptClass column that says coding or nonCoding. You can download these different files from GENCODE -> Genes and Gene Predictions -> knownGene on the Table Browser site. Then use bedtools to find intersections and limit the overlap to 100 bp. Learning how to use these tools will be very helpful if you continue doing genomics analysis.
https://genome.ucsc.edu/cgi-bin/hgTables
https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
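To make the "100 bp upstream and downstream" condition concrete, here is a minimal sketch of the interval test that such an intersection performs. It is not a replacement for bedtools on real data. The function names, the toy coordinates, and the choice of 1-based inclusive coordinates are assumptions for illustration; BED files are 0-based and half-open, so the exact boundaries may differ by one.

```python
def within_window(a_start, a_end, b_start, b_end, window=100):
    """Return True if interval a overlaps interval b after b is extended
    by `window` bp on each side (same chromosome, inclusive coordinates)."""
    return a_start <= b_end + window and a_end >= b_start - window


def lncrnas_near_coding(lncrnas, coding, window=100):
    """Names of lncRNA intervals within `window` bp of any coding interval."""
    # Group coding intervals by chromosome so only same-chromosome
    # pairs are compared.
    by_chrom = {}
    for chrom, start, end in coding:
        by_chrom.setdefault(chrom, []).append((start, end))

    hits = []
    for name, chrom, start, end in lncrnas:
        candidates = by_chrom.get(chrom, [])
        if any(within_window(start, end, s, e, window) for s, e in candidates):
            hits.append(name)
    return hits


# Invented example intervals, not real annotations.
coding = [("chr1", 1000, 2000)]
lncrnas = [
    ("lncA", "chr1", 2050, 2400),  # starts 50 bp downstream of the coding end
    ("lncB", "chr1", 2200, 2600),  # starts 200 bp downstream
    ("lncC", "chr2", 1500, 1800),  # different chromosome
]

near = lncrnas_near_coding(lncrnas, coding)
```

This pairwise comparison is quadratic per chromosome and is only meant to show the logic. For genome-scale files, bedtools does the same job far more efficiently, and its `window` subcommand with the `-w` option is worth checking against the documentation for this use case.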
Dear heskett, there seems to be a bit of a misunderstanding; that was not my intention. Since I had already tried a similar idea, I was requesting that you elaborate on it.
Things I have tried:
- Downloaded the known coding regions from UCSC (RefSeq track) and used bedtools to intersect the noncoding coordinates (from GENCODE) with the coding regions (from UCSC). I then imported the result into pandas, filtered the overlaps which are less than 200, and did a pandas merge (coding and noncoding). There were duplicates, so I dropped them, although I was not sure about removing duplicates.
- Tried to filter based on transcript type after importing into a DataFrame. The 3rd column (GTF from GENCODE) has different types (exon, transcript, etc.), and the last column, separated by ';', again has different types (lncRNA, misc_RNA, processed_transcript, transcribed_unprocessed_pseudogene, etc.). I am confused about what to keep or drop.
- Made a few more attempts, none of which were successful.
Sorry and Thank you
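One way to work with the ';'-separated last column is to parse it into key/value pairs, so that the feature type (3rd column) and the biotype (an attribute such as `gene_type`) can be filtered separately. The following is a rough sketch under stated assumptions: the regular expression only captures attributes of the form `key "value"`, a repeated key such as `tag` keeps only its last value, and the toy rows and IDs are invented. Which biotypes to keep depends on the analysis and is not decided here.

```python
import re

# Matches attributes written as: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(details):
    """Turn a GTF attribute string into a dict of quoted key/value pairs."""
    return {key: value for key, value in ATTR_RE.findall(details)}


# Invented (feature, attribute string) pairs; not real annotations.
rows = [
    ("gene", 'gene_id "G1"; gene_type "lncRNA"; level 2;'),
    ("transcript", 'gene_id "G1"; gene_type "lncRNA"; transcript_id "T1";'),
    ("gene", 'gene_id "G2"; gene_type "misc_RNA"; level 2;'),
]

# Keep gene-level records whose biotype is lncRNA.
lnc_gene_ids = [
    parse_attributes(details)["gene_id"]
    for feature, details in rows
    if feature == "gene" and parse_attributes(details).get("gene_type") == "lncRNA"
]
```

Selecting a single feature level before intersecting also avoids most of the duplicate rows mentioned above, since a gene and its transcripts and exons would otherwise each produce a separate hit.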