Filtering Tss List To Remove Tss Within X #Bp From One Another
3
0
Entering edit mode
11.2 years ago
bede.portz ▴ 540

I have a list of RefSeq transcription start sites (TSS) obtained from the UCSC table browser. I want to remove those TSS within X #bp from another TSS on the same list. This will include both genes with multiple TSS, as well as genes on the opposite strand that may be in a head to head arrangement. I say X #of bp as I will be filtering for different distances depending on the analysis. How do I go about this? Is there a tool?

Ideally, I would be able to separately filter out instances of multiple TSS for a given gene and nearby TSS from a gene on the opposite strand.

I apologize as this has probably been answered before, but I have been unable to uncover an answer using the search function.

Thanks.

EDIT:

An example from USCS

As you can see, in this example there are two genes, they both happen to be paralogs of Hsp70, but they could be any two genes, that are oriented in close proximity in a head to head manner. One gene is on the + strand, the other on the - strand. The TSS for each of the genes in this example are close together, less than 2kb. Both TSS would show up in a list of RefSeq TSS for Drosophila.

The data I am currently working with are reads from a ChIP-seq experiment in mouse. While this genome is less compact, there are similar examples to be sure. This types of genes can yield specific types of artifacts in my analysis, and I need to filter them out.

The list TSS List I am working with is structured as follows:

#name       chrom    strand    txStart
NM_001008533    chr1    -    134199214
NM_001039510    chr1    -    134199214
NM_001282945    chr1    -    134199214
NM_175642    chr1    -    25067475

This list contains genes from both strands, however the first five lines all happen to be - strand TSS. Likewise, you can see the second instance I want to be able to filter out, those genes with multiple TSS, as is the case with the gene represented in the first three lines of this sample file

Both situations; those genes nearby to other genes in the genome, and those genes with multiple TSS, need to be filtered out for my analysis.

Thanks everyone for your time.

Bede

chip-seq rna-seq • 5.3k views
ADD COMMENT
0
Entering edit mode

Please paste a few lines from your file. Also, if you can tell what exactly you need by giving an example, it will help others so that they will be able to help you.

ADD REPLY
3
Entering edit mode
11.1 years ago

If you convert your TSS coordinates into a BED file, you can use BEDOPS bedmap with the --range extension to solve this problem efficiently.

In summary:

  1. Convert your TSS input (in whatever format it is in) to a four-column BED file with awk, Perl or whatever tool you prefer: $ awk '{...}' TSS.txt > TSS.unsorted.bed
  2. Sort this BED file with sort-bed to ensure it can allow bedmap to work efficiently: $ sort-bed TSS.unsorted.bed > TSS.bed
  3. Extend your sorted, BED-formatted TSS elements by +/-2 kb or whatever range desired with bedmap --range, and map the file onto itself, using --echo and --echo-map to report reference and mapped elements:

    $ bedmap --range 2000 --echo --echo-map TSS.bed > TSS.ext.bed
    

    Results will be of the form: [ element-1 ] | [ map-1 ] ; [ map-2 ] ; ... ; [ map-a ] [ element-2 ] | [ map-1 ] ; [ map-2 ] ; ... ; [ map-b ] ... [ element-n ] | [ map-1 ] ; [ map-2 ] ; ... ; [ map-k ]

    For every row — for every element — there are zero or more map elements which overlap the associated 2000 bp-padded element by one or more bases.

    Because we only want non-overlapping elements, our next step is to filter out any element results where there are two or more map elements.

    To conceptually understand this step better, run this bedmap step on a few sample lines of your input and see what is in the output. It should be clear what the output represents based on the coordinates that are shown.

  4. Filter this mapped output with awk, to get elements that have only one mapped result. We take advantage of the default field delimiters (| and ;) that bedmap adds:

    $ awk -F'|' '{ cnt = split($2,arr,";"); if (cnt==1) print $1; }' TSS.ext.bed > TSS.nooverlap.bed
    

The file TSS.nooverlap.bed will have your non-overlapping TSSs.

The nice thing about using bedmap is that the --range operand does not change the coordinates of the reference elements, but only changes the internal coordinates used to apply operations on the original, reference elements. So this makes the final result easier to report as-is, with no extra processing involved to adjust ranges.

Once you understand and can apply these steps, a lot of these can be strung together into a UNIX pipeline, which will improve performance immensely if you are working with large datasets:

$ sort-bed input.bed | bedmap --options - | awk {...} - > output.bed
ADD COMMENT
0
Entering edit mode

Alex, thanks for taking the time to provide a detailed answer.

ADD REPLY
0
Entering edit mode

Alex, I just installed and used BedOps for the first time and followed your instructions. I think everything worked. Beyond that, I was impressed with the speed with which BedOps operated!

ADD REPLY
0
Entering edit mode

Glad to hear that it worked out. And Shane will be happy to hear your feedback on performance!

ADD REPLY
0
Entering edit mode

One more question to ensure I understand what just happened. Am I correct in saying the aforementioned operations will remove not only those TSS within X distance on opposite strands, i.e. head to head genes such as those illustrated in my browser shot, but also the instances of multiple TSS with X base pairs or one another on the same gene, i.e. genes with multiple TSS? I believe this to be the case, but I want to make sure I followed and understood what the tools were doing.

Thanks again.

ADD REPLY
0
Entering edit mode

FOLLOW UP: Looking at the UCSC browser at a handful of genes my analysis revealed as "interesting" shows that they have a TSS very near to the TSS of the gene in question. Thus, I failed to generate a filtered TSS list as I sought. Back to the drawing board.

ADD REPLY
0
Entering edit mode

Consider thinking about preprocessing your TSS list into separate categories, then deal with special cases by applying separate sets of operations.

ADD REPLY
0
Entering edit mode

I downloaded a list of TSS using the UCSC table browser. Examining some "TSS" on this least revealed they are in fact TTS! This caused all manner of problems, as one may imagine.

ADD REPLY
0
Entering edit mode

You may need to validate your input set, filtering out data you're not interested in.

ADD REPLY
0
Entering edit mode
11.1 years ago

To remove all but one transcripts that have the same starting positions, you can use remove duplicates function from excel. Under the data tab, try remove duplicates using the fourth column. It will keep the first entry and remove others. This will take care of transcripts with same starting position. After this, you can calculate difference between the fourth column values between subsequent rows using excel and then use those values to filter out rows that have their transcription start position within some distance of the previous trnascript. As the file is already sorted using chromosome and positions it will take care of transcripts on both positive and negative strands. Hope this is what you are looking for.

ADD COMMENT

Login before adding your answer.

Traffic: 2176 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6