Find strandness of sequences from raw sequences
2
0
Entering edit mode
3.0 years ago
Parham ★ 1.6k

Hi, I have a tab delimited file that looks as follow:

chr10   82148982        82149017        CAGAGACTGTCTACCCGGAATCAACTGA
chr13   113829490       113829525       AAGAGGGACCTGACCCGGGGAGACCACC
chr18   12376182        12376248        TTCCCGTGGCCGACCCGGGGACCTCAAC
chr1    96371836        96371909        AGAATGGCGTGAACCCGGGAGGCGAAGC
chr22   41933155        41933213        AAGCATCTCCCTACCCGGCCGTCTCCTC
chr1    202157405       202157457       AGAATCGCTTGAACCCGGGAGGCGGAGG

and it goes on for thousand sequences. I need the strand information for each sequence. Could benefit from some help on how to fetch that from the genome and add it to the next column.

strand • 1.2k views
ADD COMMENT
0
0
Entering edit mode

Simple.

  1. Use getFasta from bedtools using first 3 columns and store the result in 5th column
  2. Compare 4th with 5th column
  3. If 4th column entry matches with 5th column, it is + strand and if it doesn't, it's - column (but make sure it's complementary)

If you want to be careful, take a different approach. Create a 5th column and store reverse complement of column 4 sequences, programmatically using revcomp or some other tool. Then repeat 1st step above and store the results in 6th column. If 6th column matches with 4 th column, then it is + strand (4th column sequence). If 6th column matches with 5th column, then it is - strand (4th column sequence).

ADD REPLY
0
Entering edit mode

Indeed, I was lazy and omitted the rev complement check in my answer below /shrug

ADD REPLY
3
Entering edit mode
3.0 years ago
GenoMax 148k

Are the sequences always from top/+ strand? If you look at the intervals they all seem to indicate that (in this example). Otherwise you will need to blat them to genome. Here is example 1 from your list (these must be human hg38 intervals):

00000001 cagagactgtctacccggaatcaactga 00000028
>>>>>>>> |||||||||||||||||||||||||||| >>>>>>>>
82148985 cagagactgtctacccggaatcaactga 82149012
ADD COMMENT
0
Entering edit mode

Hi GenoMax, thanks for your expert comment. Yes they are hg38 intervals and what you said is very correct. I went back into my scripts that generated this data and figured out that only + strand was selected to be reported. So that solved my problem now.

ADD REPLY
0
Entering edit mode
3.0 years ago
ATpoint 86k

You can use bedtools getfasta to retrieve the sequences from a reference file based on the coordinates and then compare the obtained sequence (which is always top strand) with your sequence. if they match it is +, else it is -.

Here an example using two GRCh38 sequences:

#/ get genome:
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/GRCh38.primary_assembly.genome.fa.gz

#/ index with samtools:
gzip -d GRCh38.primary_assembly.genome.fa.gz
samtools faidx GRCh38.primary_assembly.genome.fa.gz

#/ examples, first one is minus, second is plus strand:
echo -e "chr10\t82148982\t82149017\tATTCCTCAGTTGATTCCGGGTAGACAGTCTCTGCT\nchr13\t113829490\t113829525\tCCAAGAGGGACCTGACCCGGGGAGACCACCTGATC" > sequences.bed

$ cat sequences.bed 
chr10   82148982    82149017    ATTCCTCAGTTGATTCCGGGTAGACAGTCTCTGCT
chr13   113829490   113829525   CCAAGAGGGACCTGACCCGGGGAGACCACCTGATC

#/ get fasta:
bedtools getfasta -bed sequences.bed -fi GRCh38.primary_assembly.genome.fa -fo retrieved.fa

#/ parse out to BED format
bioawk -c fastx '{print $name, $seq}' retrieved.fa | tr "\\:|\\-", "\t" > retrieved.bed

#/ compare and determine strand:
$ paste sequences.bed retrieved.bed \
| awk 'OFS="\t" {if($4 == $8) print $1, $2, $3, $4, ".", "+"} {if($4 != $8) print $1, $2, $3, $4, ".", "-"}'
chr10   82148982    82149017    ATTCCTCAGTTGATTCCGGGTAGACAGTCTCTGCT .   -
chr13   113829490   113829525   CCAAGAGGGACCTGACCCGGGGAGACCACCTGATC .   +
ADD COMMENT

Login before adding your answer.

Traffic: 1997 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6