Question

Find strandness of sequences from raw sequences

0

Entering edit mode

3.0 years ago

Parham ★ 1.6k

Hi, I have a tab delimited file that looks as follow:

chr10   82148982        82149017        CAGAGACTGTCTACCCGGAATCAACTGA
chr13   113829490       113829525       AAGAGGGACCTGACCCGGGGAGACCACC
chr18   12376182        12376248        TTCCCGTGGCCGACCCGGGGACCTCAAC
chr1    96371836        96371909        AGAATGGCGTGAACCCGGGAGGCGAAGC
chr22   41933155        41933213        AAGCATCTCCCTACCCGGCCGTCTCCTC
chr1    202157405       202157457       AGAATCGCTTGAACCCGGGAGGCGGAGG

and it goes on for thousand sequences. I need the strand information for each sequence. Could benefit from some help on how to fetch that from the genome and add it to the next column.

strand • 1.2k views

ADD COMMENT • link updated 3.0 years ago by ATpoint 86k • written 3.0 years ago by Parham ★ 1.6k

0

Entering edit mode

bedtools getfasta: https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html

ADD REPLY • link 3.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Simple.

Use getFasta from bedtools using first 3 columns and store the result in 5th column
Compare 4th with 5th column
If 4th column entry matches with 5th column, it is + strand and if it doesn't, it's - column (but make sure it's complementary)

If you want to be careful, take a different approach. Create a 5th column and store reverse complement of column 4 sequences, programmatically using revcomp or some other tool. Then repeat 1st step above and store the results in 6th column. If 6th column matches with 4 th column, then it is + strand (4th column sequence). If 6th column matches with 5th column, then it is - strand (4th column sequence).

ADD REPLY • link 3.0 years ago by cpad0112 21k

0

Entering edit mode

Indeed, I was lazy and omitted the rev complement check in my answer below /shrug

ADD REPLY • link 3.0 years ago by ATpoint 86k

0

Entering edit mode

3.0 years ago

ATpoint 86k

You can use bedtools getfasta to retrieve the sequences from a reference file based on the coordinates and then compare the obtained sequence (which is always top strand) with your sequence. if they match it is +, else it is -.

Here an example using two GRCh38 sequences:

#/ get genome:
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/GRCh38.primary_assembly.genome.fa.gz

#/ index with samtools:
gzip -d GRCh38.primary_assembly.genome.fa.gz
samtools faidx GRCh38.primary_assembly.genome.fa.gz

#/ examples, first one is minus, second is plus strand:
echo -e "chr10\t82148982\t82149017\tATTCCTCAGTTGATTCCGGGTAGACAGTCTCTGCT\nchr13\t113829490\t113829525\tCCAAGAGGGACCTGACCCGGGGAGACCACCTGATC" > sequences.bed

$ cat sequences.bed 
chr10   82148982    82149017    ATTCCTCAGTTGATTCCGGGTAGACAGTCTCTGCT
chr13   113829490   113829525   CCAAGAGGGACCTGACCCGGGGAGACCACCTGATC

#/ get fasta:
bedtools getfasta -bed sequences.bed -fi GRCh38.primary_assembly.genome.fa -fo retrieved.fa

#/ parse out to BED format
bioawk -c fastx '{print $name, $seq}' retrieved.fa | tr "\\:|\\-", "\t" > retrieved.bed

#/ compare and determine strand:
$ paste sequences.bed retrieved.bed \
| awk 'OFS="\t" {if($4 == $8) print $1, $2, $3, $4, ".", "+"} {if($4 != $8) print $1, $2, $3, $4, ".", "-"}'
chr10   82148982    82149017    ATTCCTCAGTTGATTCCGGGTAGACAGTCTCTGCT .   -
chr13   113829490   113829525   CCAAGAGGGACCTGACCCGGGGAGACCACCTGATC .   +

ADD COMMENT • link 3.0 years ago by ATpoint 86k

score 3 · Accepted Answer · 2021-12-13

3

Entering edit mode

3.0 years ago

GenoMax 148k

Are the sequences always from top/+ strand? If you look at the intervals they all seem to indicate that (in this example). Otherwise you will need to blat them to genome. Here is example 1 from your list (these must be human hg38 intervals):

00000001 cagagactgtctacccggaatcaactga 00000028
>>>>>>>> |||||||||||||||||||||||||||| >>>>>>>>
82148985 cagagactgtctacccggaatcaactga 82149012

ADD COMMENT • link 3.0 years ago by GenoMax 148k

0

Entering edit mode

Hi GenoMax, thanks for your expert comment. Yes they are hg38 intervals and what you said is very correct. I went back into my scripts that generated this data and figured out that only + strand was selected to be reported. So that solved my problem now.

ADD REPLY • link 3.0 years ago by Parham ★ 1.6k