Hi, I have a tab-delimited file that looks as follows:
chr10 82148982 82149017 CAGAGACTGTCTACCCGGAATCAACTGA
chr13 113829490 113829525 AAGAGGGACCTGACCCGGGGAGACCACC
chr18 12376182 12376248 TTCCCGTGGCCGACCCGGGGACCTCAAC
chr1 96371836 96371909 AGAATGGCGTGAACCCGGGAGGCGAAGC
chr22 41933155 41933213 AAGCATCTCCCTACCCGGCCGTCTCCTC
chr1 202157405 202157457 AGAATCGCTTGAACCCGGGAGGCGGAGG
and it goes on for thousands of sequences. I need the strand information for each sequence. I would appreciate some help on how to fetch that from the genome and add it as the next column.
bedtools getfasta: https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html
Simple: run getfasta from bedtools using the first 3 columns and store the result in a 5th column.
If you want to be careful, take a different approach. Create a 5th column and store the reverse complement of the column 4 sequences, programmatically using revcomp or some other tool. Then repeat the first step above and store the results in a 6th column. If the 6th column matches the 4th column, then the column 4 sequence is on the + strand. If the 6th column matches the 5th column, then it is on the - strand.
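The comparison step could be sketched in Python roughly as below. This is only an illustration, not a tested pipeline: the function names (`revcomp`, `call_strand`, `add_strand_column`) are made up for this example, and it assumes you already have the forward-strand genomic sequence for each interval (for instance from bedtools getfasta run without strand handling) in the same order as the rows. Because the intervals in the example file are longer than the column 4 sequences, the sketch tests for containment rather than exact equality, which is also an assumption.

```python
# Complement table for DNA bases, including N and lower-case (soft-masked) bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def call_strand(query, genomic):
    """Return '+', '-' or '.' for a query against forward-strand genomic sequence.

    '.' is returned when the query is found on neither strand, or on both
    (e.g. a palindromic sequence), since the strand cannot be decided then.
    """
    q = query.upper()
    g = genomic.upper()  # ignore soft-masking in the fetched sequence
    on_plus = q in g
    on_minus = revcomp(q) in g
    if on_plus and not on_minus:
        return "+"
    if on_minus and not on_plus:
        return "-"
    return "."


def add_strand_column(rows, genomic_seqs):
    """Append a strand column to tab-delimited rows (chrom, start, end, seq).

    genomic_seqs must hold the fetched forward-strand sequence for each row,
    in the same order as rows.
    """
    out = []
    for row, genomic in zip(rows, genomic_seqs):
        fields = row.rstrip("\n").split("\t")
        fields.append(call_strand(fields[3], genomic))
        out.append("\t".join(fields))
    return out
```

Rows that come back as `.` are worth checking by hand, e.g. for a wrong genome build or off-by-one coordinates.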
Indeed, I was lazy and omitted the rev complement check in my answer below /shrug