Hi,
I have a FASTA file that looks like this:
>D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:6:1214:4434:71826
TGATCTTGGCCAAGTCACTTCCTCCTTCAGGAACATTGCAGTGG
>D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:1:2203:12900:37094
ACTTTGGTCAAGTCACTTCCTCCTTCAGGAACATTGCAGTGG
>D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:1:2213:1470:83635
Which I created through the code below. The .bam file in question is an aDNA file from a Neanderthal (http://cdna.eva.mpg.de/neandertal/Chagyrskaya/BAM/).
samtools bam2fq chr22.rh.bam > chag.22.fq
bioawk -c fastx '{print ">"$name"\n"$seq}' chag.22.fq > chag.22.fa
I also have some BED files that correspond to specific proteins which look like this:
22 20061211 20061231
22 20061274 20061275
22 20061301 20061302
22 20061329 20061333
22 20061334 20061335
But when I try to get the sequence for this protein out of the FASTA file I get this error:
bedtools getfasta -fi try.fa -bed tango2_chag.bed -fo test.fa.out
WARNING. chromosome (22) was not found in the FASTA file. Skipping
I've tried to solve this by adding the chromosome identifier '22' to both the .bam and .fasta files separately, getting a .bam that looks like this:
samtools view chr22.rh.bam | awk -v OFS='\t' '{print ">"$3":"$name}' > try.bam
>22:D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:6:1214:4434:71826 0 22 16050036 0 44M * 0 0 TGATCTTGGCCAAGTCACTTCCTCCTTCAGGAACATTGCAGTGG ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] XI:Z:TGATAAC YI:Z:BCCBBEG XJ:Z:CATCCGG YJ:Z:CCCCCGG FF:i:3 Z0:i:0 XT:A:R NM:i:2 X0:i:2 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:0A2C40 XA:Z:14,-19792922,44M,2; RG:Z:L5833
>22:D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:1:2203:12900:37094 0 22 16050038 0 42M * 0 0 ACTTTGGTCAAGTCACTTCCTCCTTCAGGAACATTGCAGTGG ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] XI:Z:CCTAACG YI:Z:BB3ABGG XJ:Z:CAGTACT YJ:Z:BBCBBGG FF:i:3 Z0:i:0 XT:A:R NM:i:2 X0:i:2 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:2C4C34 XA:Z:14,-19792922,42M,2; RG:Z:L5831
Which either produces a 'Malformed FASTA file error' or the same 'WARNING. chromosome (22) was not found in the FASTA file. Skipping' warning.
Is there any way to get the sequence out for these proteins using the Neanderthal FASTA file? I CAN produce protein FASTAs using the human FASTA file downloaded from NCBI, but not with the Neanderthal files (but this obviously does not have the data I would want in them - wrong species, despite close relatedness).
There's something wrong with the FASTA files I'm making - can anyone help solve this problem? I can't just download Neanderthal FASTA files as from what I've found, they don't exist.
If you need reads that overlap the intervals you have in your BED files you can simply use
samtools view
command with the region (-L
) option.Hi, I need the sequences for the proteins from the FASTA file, I don't need to overlap the intervals in the protein .bed files, but thank you! Would this also work for the FASTA files?
Can you clarify how you are planning to get protein sequences from a DNA sequence alignment and that too of short reads from an ancient DNA sample. What is in
try.fa
?Hi, try.fa looks like the FASTA file at the top of the question
The protein sequences will be a bit spotty, but I want to essentially intersect the .bed file I have in the question to get as much of the actual sequence as I can.
I got these protein bed files via intersecting the DNA sequence alignment of the Neanderthal individual with the Gr37 human alignment. Of course, this did not work for all the proteins, as the aDNA .bed files do have gaps in them. This is just one protein that I could find. If it would be helpful (and if you want!) I could send you my pipeline?
Looks like you are mixing up chromosomal sequences from some human or Neanderthal assembly with the mapped to that assembly sequencing reads. The first (genomic fasta) knows nothing about any sequence variation likely present in your sequencing reads. The separate reads are of no use to look for the consensus sequence. You probably want to do something along the lines:
https://bioinformatics.stackexchange.com/questions/15168/how-to-generate-a-consensus-sequence-from-a-multi-reference-bam-file