Entering edit mode
3.6 years ago
storm1907
▴
30
Hello,
I try to intersect GTF files with tped files with purpose to switch chromosomal location to gene location in tped files.
I created some topics before about steps needed:
- conversion from genomic to gene coordinates in a large file
- bedtools intersect combined with filtering after particular word
- converting spaces to tabs in gtf files
- conversion from genomic to gene coordinates in a large file
- tped files to bed with awk
So, I did steps suggested:
- Got gtf from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz
- extracted only genes annotation with
awk -F"\t" '$3=="gene"' my_file.gtf > my_genes_only_file.gtf
- converted nc chromosome symbols from
NC_xxx
toX
withsed -r 's/^NC_[0]*//;s/\.[0-9]\+//;/^N[WT]_|12920/d' GCF_000001405.39_GRCh38.p13_genomic.gtf > new.gtf
- converted to bed with
gtf2bed < foo.gtf
:gtf2bed < GCF_000001405.39_GRCh38.p13_genomic_geneonly_1copy.gtf > out
- removed unwanted symbols from final intersected file with
cat test sed 's/"//g' | sed 's/;//g' test > fintest
- printed out only the needed columns from intersected file:
cat fintest | awk '{print $4,$33,$35,$34,$36,$37}' > fintestnew
In the end I have file with different content of columns:
KLHL17 0 C C
KLHL17 0 A A
PLEKHN1 966179 0 966179 A A
PLEKHN1 966227 0 966227 G G
PERM1 "protein_coding" "C1orf170" gene_synonym 1 976215
PERM1 "protein_coding" "C1orf170" gene_synonym 1 976669
However the desired structure should be like this:
PLEKHN1 966179 0 966179 A A
PLEKHN1 966227 0 966227 G G
PLEKHN1 966272 0 966272 G G
Is it because of inconsistency in GTF files?
Thank you!
can you please post the output from
awk -F "\t" '{print $4,$33,$35,$34,$36,$37}' fintest > fintestnew
? Please post few example lines fromout
file.and here is the converted out.bed file
I think you are either using wrong input or expecting a wrong output. In expected output, there are alleles (singletons), bed (out file) doesn't have such entries (in copy/pasted out.bed). In addition,
cat test sed 's/"//g' | sed 's/;//g' test > fintest
code is totally incorrect unless there is a copy/paste mistake. Assuming that function is correct, thesed
code removes field separators which is incorrect, for downstream processing. You need to explain where third column (0 s), 5 and 6 coming from. Output bed file you shared here doesn't catch all those fields. Without correct data, it is difficult to address the issue.Sorry, forgot one step, bedtools intersect
Before step 5 bedtools intersect -wb was used, to get the file with both gene names, gene positions and SNPs gtf file was intersected with sample bed file (obtained from Plink tped file) Sample bed file:
bed file from gtf:
gtf2bed < GCF_000001405.39_GRCh38.p13_genomic_geneonly_1copy.gtf > out
storm1907 : Please do not delete threads once they have received comments/answers.
Sorry. Nobody answered, so I created another post. So, how can I get help to my question then, once this thread seems to be ignored?