How to create a command that matches 13 characters in each column to be extracted into a new file, and leave behind the unmatched ones
0
0
Entering edit mode
7 weeks ago

Hello,

I have been trying to use featureCounts on Linux, and I'm a little stuck on how to proceed. I have been getting 0% matches with the program, and I found that one of your forums explained that it was due to the chromosome numbers being different in the .gtf file and the .bam file.

I have figured out how to get the names of the chromosomes of both files into one text file, but they do not perfectly match up since some appear in the gtf file, and not in the aligned sam/bam files. Can someone please help me figure out how to figure out how to get the lines with the first 13 matching characters to be put into a new file?

Thank you.

featureCounts • 1.3k views
ADD COMMENT
0
Entering edit mode

How certain are you that only chromosome names are mismatched? Could it be that at least some feature names are mismatched as well?

The reason I ask is this: no matter how you solve the existing problem, it is difficult to be certain that all the aspects will be solved. If possible, my suggestion to you is to find the matching genome and .gtf files and to repeat the mapping. That way there should be no ambiguity between the .bam files and the resulting counts.

ADD REPLY
0
Entering edit mode

I appreciate the concern of carefulness, but I'm very sure. I have scoured through thousands of lines of data for the past week, and the only thing off is the chromosome names in the .gtf and the .sam file. Literally, line-by-line. Also, featureCounts says that the only issue are the mismatch chromosome names with the --verbose tag activated in the script.

ADD REPLY
1
Entering edit mode

Did you use matched sequence and annotation from a single source or not? While we understand the desire to be secretive about your research, since we can't see your data we have to ask. It would help if you post a few lines of your SAM alignments and a couple of lines of GTF file.

I have figured out how to get the names of the chromosomes of both files into one text file, but they do not perfectly match up since some appear in the gtf file, and not in the aligned sam/bam files.

This particular bit should not be an issue. If your data does not contain expressed sequences from those chromosomes, the counts for those genes/features would be zero.

If I was to make a guess, you must have some other issue going on, if you are not getting counts for anything. It would also help to post your featureCounts command.

ADD REPLY
0
Entering edit mode

Thank you for answering. I thankfully was able to figure out what I needed to do to get the names of the chromosomes to line up to create a chromosome alias file. But, now the program cannot open that file.

Here is the output from the first 5 lines of the sam file

@HD VN:1.0 SO:unsorted

@SQ SN:k141_2140693_1 LN:744

@SQ SN:k141_2140693_2 LN:492

@SQ SN:k141_2140693_3 LN:1314

@SQ SN:k141_2140693_4 LN:3825

@SQ SN:k141_2140693_5 LN:597

@SQ SN:k141_2140693_6 LN:858

@SQ SN:k141_2140693_7 LN:312

Here are the first few headers of the .gtf chromosomes:

k141_2140693

k141_2140693

k141_2140693

k141_2140693

k141_2140693

k141_2140693

The featuresCount command: ./featureCounts -p --countReadPairs -O -T 48 -t CDS -g gene_id -a /vast/agnanad1/Leone/All_Megahit_Paired_map.gtf -A FeatureCounts_Aliases.txt -o All_Megahit_Paired_Genes_Count.txt /vast/agnanad1/Leone/All_Megahit_Paired_Genes_index/Megahit_Paired_mappedandsorted.bam

First few lines of chromosome alias file (which it cannot read):

k141_2140693 k141_2140693_1

k141_2140693 k141_2140693_2

k141_2140693 k141_2140693_3

k141_2140693 k141_2140693_4

k141_2140693 k141_2140693_5

k141_2140693 k141_2140693_6

k141_2140693 k141_2140693_7

I think maybe the SN in the samtools might be playing a factor. But I fear that the construction of the alias file might be as to why featureCounts is still failing

ADD REPLY
0
Entering edit mode

The alias file needs to have the following format according to the user manual:

Name of a comma-delimited text file that includes aliases of chromosome names. This file should contain two columns. First column contains names of chromosomes included in the SAF or GTF annotation and second column contains corresponding names of chromosomes in the reference genome. No column headers should be provided. Also note that chromosome names are case sensitive. This file can be used to match chromosome names between the annotation and the reference genome.

Now for

SN in the samtools might be playing a factor.

That represents sequence name. It is in proper format.

I have never used the alias file but for a start fix the format of the alias file to match the requirement above.

Did you add _N to the fasta headers of the sequences? How are these going to be correlated with the GTF file, since they all seem to be of different length.

@HD VN:1.0 SO:unsorted

If you are showing us header of the file that you are using in your command line then this file appears to be unsorted (just FYI). featureCounts should sort the file as needed when it runs.

ADD REPLY
0
Entering edit mode

I thought I created the alias file correctly since I was following the guidelines set up by featureCounts. This is the same instructions from the program. As I put in my examples, the first column contains the names of the chromosomes pertaining to the .gtf file (e.g., k141_2140693), the second column has the corresponding names of the chromosomes in the reference genome (e.g., k141_2140693_1).

So, if my alias file is wrong, I will happily fix it. But I do not know how it is wrong.

I did not add the _N to the fasta headers for the sequences. That was in the genome I created from Prodigal to produce a de-replicated reference genome to be input into bowtie2 and samtools to map my forward and reverse reads to it.

I know the file is unsorted. I have multiple variations that are unsorted and sorted to see if it would change the results. Sadly, it does not.

ADD REPLY
0
Entering edit mode

Ah I now see that there is a space between the two columns (formatting was not clear). Try using a tab instead if you think the file is not being read.

I still can't understand how the same name in column 1 can be mapped onto separate _N aliases. Ideally you need to change the GTF so the _N corresponding to the sequences appear in that file.

Here are the first few headers of the .gtf chromosomes:

k141_2140693

k141_2140693

Now that I re-read the post above you have not shown us example lines from GTF file. GTF files have 9 fields not just one column. For ref: https://www.ensembl.org/info/website/upload/gff.html

ADD REPLY
0
Entering edit mode

I have scoured through thousands of lines of data for the past week

I think you are going about this the wrong way. You could have re-mapped the dataset with proper genome and .gtf files - why do you spend a week doing manual checking when that procedure is inherently error-prone?

The fact that you managed to do chromosome name conversion but still didn't solve the problem also argues that maybe you should consider re-mapping the reads.

ADD REPLY
0
Entering edit mode

I understand the complaint, but this is my first time doing this. So, I'm learning as I go. I am trying to follow a similar pipeline as a foundation for my own, and this is how they ran their programs. It does not produce a "proper" genome from the program I am using. It produces a .gff file that has to be converted into a .gtf file via grep.

I'm not opposed to re-mapping the reads. But there is no need for condescending language when I'm doing this for the first time. It truly is not helpful.

ADD REPLY
0
Entering edit mode

But there is no need for condescending language when I'm doing this for the first time. It truly is not helpful.

Maybe I was being short in my explanation - I answer hundreds of questions daily on and off this site - but I don't see anything condescending. It is not like I was attacking your ability or belittling your intelligence. If anything, I was trying to be somewhat forceful here to save you some time if you decide to keep on the same track as before.

This is like getting a deflated tire. If a hole is small, it can be patched. Yet that patch will always be a weak point on that tire, and may become a problem again. I think your tire has too big of a hole and it is more productive to replace it rather than patch it. Either way, my opinion that you are going about it the wrong way is reflective of my experience with these types of problems, and most certainly is not meant as an insult to you or anyone else who supports a different opinion.

ADD REPLY

Login before adding your answer.

Traffic: 1697 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6