
Notes 2024/10/01: fragment ion indexing support in Comet

Fragment ion indexing was first introduced by MSFragger in 2017, and this strategy has since been adopted by search tools such as MetaMorpheus and Sage. And yes, you are encouraged to go use MSFragger, MetaMorpheus, Sage, and all of the other great peptide identification tools out there.

Fragment ion indexing (abbreviated as “FI” or “Comet-FI” going forward) is supported in Comet as of version 2024.02 rev. 0. Given this is the first Comet release with FI functionality, we expect to improve on features, performance, and functionality going forward.

Comet-FI preprint: https://www.biorxiv.org/content/10.1101/2024.10.11.617953v1

A Comet-FI search is invoked when the search database is an .idx file. It can be specified in the “database_name” parameter entry in comet.params or via the “-D” command line option. Effectively it’s as simple as specifying “human.fasta.idx” as your search database if the corresponding FASTA file is “human.fasta”.

An .idx file is a plain peptide file containing a list of peptides, their unmodified masses, pointers to the proteins each peptide occurs in, and combinatorial bitmasks representing potential variable modification positions. If you would like to create a plain peptide .idx file directly, without it being part of a search, use the “-i” command line option. In the first example below, an .idx file is created for whatever search database is specified in comet.params. In the second example, the search database “human.fasta” is specified by the “-D” command line option, which overrides the database specified in comet.params. The new .idx file is created with the same name as the input FASTA file plus an .idx extension, so the second example would create “human.fasta.idx”.

comet.exe -i
comet.exe -Dhuman.fasta -i

Note that you are also able to simply specify an .idx file as the search database without creating it first. If an .idx database is specified but doesn’t exist, Comet will create the .idx file from the corresponding FASTA file and then proceed with the search.

The examples below all specify the .idx file on the command line using the -D command line option.

comet.exe -Dhuman.fasta.idx somefile.raw
comet.exe -Dhuman.fasta.idx *.mzML
comet.exe -Dhuman.fasta.idx 202410*.mzXML

The commands below would run the equivalent FI searches if “database_name = human.fasta.idx” were set in comet.params:

comet.exe somefile.raw
comet.exe *.mzML
comet.exe 202410*.mzXML
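For reference, the comet.params entry mentioned above is just the standard database line pointing at the .idx file:

```
database_name = human.fasta.idx
```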

Any time the set of variable modifications or the digestion mass range is changed (and I’m sure other parameters I’m forgetting right now), you should re-create the .idx file. If these parameters do not change, you can reuse the same .idx file. Checking whether the .idx file needs to be updated will be performed automatically in some future update.

Once a Comet-FI search is invoked, the plain peptide file is parsed, all peptide modification permutations are generated, the bazillion fragment ions are calculated, and the FI is populated. Then the input files are queried against the FI. If multiple input files are searched (e.g. “comet.exe *.raw”), the one-time cost of generating the FI, incurred once at the beginning of the search, is avoided for all subsequent files. Calculating all fragment ions and populating the index can take a long time and consume a lot of memory for large search spaces (large databases, unspecific cleavage rules, multiple variable modifications).
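To illustrate the general idea of building an index once and querying it per spectrum, here is a minimal sketch of fragment ion indexing in Python. This is a toy binned layout for illustration only; the bin width, the example peptides and masses, and the data structures are all assumptions and do not reflect Comet’s actual implementation.

```python
# Toy sketch of a fragment ion index: map binned fragment m/z values to the
# peptides that produce them, so a spectrum peak can look up candidate
# peptides directly instead of scoring every peptide in the database.
# All numbers and structures here are illustrative, not Comet internals.
from collections import defaultdict

BIN_WIDTH = 0.02  # fragment m/z bin width in Th (illustrative choice)

def build_index(peptides):
    """One-time cost: map fragment m/z bins -> peptide ids."""
    index = defaultdict(list)
    for pep_id, frag_mzs in peptides.items():
        for mz in frag_mzs:
            index[int(mz / BIN_WIDTH)].append(pep_id)
    return index

def query(index, peak_mz, tol=0.02):
    """Per-peak lookup: peptides with a fragment near an observed peak."""
    lo = int((peak_mz - tol) / BIN_WIDTH)
    hi = int((peak_mz + tol) / BIN_WIDTH)
    hits = set()
    for b in range(lo, hi + 1):
        hits.update(index.get(b, ()))
    return hits

# Hypothetical peptides with a few precomputed fragment m/z values:
peptides = {"PEPTIDER": [263.09, 376.18, 505.22], "LESSK": [244.13, 331.16]}
idx = build_index(peptides)
print(query(idx, 376.19))  # {'PEPTIDER'}
```

Because the index is built once and then queried for every peak of every spectrum, searching many input files against the same index amortizes the construction cost, which is the behavior described above.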

Current limitations and known issues with Comet-FI:

Fragment ion index specific search parameters

Memory use and performance

There are many factors that go into how much memory will be consumed, including the database size, the enzyme specificity, and the number of variable modifications.

One can easily generate over a billion fragment ions in a standard human, target + decoy, tryptic analysis just by adding a few variable modifications, and representing a billion fragment ions in an in-memory fragment index requires many GB of RAM. You might get away with some smaller searches on a 16 GB or 32 GB machine; many searches can be done with 64 GB of RAM. And if you’re a power user who wants to analyze MHC peptides, which require non-specific enzyme constraint searches, don’t attempt such a search with many modifications as it simply won’t work.
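A rough back-of-envelope calculation shows how quickly the fragment count grows. Every number here is an illustrative assumption (peptide counts, modified forms per peptide, average length, bytes per index entry), not a measurement of Comet:

```python
# Back-of-envelope estimate of fragment ion index size.
# All constants below are illustrative assumptions, not Comet internals.

def estimate_fragments(n_peptides, forms_per_peptide, avg_length, ion_series=2):
    """Each peptide form contributes (length - 1) fragments per ion
    series (e.g. b- and y-ions)."""
    return n_peptides * forms_per_peptide * (avg_length - 1) * ion_series

# Assume ~2.5M tryptic target+decoy peptides, ~4 modified forms each,
# average peptide length ~15:
fragments = estimate_fragments(2_500_000, 4, 15)
print(f"{fragments:,} fragment ions")  # 280,000,000 fragment ions

# At a hypothetical ~12 bytes per indexed fragment entry:
gib = fragments * 12 / 2**30
print(f"~{gib:.1f} GiB for index entries alone")
```

Bump the modified forms per peptide (more variable modifications) or the peptide count (non-specific digestion), and the estimate climbs into the billions of fragments and tens of GB, which is consistent with the RAM guidance above.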

The following searches were run using 8 cores of an AMD Epyc 7443P processor running Ubuntu Linux 22.04. The query file is a two-hour Orbitrap Lumos run with MS/MS spectra acquired at high resolution (approximately 92,000 MS/MS spectra). Search settings: up to two of each specified variable modification allowed per peptide, peptide length 7 to 50, digest mass range 600.0 to 5000.0, tryptic digest with 2 allowed missed cleavages, and +/- 20 ppm precursor tolerance considering up to two isotope offsets. The .idx files were created before the search, so search times and .idx creation times are reported separately.