Notes 2019.03.14

Here’s some notes on Comet’s internally-generated decoy peptides.

A decoy peptide is generated for each target peptide scored. This guarantees a 1:1 ratio of target to decoy peptides.
Comet generates pseudo-reverse decoy peptides per Elias & Gygi. In generating the pesudo-reverse decoy peptide, the terminal residue fixed (e.g. the last K or R for a tryptic peptide) and every other residue in the peptide is reversed. For an enzyme that cleaves n-terminal to a residue, such as AspN, the first residue in the peptide is fixed and every other residue is reversed. For example, target tryptic peptide CLSTWGK will generate a decoy peptide GWTSLCK. A target AspN peptide DSANLPQ will generate a decoy peptide DQPLNAS.
If a residue is modified, the modification will move with the residue in the decoy peptide e.g. target peptide M[15.9949]QEATLSK will generate a decoy peptide SLTAEQM[15.9949]K. If there were a distance constraint forcing this modification to only appear on the n-terminal residue of the peptide, this constraint is not enforced in the decoy generation.

2023/04/20 additional notes from an email I composed in 11/2022:

Here’s what I would tell people regarding using Comet’s internal decoys vs searching a forward + reverse FASTA:

There’s nothing special about decoy peptides; they’re just sequences that we expect are wrong so that we can use them for calculating FDR.
This means there’s nothing special about decoy peptides that come from reversing a protein vs. shuffling or however else decoys are generated.

Many computational researchers have spent a lot of time investigating/analyzing how to best generate decoys for optimal FDR calculations. At the end of the day, we’re just estimating FDR on a whole lot of peptide IDs and not performing some fine-tuned, precise calculation. And thinking about decoy generation, decoy sequences from reversing proteins are the least special decoys; there’s nothing magical about them. We use reverse protein decoys as they’re easy to generate and work fine. So that being said …

For each target peptide that is analyzed, Comet’s internal decoy generation keeps the target peptide’s C-terminal residue the same and reverses everything else. These are the pseudo-reverse peptides referred to by Elias and Gygi. If an N-terminal protease is applied, the N-term residue is left in place and everything else is reversed.

So fine … one could do a search against a forward + reverse FASTA file or against just a forward FASTA + Comet’s internal decoys. If everything were equal, which I’m arguing they pretty much are, there would be no real reason that one of these options would be preferred over the other. Certainly not from a FDR calculation standpoint. So here’s the reasons that Comet’s internal decoys has advantages:

Comet generates an internal decoy for each target peptide that is analyzed. This guarantees an exact 1:1 target to decoy peptide ratio irrespective of what database is searched and what search parameters are applied. A foward+reverse database has a 1:1 target:reverse protein ratio but the analyzed peptides will not be exactly 1:1. Generally it’s close enough to 1:1 for most forward+reverse searches that it doesn’t matter (because FDR is just an estimate anyways). But once you start using small databases and really tightening up your precursor tolerances, you’ll start deviating more from the ideal 1:1 target:decoy peptide ratio. One might argue that folks should never do FDR analysis on searches against small databases with narrow tolerances because the peptide search space is so small but at least the 1:1 target:decoy premise is never an issue with Comet’s internal decoys. Does this mean the internal decoy FDRs are more accurate? I have no clue and FDR is just an estimate so it’s not worth overthinking but it’s hard to imagine how it could be worse.
Comet’s internal decoy searches are faster than searching a forward-reverse FASTA with the difference increasing as you analyze more variable modifications. As a quick example, I just searched a random HeLa file with ~40K spectra against a human database. Without any variable mods, the search times are nearly identical. Adding in oxidized methionine as the only variable mod, the forward-reverse FASTA took 1:29 to complete while the internal decoy search took 1:12 to complete. When analyzing oxidized M and phospho STY, the forward-reverse FASTA took 5:09 and the internal decoy search took 3:40.