MS-MLBenchmark: Comprehensive ML-ready Benchmark for Mass Spectrometry-based Proteomics

Both database search algorithms and de novo methods attempt to deduce peptides from mass spectrometry data. Although both are used routinely in systems biology, each has pitfalls that the computational mass spectrometry community is working to address. Development of machine-learning models is an active area of research. Although rapid progress is being made, very few benchmarking datasets are available for assessing the performance of different models. Most assessments, including those of our own models, have been carried out using ad hoc datasets. The objective of this study is to develop benchmarking data that can be used to assess all deduction engines. Such benchmarking data will then serve as a “scale” against which all models can be measured. Our central hypothesis is that sufficient variation in the parameters of the data (i.e., species, fragmentation methodology, PTMs, and mass spectrometer instruments), organized in a systematic manner, will yield a distribution broad enough to assess ML models robustly.

Participate

Data collection and curation for this study are ongoing.