Oral Presentation 24th Annual Lorne Proteomics Symposium 2019

Toffee: A highly compressed file format for time of flight and orbitrap DIA-MS (#38)

David Clarke 1, Akila Seneviratne 1, Brett Tully 1
  1. Children’s Medical Research Institute, The University of Sydney, Westmead, NSW, Australia

Data generated by data-independent acquisition mass spectrometry (DIA-MS) are stored in proprietary vendor formats that are opaque to open-source software. Typically, a user will convert these data to the open mzML format before utilising a computational proteomics pipeline such as OpenSWATH [1]. However, mzML files tend to consume a large amount of disk space and are organised around slices (or scans) in retention-time space. The former limitation significantly increases the computational hardware and funding required for high-throughput proteomics, while the latter limits the manner in which software can efficiently access the data. Toffee is an HDF5-backed file format that is portable, open, and highly efficient, with resulting file sizes that are 5% smaller than the original vendor file. It takes advantage of the inherent sparsity of DIA-MS raw data and the physics of the relevant mass analyser to compress the data while preserving the information content. Toffee also provides fast access to subsections of the data, which makes exploration and manipulation of the raw data straightforward. These benefits come at the cost of a 5-10 ppm loss in mass accuracy; however, initial testing shows that this error has negligible impact on the results of OpenMSToffee (a wrapper around OpenSWATH that enables its use with toffee files).
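
As a rough illustration of how the physics of a mass analyser might be used for compression, the following Python sketch assumes a time-of-flight instrument in which sqrt(m/z) varies linearly with an integer bin index, so that each m/z value can be stored as a small integer. The constants `ALPHA` and `BETA`, the function names, and the example scan are hypothetical and are not taken from the toffee implementation; zero intensities are dropped to mimic the use of sparsity.

```python
import math

# Hypothetical calibration constants, chosen purely for illustration:
# sqrt(m/z) = ALPHA * index + BETA
ALPHA = 2.0e-4
BETA = 0.0


def mz_to_index(mz):
    """Snap an m/z value to the nearest integer bin on the sqrt(m/z) grid."""
    return round((math.sqrt(mz) - BETA) / ALPHA)


def index_to_mz(index):
    """Recover the (quantised) m/z value represented by an integer bin."""
    return (ALPHA * index + BETA) ** 2


def ppm_error(mz):
    """Mass accuracy lost by the round trip m/z -> index -> m/z, in ppm."""
    recovered = index_to_mz(mz_to_index(mz))
    return abs(recovered - mz) / mz * 1e6


def to_sparse(scan):
    """Keep only non-zero points of a scan as (index, intensity) pairs."""
    return [(mz_to_index(mz), intensity) for mz, intensity in scan if intensity > 0]


# A made-up scan of (m/z, intensity) points; most real DIA-MS points are zero.
scan = [(400.2, 0), (500.3, 120), (750.4, 0), (900.55, 35)]
sparse = to_sparse(scan)
```

Under this assumed grid, the round-trip error is bounded by roughly `ALPHA / sqrt(m/z)` in relative terms, which shows how a fixed integer grid trades a small, predictable loss in mass accuracy for compact storage.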

  1. Röst, H. L., Rosenberger, G., Navarro, P., Gillet, L., Miladinović, S. M., Schubert, O. T., … Aebersold, R. (2014). OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data. Nature Biotechnology, 32(3), 219–223. https://doi.org/10.1038/nbt.2841