Supported file formats
MS-MINT supports several mass spectrometry (MS) data formats by converting each one into a standardized tabular format for downstream analysis. The table below lists the file types that MS-MINT can read.
Supported File Formats in MS-MINT
| Format | Extension | Description | Reader function |
|---|---|---|---|
| mzXML | `.mzxml` | An older open XML-based format for MS data. | `mzxml_to_df()` |
| mzML | `.mzml` | Widely used open standard for MS data (XML-based). | `mzml_to_df()` |
| mzMLb | `.mzmlb` | A binary-efficient variant of mzML (faster, smaller). | `mzmlb_to_df__pyteomics()` |
| HDF5 | `.hdf`, `.h5` | Hierarchical format often used for storing large numerical datasets. | `pd.read_hdf()` |
| Feather | `.feather` | Fast, lightweight binary format for DataFrames (used with Arrow). | `pd.read_feather()` |
| Parquet | `.parquet` | Columnar data format for fast reads and compression. | `pd.read_parquet()` |
How It Works
The function `ms_file_to_df()` acts as a universal file loader. It:
- Detects the file extension (e.g., `.mzML`, `.hdf`).
- Dispatches to the appropriate reader (e.g., `mzml_to_df`, `read_parquet`).
- Normalizes the schema so that the resulting DataFrame contains MS-MINT's standard columns.

These columns are required for MS-MINT processing and analysis.
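
The extension-based dispatch described above can be sketched as a simple lookup table. This is only an illustration, not MS-MINT's actual implementation: the `READERS` mapping and the `pick_reader()` helper are hypothetical names, the reader names are taken from the table of supported formats, and case-insensitive matching of the extension is an assumption.

```python
from pathlib import Path

# Hypothetical lookup table: lower-cased extension -> name of the reader
# listed in the table of supported formats.
READERS = {
    ".mzxml": "mzxml_to_df",
    ".mzml": "mzml_to_df",
    ".mzmlb": "mzmlb_to_df__pyteomics",
    ".hdf": "pd.read_hdf",
    ".h5": "pd.read_hdf",
    ".feather": "pd.read_feather",
    ".parquet": "pd.read_parquet",
}


def pick_reader(path):
    """Return the name of the reader that would handle `path`."""
    # Assumption: extensions are compared case-insensitively,
    # so ".mzML" and ".mzml" select the same reader.
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file format: {suffix!r}") from None
```

A call such as `pick_reader("sample.mzML")` selects the mzML reader, while an unknown extension raises a `ValueError` instead of failing later in the pipeline.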
Special Cases and Notes
Time Unit Handling
- mzXML and mzML files may report scan times in minutes. MS-MINT converts these values to seconds.
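
As a minimal sketch of this normalization, the conversion is a multiplication by 60 when the source unit is minutes. The helper `scan_time_to_seconds()` and its `unit` parameter are hypothetical and only illustrate the idea. How MS-MINT detects the unit in a given file is not shown here.

```python
def scan_time_to_seconds(value, unit="minute"):
    """Convert a scan time to seconds (hypothetical helper)."""
    if unit == "minute":
        return value * 60.0
    if unit == "second":
        return float(value)
    # Fail loudly rather than guess at an unknown unit.
    raise ValueError(f"Unknown time unit: {unit!r}")
```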
Thermo RAW Parquet Files
- If you load a `.parquet` file that is not already in MS-MINT format, MS-MINT attempts to reformat it using `format_thermo_raw_file_reader_parquet()`.
mzMLb Support
- Reading `.mzmlb` files only works if the optional dependency `pyteomics.mzmlb` is available.
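
One common way to check for an optional dependency like this is to attempt the import and record whether it succeeded. The sketch below assumes nothing about how MS-MINT performs the check internally: `has_module()` and `MZMLB_AVAILABLE` are hypothetical names used for illustration.

```python
import importlib


def has_module(name):
    """Return True if the module `name` can be imported."""
    try:
        importlib.import_module(name)
    except ImportError:
        # Also covers ModuleNotFoundError, which subclasses ImportError.
        return False
    return True


# True only when pyteomics and its mzMLb support can be imported.
MZMLB_AVAILABLE = has_module("pyteomics.mzmlb")
```

Checking the flag before loading a `.mzmlb` file lets a script report a clear "install the optional dependency" message instead of an import traceback.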