Supported file formats¶
MS-MINT is designed to support a variety of mass spectrometry (MS) data formats by converting them into a standardized tabular format for downstream analysis. Here's a breakdown of the file types you can use with MS-MINT, based on your code:
Supported File Formats in MS-MINT¶
Format | Extension | Description | Read By Function |
---|---|---|---|
mzXML | .mzxml |
An older open XML-based format for MS data. | mzxml_to_df() |
mzML | .mzml |
Widely used open standard for MS data (XML-based). | mzml_to_df() |
mzMLb | .mzmlb |
A binary-efficient variant of mzML (faster, smaller). | mzmlb_to_df__pyteomics() |
HDF5 | .hdf , .h5 |
Hierarchical format often used for storing large numerical datasets. | pd.read_hdf() |
Feather | .feather |
Fast, lightweight binary format for DataFrames (used with Arrow). | pd.read_feather() |
Parquet | .parquet |
Columnar data format for fast read and compression. | pd.read_parquet() |
How It Works¶
The function ms_file_to_df()
acts as a universal file loader. It:
- Detects the file extension (e.g.,
.mzML
,.hdf
, etc.) - Dispatches to the appropriate reader (e.g.,
mzml_to_df
,read_parquet
) - Normalizes the schema to include the following standard columns:
These columns are crucial for MS-MINT processing and analysis.
Special Cases and Notes¶
- Time Unit Handling:
-
mzXML and mzML files may report scan times in minutes, but MS-MINT normalizes this to seconds.
-
Thermo RAW Parquet Files:
-
If you load a
.parquet
file not already in MS-MINT format, MS-MINT attempts to reformat it usingformat_thermo_raw_file_reader_parquet()
. -
mzMLb Support:
- Only works if the optional dependency
pyteomics.mzmlb
is available. - If not installed, MS-MINT will log a warning but still function for other formats.