Supported file formats¶
MS-MINT is designed to support a variety of mass spectrometry (MS) data formats by converting them into a standardized tabular format for downstream analysis. Here's a breakdown of the file types you can use with MS-MINT, based on your code:
Supported File Formats in MS-MINT¶
Format | Extension | Description | Read By Function |
---|---|---|---|
mzXML | .mzxml |
An older open XML-based format for MS data. | mzxml_to_df() |
mzML | .mzml |
Widely used open standard for MS data (XML-based). | mzml_to_df() |
mzMLb | .mzmlb |
A binary-efficient variant of mzML (faster, smaller). | mzmlb_to_df__pyteomics() |
HDF5 | .hdf , .h5 |
Hierarchical format often used for storing large numerical datasets. | pd.read_hdf() |
Feather | .feather |
Fast, lightweight binary format for DataFrames (used with Arrow). | pd.read_feather() |
Parquet | .parquet |
Columnar data format for fast read and compression. | pd.read_parquet() |
How It Works¶
The function ms_file_to_df()
acts as a universal file loader. It:
- Detects the file extension (e.g.,
.mzML
,.hdf
, etc.) - Dispatches to the appropriate reader (e.g.,
mzml_to_df
,read_parquet
) - Normalizes the schema to include the following standard columns:
These columns are crucial for MS-MINT processing and analysis.
Special Cases and Notes¶
Time Unit Handling¶
- mzXML and mzML files may report scan times in minutes, but MS-MINT normalizes this to seconds.
Thermo RAW Parquet Files¶
- If you load a
.parquet
file not already in MS-MINT format, MS-MINT attempts to reformat it usingformat_thermo_raw_file_reader_parquet()
.
mzMLb Support¶
- Only works if the optional dependency
pyteomics.mzmlb
is available.