Metadata-Version: 2.1
Name: MEDS_tabularization
Version: 0.0.1
Summary: Scalable Tabularization of MEDS format Time-Series data
Author-email: Matthew McDermott <mattmcdermott8@gmail.com>, Nassim Oufattole <noufattole@gmail.com>, Teya Bergamaschi <teyabergamaschi@gmail.com>
Project-URL: Homepage, https://github.com/mmcdermott/MEDS_Tabular_AutoML
Project-URL: Issues, https://github.com/mmcdermott/MEDS_Tabular_AutoML/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: polars
Requires-Dist: pyarrow
Requires-Dist: loguru
Requires-Dist: hydra-core
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: pandas
Requires-Dist: tqdm
Requires-Dist: xgboost
Requires-Dist: scikit-learn
Requires-Dist: hydra-optuna-sweeper
Requires-Dist: hydra-joblib-launcher
Requires-Dist: ml-mixins
Provides-Extra: dev
Requires-Dist: pre-commit ; extra == 'dev'
Provides-Extra: profiling
Requires-Dist: mprofile ; extra == 'profiling'
Requires-Dist: matplotlib ; extra == 'profiling'
Provides-Extra: tests
Requires-Dist: pytest ; extra == 'tests'
Requires-Dist: pytest-cov ; extra == 'tests'
Requires-Dist: rootutils ; extra == 'tests'

# Scalable tabularization and tabular feature usage utilities over generic MEDS datasets

This repository provides utilities and scripts to run limited automatic tabular ML pipelines for generic MEDS
datasets.

#### Q1: What do you mean "tabular pipelines"? Isn't _all_ structured EHR data already tabular?

This is a common misconception. _Tabular_ data refers to data that can be organized in a consistent, logical
set of rows/columns such that the entirety of a "sample" or "instance" for modeling or analysis is contained
in a single row, and the set of columns possibly observed (there can be missingness) is consistent across all
rows. Structured EHR data does not satisfy this definition, as we will have different numbers of observations
of medical codes and values at different timestamps for different patients, so it cannot simultanesouly
satisfy the (1) "single row single instance", (2) "consistent set of columns", and (3) "logical" requirements.
Thus, in this pipeline, when we say we will produce a "tabular" view of MEDS data, we mean a dataset that can
realize these constraints, which will explicitly involve summarizing the patient data over various historical
or future windows in time to produce a single row per patient with a consistent, logical set of columns
(though there may still be missingness).

#### Q2: Why not other systems?

- [TemporAI](https://github.com/vanderschaarlab/temporai) is the most natural competitor, and already
  supports AutoML capabilities. However, TemporAI (as of now) does not support generic MEDS datasets, and it
  is not clear if their AutoML systems will scale to the size of datasets we need to support. But, further
  investigation is needed, and it may be the case that the best solution here is simply to write a custom
  data source for MEDS data within TemporAI and leverage their tools.

# Installation

Clone this repository and install the requirements by running `pip install .` in the root directory.

# Usage

This repository consists of two key pieces:

1. Construction of and efficient loading of tabular (flat, non-longitudinal) summary features describing
   patient records in MEDS over arbitrary time-windows (e.g. 1 year, 6 months, etc.) either backwards or
   forwards in time from a given index date. Naturally, only "look-back" windows should be used for
   future-event prediction tasks; however, the capability to summarize "look-ahead" windows is also useful
   for characterizing and describing the differences between patient populations statistically.
2. Running basic AutoML pipelines over these tabular features to predict arbitrary binary classification
   downstream tasks defined over these datasets. The "AutoML" part of this is not particularly advanced --
   what is more advanced is the efficient construction, storage, and loading of tabular features for the
   candidate AutoML models, enabling a far more extensive search over different featurization strategies.

### Scripts and Examples

See `tests/test_tabularize_integration.py` for an example of the end-to-end pipeline being run on synthetic data. This
script is a functional test that is also run with `pytest` to verify the correctness of the algorithm.

#### Core Scripts:

1. `scripts/identify_columns.py` loads all training shard to identify which feature columns
   to generate tabular data for.
2. `scripts/tabularize_static.py` Iterates through shards and generates tabular vectors for
   each patient. There is a single row per patient for each shard.
3. `scripts/summarize_over_windows.py` For each shard, iterates through window sizes and aggregations to and
   horizontally concatenates the outputs to generate the final tabular representations at every event time for
   every patient.
4. `scripts/tabularize_merge` Aligns the time-series window aggregations (generated in the previous step) with
   the static tabular vectors and caches them for training.
5. `scripts/hf_cohort/aces_task_extraction.py` Generates the task labels and caches them with the event_id
   indexes which align them with the nearest prior event in the tabular data.
6. `scripts/xgboost_sweep.py` Tunes XGboost on methods. Iterates through the labels and corresponding tabular data.

We run this on an example dataset using the following bash scripts in sequence:

```bash
bash hf_cohort_shard.sh  # processes the dataset into meds format
bash hf_cohort_e2e.sh  # performs (steps 1-4 above)
bash hf_cohort/aces_task.sh  # generates labels (step 5)
bash xgboost.sh  # trains xgboos (step 6)
```

## Feature Construction, Storage, and Loading

Tabularization of a (raw) MEDS dataset is done by running the `scripts/data/tabularize.py` script. This script
must inherently do a base level of preprocessing over the MEDS data, then will construct a sharded tabular
representation that respects the overall sharding of the raw data. This script uses [Hydra](https://hydra.cc/)
to manage configuration, and the configuration file is located at `configs/tabularize.yaml`.

## AutoML Pipelines

# TODOs

1. Leverage the "event bound aggregation" capabilities of [ESGPT Task
   Select](https://github.com/justin13601/ESGPTTaskQuerying/) to construct tabular summary features for
   event-bound historical windows (e.g., until the prior admission, until the last diagnosis of some type,
   etc.).
2. Support more feature aggregation functions.
3. Probably rename this repository, as the focus is really more on the tabularization and feature usage
   utilities than on the AutoML pipelines themselves.
4. Import, rather than reimplement, the mapper utilities from the MEDS preprocessing repository.
5. Investigate the feasibility of using TemporAI for this task.
6. Consider splitting the feature construction and AutoML pipeline parts of this repository into separate
   repositories.

# YAML Configuration File

- `MEDS_cohort_dir`: directory of MEDS format dataset that is ingested.
- `tabularized_data_dir`: output directory of tabularized data.
- `min_code_inclusion_frequency`: The base feature inclusion frequency that should be used to dictate
  what features can be included in the flat representation. It can either be a float, in which
  case it applies across all measurements, or `None`, in which case no filtering is applied, or
  a dictionary from measurement type to a float dictating a per-measurement-type inclusion
  cutoff.
- `window_sizes`: Beyond writing out a raw, per-event flattened representation, the dataset also has
  the capability to summarize these flattened representations over the historical windows
  specified in this argument. These are strings specifying time deltas, using this syntax:
  `link`\_. Each window size will be summarized to a separate directory, and will share the same
  subject file split as is used in the raw representation files.
- `codes`: A list of codes to include in the flat representation. If `None`, all codes will be included
  in the flat representation.
- `aggs`: A list of aggregations to apply to the raw representation. Must have length greater than 0.
- `n_patients_per_sub_shard`: The number of subjects that should be included in each output file.
  Lowering this number increases the number of files written, making the process of creating and
  leveraging these files slower but more memory efficient.
- `do_overwrite`: If `True`, this function will overwrite the data already stored in the target save
  directory.
- `seed`: The seed to use for random number generation.
