Metadata-Version: 2.4
Name: cdiscbuilder
Version: 1.2.1
Summary: A package to convert ODM XML to SDTM/ADaM Datasets
Author-email: Ming-Chun Chen <hellomingchun@gmail.com>
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: pyyaml
Requires-Dist: polars>=0.20.0
Requires-Dist: reactable>=0.1.6
Dynamic: license-file

# CDISC Builder

**`cdiscbuilder`** is a Python package designed to simplify the transformation of clinical trial data from **ODM (Operational Data Model)** XML format into **CDISC SDTM (Study Data Tabulation Model)** and **ADaM (Analysis Data Model)** datasets.

It provides a flexible, configuration-driven approach to data mapping, allowing users to define rules in simple YAML files or Python dictionaries without hardcoding complex logic.

## Key Features

-   **ODM XML Parsing**: Efficiently parses CDISC ODM strings and files into workable dataframes.
-   **Configurable Mappings**: Define your mapping rules (source columns, hardcoded values, custom logic) in YAML.
-   **Schema Validation**: Ensures your configuration files adhere to strict standards before processing.
-   **Metadata-Driven Findings**: Powerful processor for Findings domains (VS, LB, FA, etc.) using granular metadata.
-   **Excel/Parquet Output**: Writes the generated SDTM/ADaM datasets as Excel or Parquet files.

## Installation

```bash
pip install cdiscbuilder
```

## Quick Start

### 1. Command Line Interface

You can generate datasets directly from your terminal:

```bash
# Generate SDTM datasets from an ODM XML file
cdisc-sdtm --xml study_data.xml --output ./sdtm_data
```

### 2. Python API

```python
from cdiscbuilder.sdtm import create_sdtm_datasets

# Define paths
xml_file = "study_data.xml"
config_dir = "path/to/my/specs" 
output_dir = "./sdtm_outputs"

# Generate Datasets
create_sdtm_datasets(config_dir, xml_file, output_dir)
```

## Configuration

For detailed and complete references on how to structure mapping specifications, see:
* **[SDTM Mapping Specification](file:///c:/Users/mingc/Documents/projects/cdiscbuilder/cdisc_builder/docs/SDTM_MAPPING_SPECIFICATION.md)**: Details on the intermediate `odm_long.csv` schema, wide vs. findings domains, regex extraction, and validation rules.
* **[ADaM Mapping Specification](file:///c:/Users/mingc/Documents/projects/cdiscbuilder/cdisc_builder/docs/ADAM_MAPPING_SPECIFICATION.md)**: Details on the ADaM YAML configuration schema, inheritance, SQL-like derivations, conditions, and aggregations.

The package comes with standard configurations for common domains (`DM`, `AE`, `VS`, etc.) in `src/cdisc_builder/specs`. You can override these or add new ones by creating your own configuration directory.

### Example YAML (`DM.yaml`)

```yaml
DM:
    - formoid: "FORM.DEMOG"
      keys: ["StudyOID", "StudySubjectID"]
      columns:
          STUDYID:
              source: StudyOID
              type: str
          USUBJID:
              source: StudySubjectID
              prefix: "PPT-"
              type: str
          AGE:
              source: IT.AGE
              type: int
          SEX:
              source: I_DEMOG_SEX
              type: str
              value_mapping:
                  "M": "Male"
                  "F": "Female"

```
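
To make the rule semantics above concrete, here is a minimal sketch of how one column rule might be applied to a single record. The helper `apply_column_rule`, the sample record, and the order of operations (prefix, then value mapping, then type cast) are assumptions made for illustration, not the package's actual implementation.

```python
# Hypothetical helper, not part of cdiscbuilder: applies one column rule
# from a DM.yaml-style spec to one flat record.
CASTS = {"str": str, "int": int, "float": float}


def apply_column_rule(record, rule):
    value = record.get(rule["source"])
    if value is None:
        return None
    if "prefix" in rule:
        value = f"{rule['prefix']}{value}"
    if "value_mapping" in rule:
        # Values without a mapping entry pass through unchanged.
        value = rule["value_mapping"].get(value, value)
    # Assumed order: the type cast runs last.
    return CASTS[rule.get("type", "str")](value)


# The column rules from the DM example, written as a Python dictionary.
dm_columns = {
    "STUDYID": {"source": "StudyOID", "type": "str"},
    "USUBJID": {"source": "StudySubjectID", "prefix": "PPT-", "type": "str"},
    "AGE": {"source": "IT.AGE", "type": "int"},
    "SEX": {
        "source": "I_DEMOG_SEX",
        "type": "str",
        "value_mapping": {"M": "Male", "F": "Female"},
    },
}

# Made-up input record for illustration.
record = {
    "StudyOID": "S_DEMO",
    "StudySubjectID": "001",
    "IT.AGE": "34",
    "I_DEMOG_SEX": "F",
}

dm_row = {name: apply_column_rule(record, rule) for name, rule in dm_columns.items()}
print(dm_row)
```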

### Findings Domains (Dynamic Mapping)

For domains like `IE`, `LB`, `FA` where many input items map to a single `Test Code` / `Test Name` pair, use `type: finding`.

```yaml
IE:
  - type: finding
    formoid: "F_ELIGIBILITY"
    # Filter rows using Regex
    item_group_regex: "IG_ELIGI_.*"
    item_oid_regex: "I_ELIGI_.*"
    
    columns:
      # Extract part of the OID for the Short Code
      IETESTCD:
        source: ItemOID
        regex_extract: "I_ELIGI_(.*)"
      
      # Use Metadata from parsed XML for the Description
      IETEST:
        source: Metadata.Question
      
      IEORRES:
        source: Value
```
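
The filtering and extraction steps in this spec can be sketched with the standard `re` module. The row keys (`FormOID`, `ItemGroupOID`, `ItemOID`, `Value`), the `question_by_oid` lookup standing in for `Metadata.Question`, and the use of full-string matching are all assumptions for illustration; the package's internal data layout may differ.

```python
import re


def build_ie_rows(odm_rows, question_by_oid):
    """Hypothetical sketch: turn long-format ODM rows into IE findings rows."""
    group_re = re.compile(r"IG_ELIGI_.*")      # item_group_regex
    item_re = re.compile(r"I_ELIGI_.*")        # item_oid_regex
    extract_re = re.compile(r"I_ELIGI_(.*)")   # regex_extract

    findings = []
    for row in odm_rows:
        if row["FormOID"] != "F_ELIGIBILITY":
            continue
        # Assumption: both regexes must match the whole OID.
        if not group_re.fullmatch(row["ItemGroupOID"]):
            continue
        if not item_re.fullmatch(row["ItemOID"]):
            continue
        match = extract_re.search(row["ItemOID"])
        findings.append(
            {
                "IETESTCD": match.group(1),
                "IETEST": question_by_oid.get(row["ItemOID"]),
                "IEORRES": row["Value"],
            }
        )
    return findings


# Made-up input rows for illustration.
odm_rows = [
    {"FormOID": "F_ELIGIBILITY", "ItemGroupOID": "IG_ELIGI_MAIN",
     "ItemOID": "I_ELIGI_INCL01", "Value": "Y"},
    {"FormOID": "F_ELIGIBILITY", "ItemGroupOID": "IG_ELIGI_MAIN",
     "ItemOID": "I_OTHER_NOTE", "Value": "n/a"},
    {"FormOID": "F_DEMOG", "ItemGroupOID": "IG_DEMOG",
     "ItemOID": "I_DEMOG_SEX", "Value": "F"},
]
questions = {"I_ELIGI_INCL01": "Age 18 or older?"}

ie_rows = build_ie_rows(odm_rows, questions)
print(ie_rows)
```

Each matching input item becomes one output row, which is what lets many items share a single test code / test name column pair.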

### Advanced Mapping Features

**Prefixing**:
```yaml
USUBJID:
  source: StudySubjectID
  prefix: "PPT-"
```

**Substring Extraction** (extracts chars 3-5 before mapping):
```yaml
SITEID:
  source: FULL_ID
  substring_start: 3
  substring_length: 3
```

**Fallback** (use secondary source if primary is missing):
```yaml
SUBJID:
  source: RFSTDTC
  fallback: USUBJID
```

**Default Values**:
```yaml
CUSTOM_COL:
  source: ORG_COL
  value_mapping:
    "A": "Alpha"
  mapping_default: "Other" # used if not A
  # mapping_default_source: "AnotherCol" # Fallback to column value
```
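
As a rough illustration of the default-value behaviour, the sketch below resolves a value through `value_mapping` and then falls back to a default. The helper `map_with_default` is hypothetical, and the choice to check `mapping_default_source` before `mapping_default` when both are present is an assumption; the package may resolve that case differently.

```python
def map_with_default(record, rule):
    """Hypothetical sketch of value_mapping with a default fallback."""
    value = record.get(rule["source"])
    mapping = rule.get("value_mapping", {})
    if value in mapping:
        return mapping[value]
    # Assumed precedence: a default taken from another column wins
    # over a literal default when both are configured.
    if "mapping_default_source" in rule:
        return record.get(rule["mapping_default_source"])
    if "mapping_default" in rule:
        return rule["mapping_default"]
    # No default configured: keep the original value.
    return value


literal_rule = {
    "source": "ORG_COL",
    "value_mapping": {"A": "Alpha"},
    "mapping_default": "Other",
}
column_rule = {
    "source": "ORG_COL",
    "value_mapping": {"A": "Alpha"},
    "mapping_default_source": "AnotherCol",
}

mapped = map_with_default({"ORG_COL": "A"}, literal_rule)
defaulted = map_with_default({"ORG_COL": "B"}, literal_rule)
from_column = map_with_default({"ORG_COL": "B", "AnotherCol": "Beta"}, column_rule)
print(mapped, defaulted, from_column)
```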

**Case-Insensitive Mapping**:
By default, value mapping is case-sensitive. Set `case_sensitive: false` to match values regardless of case (e.g. "Yes", "yes", "YES" -> "Y"). Unmapped values keep their original casing.
```yaml
RESP:
  source: INPUT_VAL
  value_mapping:
    "Yes": "Y"
    "No": "N"
  case_sensitive: false
```
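
One way to picture this option is a lookup that folds case on both the mapping keys and the input value. The function `map_value` below is a hypothetical sketch under that assumption, not the package's code.

```python
def map_value(value, value_mapping, case_sensitive=True):
    """Hypothetical sketch: map a value, optionally ignoring case."""
    if case_sensitive:
        return value_mapping.get(value, value)
    # Fold the keys once, then look up the folded input value.
    folded = {key.casefold(): mapped for key, mapped in value_mapping.items()}
    # Unmapped values are returned untouched, so their casing is preserved.
    return folded.get(str(value).casefold(), value)


resp_mapping = {"Yes": "Y", "No": "N"}

insensitive = [map_value(v, resp_mapping, case_sensitive=False) for v in ("Yes", "yes", "YES")]
sensitive = map_value("yes", resp_mapping)
unmapped = map_value("Maybe", resp_mapping, case_sensitive=False)
print(insensitive, sensitive, unmapped)
```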

## Development

This project uses modern tools for testing and maintaining code quality:

### 1. Running Tests
Run the automated test suite using `pytest`:
```bash
# Using standard pip/virtual environment
pytest

# Using uv
uv run pytest
```

### 2. Code Quality
We use `black` for code formatting, `ruff` for linting, and `mypy` for type checking:

```bash
# Code Formatting (in-place rewrite)
uv run --with black black src/

# Linting and style checks
uv run --with ruff ruff check src/

# Type Checking
uv run --with mypy mypy src/
```

## License

[MIT License](LICENSE)
