Metadata-Version: 2.1
Name: align4d
Version: 1.2.3
Summary: align4d: Multi-sequence alignment tools for aligning ASR and Speaker Diarization result
Author-email: Peilin Wu <pwu54@emory.edu>
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: Microsoft :: Windows :: Windows 10
Classifier: Operating System :: Microsoft :: Windows :: Windows 11
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX :: Linux
Project-URL: Bug Tracker, https://github.com/emorynlp/align4d/issues
Project-URL: Homepage, https://github.com/emorynlp/align4d

# User Instruction

## Introduction

**align4d** is a Python package for aligning the text output of Speaker Diarization and Speech Recognition to a gold-standard transcript, especially when speakers overlap. This user manual provides a step-by-step guide on how to install, use, and troubleshoot the package.

## Mechanism

**align4d** uses a global alignment algorithm, a multi-sequence variant of the Needleman-Wunsch algorithm, to align the hypothesis (the output of Speaker Diarization and Speech Recognition models) to the reference (usually a gold-standard transcript, which is split into multiple sequences when there are multiple speakers). Alignment happens at the token level. For long sequences, **align4d** automatically splits the sequence into smaller segments by finding the absolutely aligned parts (called barriers), aligns the segments separately, and finally assembles them.
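
For intuition only, here is a minimal sketch of the textbook two-sequence Needleman-Wunsch algorithm on token lists. It is not the multi-sequence variant that **align4d** implements in C++; the function name `needleman_wunsch` and the scores (`match=1`, `mismatch=-1`, `gap=-1`) are illustrative assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists; gaps are returned as empty strings."""
    n, m = len(a), len(b)

    def pair_score(i, j):
        # Score for aligning a[i-1] with b[j-1] (exact comparison only).
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align both tokens
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("")
            i -= 1
        else:
            aligned_a.append("")
            aligned_b.append(b[j - 1])
            j -= 1
    return aligned_a[::-1], aligned_b[::-1]


hyp_aligned, ref_aligned = needleman_wunsch(
    ["I", "am", "a", "fish"], ["I", "am", "fish"]
)
print(hyp_aligned)  # ['I', 'am', 'a', 'fish']
print(ref_aligned)  # ['I', 'am', '', 'fish']
```

The score table grows with the product of the two sequence lengths, which is one reason splitting long inputs into segments helps keep memory use manageable.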

**align4d** uses the Levenshtein Distance to measure the similarity between tokens during alignment. Each aligned position falls into one of 4 cases:

1. Fully match. Two tokens are exactly the same (Levenshtein Distance is 0).
2. Partially match. Two tokens are not exactly the same, but the Levenshtein Distance between them is within a boundary.
3. Mismatch. Two tokens are different and the Levenshtein Distance between them exceeds the boundary.
4. Gap. Only one token is present because it is aligned to a gap (insertion or deletion of tokens).
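
The four cases above can be sketched in pure Python as follows. This is an illustration, not the package's C++ implementation: the helper names `levenshtein` and `classify` are hypothetical, the comparison is assumed to be case-sensitive, and a gap is represented by an empty string.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion, and substitution each cost 1."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def classify(hyp_token: str, ref_token: str, partial_bound: int = 2) -> str:
    """Label one aligned position; an empty string stands for a gap."""
    if hyp_token == "" or ref_token == "":
        return "gap"
    distance = levenshtein(hyp_token, ref_token)
    if distance == 0:
        return "fully match"
    if distance <= partial_bound:
        return "partially match"
    return "mismatch"


print(classify("fish", "fish"))    # fully match
print(classify("fish", "fist"))    # partially match (distance 1)
print(classify("hello", "world"))  # mismatch (distance 4)
print(classify("ok", ""))          # gap
```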

## Installation

To install **align4d**, you need to have Python version 3.10 or 3.11. Follow these steps:

1. Open your terminal or command prompt.
2. Type in the following command: `pip install align4d`
3. Wait for the package to download and install.

## Usage

### Importing align4d

To use **align4d** in your Python code, you need to import it. Here's how:

```python
from align4d import align
```

### Compile

Before any alignment can run, **align4d** must compile the C++ source code distributed with the package.
Compilation requires a recent compiler that supports C++20.

1. For Windows, install or update to the latest version of Visual Studio with the latest version of Visual C++ (or Visual Studio version >= 17.4.4).
2. For macOS, install or update to the latest version of Xcode with Apple Clang (or Xcode version >= 14.3 with Apple Clang version >= 14.0.3).
3. For Linux, install or update to the latest version of GCC with G++ (or GCC version >= 11.2.0).

To compile the C++ source code, use the function `align.compile()`:

```python
align.compile()
```

At this stage, run only `align.compile()`; do not run any of the alignment functions introduced in the following sections.
Once compilation has finished, you do not need to (and should not) call this function again when doing alignment. You do need to rerun `align.compile()` after switching to a new environment or reinstalling **align4d**.

### Aligning Text Results

**align4d** can align results from Speaker Diarization and Speech Recognition. For simple, straightforward usage, call the function like this:

```python
aligned_result = align.align(hypothesis, reference)
```

Here's the overview of all parameters of the function:

```python
aligned_result = align.align(hypothesis: str | list[str], reference: list[list], partial_bound: int = 2, segment_length: int = None, barrier_length: int = None, strip_punctuation: bool = True)
```

The `align()` function takes 6 parameters: `hypothesis` and `reference` are required, and the other 4 are optional.

1. `hypothesis`: A list of strings (tokens) or a single string of text. Each token is a word generated by the Speech Recognition model. It is suggested to remove all punctuation, escape sequences, and any other characters that are not part of the natural language.
    
    ```python
    hypothesis = ["ok", "I", "am", "a", "fish", "Are", "you", "Hello", "there", "How", "are", "you", "ok"]
    # or 
    hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
    ```
    
2. `reference`: A nested list containing the speaker labels and utterances from the gold-standard text. In each inner list, the first element is the speaker label (a string) and the second element is the utterance, given either as a string or as a list of strings where each string is a token. It is suggested to remove all punctuation, escape sequences, and any other characters that are not part of the natural language.
    
    ```python
    reference = [
        ["A", "I am a fish."],
        ["B", "okay."],
        ["C", "Are you?"],
        ["D", "Hello there."],
        ["E", "How are you?"]
    ]
    # or 
    reference = [
        ["A", ["I", "am", "a", "fish."]],
        ["B", ["okay."]],
        ["C", ["Are", "you?"]],
        ["D", ["Hello", "there."]],
        ["E", ["How", "are", "you?"]]
    ]
    ```
    
3. `partial_bound`: This is an integer that specifies the boundary between partially match and mismatch in terms of the Levenshtein Distance between the two tokens in comparison. This is an optional parameter and the default value is 2.
4. `segment_length`: An integer that specifies the minimum length of each segment, measured in hypothesis tokens. When both `segment_length` and `barrier_length` are provided, the program performs manual segmentation of long sequences with these values before the actual alignment.
    
    If `segment_length` and `barrier_length` are not provided and the hypothesis is longer than 100 tokens, the program automatically searches for and uses the optimal `segment_length` between 30 and 120.
    
    If `segment_length` and `barrier_length` are not provided and the hypothesis is shorter than 100 tokens, no segmentation is performed.
    
    If `segment_length` and `barrier_length` are both provided and both are integers less than or equal to 0, no segmentation is performed.
    
    It is strongly suggested to use automatic or manual segmentation when the input sequences are long; otherwise the alignment may fail because it runs out of RAM.
    
    For manual segmentation, `segment_length` and `barrier_length` must be provided together; otherwise an Exception is raised:
    
    ```python
    Exception: Segment length or barrier length parameter incorrect or missing.
    ```
    
5. `barrier_length`: An integer that specifies the length, in tokens, of the parts used to detect the absolutely aligned parts (barriers). This parameter is optional and defaults to 6 if not specified. When both `segment_length` and `barrier_length` are provided, the program performs manual segmentation of long sequences with these values before the actual alignment.
    
    For manual segmentation, `segment_length` and `barrier_length` must be provided together; otherwise an Exception is raised:
    
    ```python
    Exception: Segment length or barrier length parameter incorrect or missing.
    ```
    
6. `strip_punctuation`: A boolean that specifies whether **align4d** strips all punctuation from the hypothesis and reference before aligning, which gives a more accurate alignment. The default is `True`; the output still presents the aligned tokens with their original punctuation.
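
The segmentation rules described for `segment_length` and `barrier_length` can be summarized as a small decision function. This is a sketch of the documented behavior, not the package's code: the name `segmentation_mode` is hypothetical, a hypothesis of exactly 100 tokens is assumed to fall on the "no segmentation" side, and a mix of one positive and one non-positive value is assumed to be rejected, since the text above does not cover those two cases.

```python
def segmentation_mode(hypothesis_length, segment_length=None, barrier_length=None):
    """Return "auto", "manual", or "none" according to the documented rules."""
    if segment_length is None and barrier_length is None:
        # Neither given: automatic segmentation only for long hypotheses.
        # Assumption: exactly 100 tokens counts as "not over 100".
        return "auto" if hypothesis_length > 100 else "none"
    if segment_length is None or barrier_length is None:
        # Only one given: the documented error case.
        raise ValueError(
            "Segment length or barrier length parameter incorrect or missing."
        )
    if segment_length <= 0 and barrier_length <= 0:
        # Both given and non-positive: segmentation explicitly disabled.
        return "none"
    if segment_length > 0 and barrier_length > 0:
        return "manual"
    # Assumption: mixed signs are treated as an incorrect parameter.
    raise ValueError(
        "Segment length or barrier length parameter incorrect or missing."
    )


print(segmentation_mode(50))                                        # none
print(segmentation_mode(500))                                       # auto
print(segmentation_mode(500, segment_length=60, barrier_length=6))  # manual
print(segmentation_mode(500, segment_length=0, barrier_length=0))   # none
```

In an actual call, the same values would be passed as keyword arguments, for example `align.align(hypothesis, reference, segment_length=60, barrier_length=6)`; the numbers 60 and 6 are only illustrative.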

The `align()` function returns a dictionary containing the aligned results. The key `"hypothesis"` maps to the aligned hypothesis as a list of strings (tokens). The key `"reference"` maps to a secondary dictionary in which the reference is separated into one sequence per speaker: each key is a speaker label and each value is that speaker's list of tokens. Tokens at the same index in these lists are aligned with each other, and a gap is denoted by `""` (an empty string). If the input contains punctuation, it is preserved in the output.

```python
import json

from align4d import align

hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
reference = [
        ["A", "I am a fish. "],
        ["B", "okay. "],
        ["C", "Are you? "],
        ["D", "Hello there. "],
        ["E", "How are you? "]
]
align_result = align.align(hypothesis, reference)
print(json.dumps(align_result, indent=4))
```

Sample output from `align()`:

```python
# content in align_result
{
    "hypothesis": ['ok', 'I', 'am', 'a', 'fish.', 'Are', 'you?', 'Hello', 'there.', 'How', 'are', 'you?', 'ok'],
    "reference": {
        "A": ['', 'I', 'am', 'a', 'fish.', '', '', '', '', '', '', '', ''],
        "B": ['okay.', '', '', '', '', '', '', '', '', '', '', '', ''],
        "C": ['', '', '', '', '', 'Are', 'you?', '', '', '', '', '', ''],
        "D": ['', '', '', '', '', '', '', 'Hello', 'there.', '', '', '', ''],
        "E": ['', '', '', '', '', '', '', '', '', 'How', 'are', 'you?', '']
    }
}
```
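
Because every list in the result shares the same indexing, the structure can be read column by column. The sketch below, which uses a hypothetical helper `speakers_per_position` and the sample result shown above, lists which speakers have a non-gap token at each aligned position. It returns a list per position because overlapping speech may put more than one speaker at the same index.

```python
def speakers_per_position(align_result):
    """For each aligned index, list the speakers whose token is not a gap."""
    reference = align_result["reference"]
    length = len(align_result["hypothesis"])
    return [
        [speaker for speaker, tokens in reference.items() if tokens[i] != ""]
        for i in range(length)
    ]


# The sample result from above, written compactly.
align_result = {
    "hypothesis": ["ok", "I", "am", "a", "fish.", "Are", "you?",
                   "Hello", "there.", "How", "are", "you?", "ok"],
    "reference": {
        "A": [""] + ["I", "am", "a", "fish."] + [""] * 8,
        "B": ["okay."] + [""] * 12,
        "C": [""] * 5 + ["Are", "you?"] + [""] * 6,
        "D": [""] * 7 + ["Hello", "there."] + [""] * 4,
        "E": [""] * 9 + ["How", "are", "you?"] + [""],
    },
}

print(speakers_per_position(align_result))
# [['B'], ['A'], ['A'], ['A'], ['A'], ['C'], ['C'], ['D'], ['D'], ['E'], ['E'], ['E'], []]
```

The empty list at the last position reflects the final `ok` in the hypothesis, which is aligned to a gap in every reference sequence.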

### Retrieve token match result

Based on the alignment result, this tool provides a function to retrieve the matching result (fully match, partially match, mismatch, or gap) for each token. Use `token_match()` to retrieve the token-level matching result.

The criteria for determining the matching result are the following (also described in **Mechanism**):

1. fully match: Levenshtein Distance = 0
2. partially match: 0 < Levenshtein Distance ≤ boundary (default 2)
3. mismatch: Levenshtein Distance > boundary (default 2)
4. gap: aligned to a gap

`token_match()` takes 3 parameters: the required `align_result`, which is the direct return value of `align()`; an optional `partial_bound`, which must be the same as the `partial_bound` used in `align()` (default 2); and an optional `strip_punctuation`, which must be the same as the `strip_punctuation` used in `align()` (default `True`).

```python
hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
reference = [
        ["A", "I am a fish. "],
        ["B", "okay. "],
        ["C", "Are you? "],
        ["D", "Hello there. "],
        ["E", "How are you? "]
]
align_result = align.align(hypothesis, reference)
token_match_result = align.token_match(align_result)
print(token_match_result)
```

The return value is a list of strings, one per aligned position. Each string is one of `fully match`, `partially match`, `mismatch`, or `gap`.

```python
# possible output for token_match()
['mismatch', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'gap']
```
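
A list like this is easy to summarize with the standard library. The following sketch counts each label in the sample output above and computes the share of fully matched positions; the variable names are illustrative, and the choice of summary statistic is an assumption rather than something the package provides.

```python
from collections import Counter

# The sample output from above: 1 mismatch, 11 full matches, 1 gap.
token_match_result = ["mismatch"] + ["fully match"] * 11 + ["gap"]

counts = Counter(token_match_result)
total = len(token_match_result)
full_match_rate = counts["fully match"] / total

for label in ("fully match", "partially match", "mismatch", "gap"):
    # Counter returns 0 for labels that never occur.
    print(f"{label}: {counts[label]}")
print(f"fully matched positions: {counts['fully match']} of {total}")
```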

### Retrieve mapping from reference to hypothesis

Based on the alignment result, this tool provides a function to retrieve the mapping from each token in the reference sequences to the hypothesis sequence. For each non-gap token (fully match, partially match, or mismatch) in the separated reference sequences, the mapping gives the position (index) of the corresponding token in the hypothesis sequence. An index of -1 means that the reference token is not aligned to any token in the hypothesis (it is aligned to a gap).

To achieve this, use the function `align_indices()`. It takes 2 parameters: the required `align_result`, which is the direct return value of `align()`, and an optional `strip_punctuation`, which must be the same as the `strip_punctuation` used in `align()` (default `True`).

```python
hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
reference = [
        ["A", "I am a fish. "],
        ["B", "okay. "],
        ["C", "Are you? "],
        ["D", "Hello there. "],
        ["E", "How are you? "]
]
align_result = align.align(hypothesis, reference)
align_indices = align.align_indices(align_result)
print(align_indices)
```

The return value is a dictionary that maps each speaker label to a list of integers. Each integer is the index of the hypothesis token to which the corresponding token of that reference sequence is mapped (for example, the first token in sequence "C" is mapped to the hypothesis token with index 5).

```python
# possible output
{
    'A': [1, 2, 3, 4], 
    'B': [0], 
    'C': [5, 6], 
    'D': [7, 8], 
    'E': [9, 10, 11]
}
```
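
As a usage sketch, the indices can be turned into (reference token, hypothesis token) pairs. The helper `pair_reference_with_hypothesis` is hypothetical, and it assumes that the indices refer to positions in the hypothesis token list with gaps removed, which holds for the sample above where the hypothesis contains no gaps. The sample data is trimmed to speakers "B" and "C" for brevity.

```python
def pair_reference_with_hypothesis(align_result, indices_by_speaker):
    """Pair every reference token with its hypothesis token (None for -1)."""
    # Assumption: indices point into the gap-free hypothesis token list.
    hyp_tokens = [t for t in align_result["hypothesis"] if t != ""]
    pairs = {}
    for speaker, indices in indices_by_speaker.items():
        # Reference tokens of this speaker, in order, without gaps.
        ref_tokens = [t for t in align_result["reference"][speaker] if t != ""]
        pairs[speaker] = [
            (ref_token, hyp_tokens[i] if i != -1 else None)
            for ref_token, i in zip(ref_tokens, indices)
        ]
    return pairs


align_result = {
    "hypothesis": ["ok", "I", "am", "a", "fish.", "Are", "you?",
                   "Hello", "there.", "How", "are", "you?", "ok"],
    "reference": {
        "B": ["okay."] + [""] * 12,
        "C": [""] * 5 + ["Are", "you?"] + [""] * 6,
    },
}
indices_by_speaker = {"B": [0], "C": [5, 6]}

pairs = pair_reference_with_hypothesis(align_result, indices_by_speaker)
print(pairs["B"])  # [('okay.', 'ok')]
print(pairs["C"])  # [('Are', 'Are'), ('you?', 'you?')]
```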

## Troubleshooting
This package currently only supports Windows 10/11 x86_64, Linux x86_64 (tested with Ubuntu 22.04), and macOS (M-series processor or Intel processor). 

If you encounter any issues while using **align4d**, try the following:

1. Make sure you have installed Python version 3.10 or 3.11.
2. For compilation, make sure you have a compiler that supports C++20. The compilers can be acquired as follows:
   1. For Windows, install or update to the latest version of Visual Studio with the latest version of Visual C++ (or Visual Studio version >= 17.4.4).
   2. For macOS, install or update to the latest version of Xcode with Apple Clang (or Xcode version >= 14.3 with Apple Clang version >= 14.0.3).
   3. For Linux, install or update to the latest version of GCC with G++ (or GCC version >= 11.2.0).
3. If you run into permission or access issues during compilation, manually delete all compiled objects (files ending in .so, .pyd, or .dll) in the package directory that contains align.py.
4. Do not run `align.compile()` with other align functions (`align()`, `token_match()`, `align_indices()`) at the same time.
5. Make sure you have installed the latest version of **align4d**.
6. Check the input data to make sure it is in the correct format.
   1. All the input strings must be encoded in the utf-8 format.
   2. Characters that are valid utf-8 but not part of the natural language may affect the alignment performance. Remove them unless you are sure about how these characters are used.
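
One possible way to prepare input along these lines is sketched below with the standard library only. The helper `clean_text` is hypothetical and not part of **align4d**; treating Unicode control and format characters (general categories starting with "C") as "not part of the natural language" is an assumption, so adjust the filter to your data.

```python
import unicodedata


def clean_text(text: str) -> str:
    """Normalize text and drop control/format characters before alignment."""
    text = unicodedata.normalize("NFC", text)
    kept = []
    for ch in text:
        if ch.isspace():
            kept.append(" ")  # tabs and newlines become plain spaces
        elif unicodedata.category(ch).startswith("C"):
            continue          # drop control and format characters
        else:
            kept.append(ch)
    # Collapse runs of spaces and trim both ends.
    return " ".join("".join(kept).split())


# Decoding raises UnicodeDecodeError if the bytes are not valid utf-8.
raw = "Hello\tthere\u200b.\x00 How\nare you?".encode("utf-8")
print(clean_text(raw.decode("utf-8")))  # Hello there. How are you?
```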
