Metadata-Version: 2.1
Name: corpus-preprocess
Version: 0.0.2
Summary: Building blocks for spacy custom tokenization and Matcher patterns
Home-page: https://mv3.dev
Author: Marcelino G. Veloso III
Author-email: contact@mv3.dev
Requires-Python: >=3.11,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: inflect (>=7.0,<8.0)
Requires-Dist: ipykernel (>=6.27.1,<7.0.0)
Requires-Dist: rich (>=13.7.0,<14.0.0)
Requires-Dist: roman (>=4.1,<5.0)
Requires-Dist: spacy[apple] (>=3.7.2,<4.0.0)
Requires-Dist: sqlite-utils (>=3.36,<4.0)
Project-URL: Documentation, https://justmars.github.io/corpus-preprocess
Project-URL: Repository, https://github.com/justmars/corpus-preprocess
Description-Content-Type: text/markdown

# corpus-preprocess

![Github CI](https://github.com/justmars/corpus-preprocess/actions/workflows/main.yml/badge.svg)

Utility functions for preprocessing Philippine legalese in [weasel](https://github.com/explosion/weasel)-based flows:

1. [lexcat-proj](https://github.com/justmars/lexcat-proj); and
2. [lexcat-multi](https://github.com/justmars/lexcat-multi)

> [!IMPORTANT]
> Relies on a private [corpus-assets](https://github.com/justmars/corpus-assets) repository that must be cloned locally.

The `corpus-assets` folder should have the following structure:

```yml
- data: # used as data folder in tokenization
  - single_tokens.json
  - report_publishers.json
- ents: # collected in `setup_span_ruler.py`
  - casenames.txt # each line is a clean case
  - clean_statute_titles.txt # each line is a clean title
- concepts: # collected in `setup_span_ruler.py`
  - political: # main subject category
      - bill_of_rights: # sub-topic
          - patterns.json # contains matcher patterns
          - q.txt # contains lines which can be used to query the database
- metas: # collected in `setup_span_ruler.py`
  - artifacts:
    - axiom:
      - patterns.json # same
      - q.txt # same
```
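Since the assets live in a separate private repository, it can help to check a local clone before building anything. The following is a minimal sketch: `missing_assets` and `EXPECTED_FILES` are hypothetical names that are not part of this package, and pairing every `patterns.json` with a `q.txt` is an assumption drawn from the listing above.

```py
from pathlib import Path

# Hypothetical helper, not part of corpus-preprocess: these entries
# mirror the fixed files in the folder structure listed above.
EXPECTED_FILES = (
    "data/single_tokens.json",
    "data/report_publishers.json",
    "ents/casenames.txt",
    "ents/clean_statute_titles.txt",
)


def missing_assets(root: Path) -> list[str]:
    """Return the expected files that are absent under `root`."""
    missing = [f for f in EXPECTED_FILES if not (root / f).is_file()]
    # Assumption: each concept folder with a patterns.json also has a q.txt.
    for patterns in sorted(root.glob("concepts/**/patterns.json")):
        q_file = patterns.with_name("q.txt")
        if not q_file.is_file():
            missing.append(q_file.relative_to(root).as_posix())
    return missing
```

An empty list means every file checked by this sketch was found.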

## Custom tokenizer

```py
import spacy

@spacy.registry.tokenizers("lex.tokenizer.v1")  # type: ignore
def lex_tokenize(data_folder: str):
    """
    The tokenizer:

    1. Removes dashes from infixes
    2. Adds prefix/suffix rules for parenthesis/brackets
    3. Adds special exceptions via the `data_folder`
    """
    def modify_tokenizer(nlp):
        data = import_data_tokens(validated_path(data_folder))
        nlp.tokenizer = customize_tokenizer(data)
        return nlp.tokenizer

    return modify_tokenizer


def create_base_nlp(base_model: str, data_folder: str):
    """
    Need to declare a new empty model to have custom tokenization then
    plug pipeline with required parts of a pre-trained model. The data
    folder modifies the tokenizer via `nlp.tokenizer.add_special_case()`
    """
    nlp = spacy.blank(
        name="en",
        config={
            "nlp": {
                "tokenizer": {
                    "@tokenizers": "lex.tokenizer.v1",
                    "data_folder": data_folder, # add special rules from third-party source
                }
            }
        },
    )
    # `exclude` takes a list of component names; a single comma-joined string is read as one name
    source_nlp = spacy.load(base_model, exclude=["ner", "senter"])
    nlp.vocab.vectors = source_nlp.vocab.vectors
    for name in source_nlp.pipe_names:
        nlp.add_pipe(name, source=source_nlp)
    return nlp
```

## SpanRuler from assets

Use the SpanRuler in tandem with the custom tokenizer, and filter overlapping spans so that only the longest are kept:

```py
from spacy.language import Language
from spacy.util import filter_spans
from preprocess import set_patterns_from_assets
import spacy

@Language.component(name="filter_added_spans")
def filter_added_spans(doc):
    doc.spans["ruler"] = filter_spans(doc.spans["ruler"])
    return doc

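# `nlp` is the pipeline from `create_base_nlp()`; `folder` is the path to the local corpus-assets clone
ruler = nlp.add_pipe("span_ruler", config={"phrase_matcher_attr": "LOWER"}, validate=True) # defaults to 'ruler' key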
patterns = set_patterns_from_assets(folder)
ruler.add_patterns(patterns)
nlp.add_pipe("filter_added_spans") # ensures only longest spans are included
nlp.to_disk("models/")  # will save entire directory which includes the pipeline
```
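spaCy documents `filter_spans` as preferring the longest span when spans overlap, and the first one when lengths tie. The sketch below restates that rule in plain Python on `(start, end)` token offsets. It is an illustration only, not spaCy's implementation, and `keep_longest` is a hypothetical name.

```py
def keep_longest(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Keep non-overlapping (start, end) token spans, preferring longer ones."""
    # Longest first; ties go to the span that starts earlier.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept: list[tuple[int, int]] = []
    seen: set[int] = set()
    for start, end in ordered:
        tokens = range(start, end)  # `end` is exclusive, as in spaCy spans
        if seen.isdisjoint(tokens):
            kept.append((start, end))
            seen.update(tokens)
    return sorted(kept)


print(keep_longest([(0, 2), (0, 5), (4, 6), (6, 7)]))  # [(0, 5), (6, 7)]
```

Here `(0, 2)` and `(4, 6)` are dropped because both share tokens with the longer `(0, 5)`.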

## Processes

### Generate queries

The `q.txt` lines will be used as criteria to fetch relevant segments from the database.

The db file should have an "opinion_segments" table with full-text search (FTS) enabled on the "text" column. `/scripts/extract.py`
utilizes [table.search()](https://sqlite-utils.datasette.io/en/stable/python-api.html#searching-with-table-search).

See [code](./corpus_preprocess/asset_extractors.py)

```py
from pathlib import Path

from sqlite_utils import Database


def extract_txt_from_db(
    source_db_file: str,
    path: Path,
    max_segments: int,
    min_char_segment: int = 100,
    max_char_segment: int = 3000,
    is_unique_txt: bool = True,
):
    """An FTS expression is auto-generated from the `q.txt` files found in the `path`. This
    expression is used to fetch the text segments that match the aggregated query."""
    db = Database(source_db_file)
    tbl = db["opinion_segments"]
    rows = tbl.search(  # type: ignore
        q=create_fts_expr(path),
        where="category='ruling' and char_count > :min_char and char_count < :max_char ",
        where_args={"min_char": min_char_segment, "max_char": max_char_segment},
        limit=max_segments,
        columns=["text", "id"],
    )
    if is_unique_txt:
        rows = filter_unique_texts(rows)
    return rows
```
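The function above relies on two helpers, `create_fts_expr()` and `filter_unique_texts()`, whose bodies are not shown here. The sketch below is only a guess at what such helpers could look like: `build_fts_expr` and `unique_by_text` are hypothetical stand-ins, and joining quoted phrases with `OR` is an assumption about how the `q.txt` lines are aggregated.

```py
from collections.abc import Iterable, Iterator


def build_fts_expr(lines: Iterable[str]) -> str:
    """Join non-empty query lines into one OR expression of quoted phrases."""
    phrases = []
    for line in lines:
        text = line.strip()
        if text:
            # Double any embedded quotes so each line stays a single phrase.
            phrases.append('"' + text.replace('"', '""') + '"')
    return " OR ".join(phrases)


def unique_by_text(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield rows in order, skipping any whose "text" value was already seen."""
    seen: set[str] = set()
    for row in rows:
        if row["text"] not in seen:
            seen.add(row["text"])
            yield row


print(build_fts_expr(["due process", "equal protection"]))
# "due process" OR "equal protection"
```

Quoting each line keeps multi-word queries together as phrases instead of letting them be parsed as separate terms.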

### Create matcher patterns

A [SpanRuler](https://spacy.io/api/spanruler) component will be based on `patterns.json` (with `q.txt` as phrases). These patterns are aggregated via `set_patterns_from_assets()`. See [code](./corpus_preprocess/setup_span_ruler.py):

```py
def set_patterns_from_assets(path: Path):
    axioms = axiom.collect_patterns(path.joinpath("meta"))
    concepts = create_concept_patterns(path.joinpath("concepts"))
    ents = extract_ents(path.joinpath("ents"))
    return axioms + concepts + ents
```
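To make the aggregation more concrete, here is a minimal sketch of how one group of patterns might be collected from a concepts folder. It is not the package's `create_concept_patterns()`: `collect_topic_patterns` is a hypothetical name, the `"concept"` label is made up, and treating `patterns.json` as a JSON list of token patterns is an assumption. The `id` mirrors the `topic/sub-topic` folder path.

```py
import json
from pathlib import Path


def collect_topic_patterns(concepts: Path, label: str = "concept") -> list[dict]:
    """Build SpanRuler-style pattern dicts from every topic folder in `concepts`."""
    patterns: list[dict] = []
    for file in sorted(concepts.glob("**/patterns.json")):
        # e.g. "political/bill_of_rights"
        topic_id = file.parent.relative_to(concepts).as_posix()
        # Assumption: patterns.json holds a list of token patterns.
        for token_pattern in json.loads(file.read_text()):
            patterns.append({"label": label, "pattern": token_pattern, "id": topic_id})
        # Each non-empty q.txt line is used as a literal phrase pattern.
        q_file = file.with_name("q.txt")
        if q_file.is_file():
            for line in q_file.read_text().splitlines():
                if line.strip():
                    patterns.append({"label": label, "pattern": line.strip(), "id": topic_id})
    return patterns
```

Dicts of this shape (`label`, `pattern`, optional `id`) are what `ruler.add_patterns()` accepts.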

### Categorize queried segments via patterns found

A [TextCategorizer](https://spacy.io/api/textcategorizer) component can be trained using the results of the span ruler. See the sample code:

```py
from collections import Counter

from spacy.language import Language
from spacy.tokens import Doc


@Language.factory(name="add_cats_from_spans")
class AddTextCatComponent:
    def __init__(self, nlp: Language, name: str, path: str):
        self.nlp = nlp
        options = list({p["id"].split("/")[0] for p in create_patterns(path)})  # type: ignore
        if len(options) == 1:
            options.append(f"not_{options[0]}")
        self.options = options

    def __call__(self, doc) -> Doc:
        default = {op: 0.0 for op in self.options}
        cats = [self.nlp.vocab.strings[s.id].split("/")[0] for s in doc.spans["sc"]]
        doc.cats = default | {k: 1.0 for k, _ in Counter(cats).items()}
        return doc
```

> [!NOTE]
> If `textcat` is in the pipeline and only one label is found, it will error out, hence the need for a _not_ option. If `textcat_multilabel` is used, a single category is fine.
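
The label logic of the component above can be followed without spaCy. The sketch below mirrors it on plain strings, with hypothetical function names; `sorted()` replaces `list()` only to make the order deterministic.

```py
def label_options(pattern_ids: list[str]) -> list[str]:
    """Collect top-level categories, adding a `not_` label if only one exists."""
    options = sorted({pid.split("/")[0] for pid in pattern_ids})
    if len(options) == 1:
        options.append(f"not_{options[0]}")
    return options


def cats_from_span_ids(span_ids: list[str], options: list[str]) -> dict[str, float]:
    """Default every option to 0.0, then mark each category found as 1.0."""
    found = {span_id.split("/")[0] for span_id in span_ids}
    return {op: 0.0 for op in options} | {cat: 1.0 for cat in found}


options = label_options(["political/bill_of_rights", "political/review"])
print(options)  # ['political', 'not_political']
print(cats_from_span_ids(["political/review"], options))
# {'political': 1.0, 'not_political': 0.0}
```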

## Prerequisites to lexcat-*

item | desc | `project.yml` declaration
-- | -- | --
db | sqlite database to fetch segments[^1] | `db_file`
corpus-assets | A folder to retrieve q.txt and patterns.json files | `patterns_dir`
corpus-preprocess | This toolkit | see usage in `/scripts/build.py` and `/scripts/extract.py`

[^1]: Although it might be better to allow segment access via lawsql's API.

## Installation of lexcat-*

Clone the above repos, then create and activate a virtual env and install `requirements.txt`:

```sh
python -m venv .venv && \
source .venv/bin/activate && \
python -m pip install -U pip && \
python -W ignore -m pip install -r requirements.txt && \
weasel run init
```

## lexcat-proj

- Results in a model trained on a specific **concept** category
- Need to adjust [project.yml](project.yml)'s **name**, **topic_dir**, and **total_segments** variables (`vars`).
- Running `weasel run all` produces packages/en_lex_`name`_`total_segments`-0.0.0/dist
- The output is based on _q.txt_ and _patterns.json_ files sourced from e.g. ../patterns/`topic_dir`.
- Alternatively, override these variables on the command line, e.g. `weasel run all . --vars.topic_dir <value> --vars.name <value>`
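
The package path in the bullets above is derived from the `name` and `total_segments` variables. As a small sketch, with `package_archive` being a hypothetical helper and the `.tar.gz` file name taken from the install example under "Packaged models":

```py
def package_archive(name: str, total_segments: int, version: str = "0.0.0") -> str:
    """Return the expected archive path produced by `weasel run all`."""
    package = f"en_lex_{name}_{total_segments}-{version}"
    return f"packages/{package}/dist/{package}.tar.gz"


print(package_archive("labor", 5000))
# packages/en_lex_labor_5000-0.0.0/dist/en_lex_labor_5000-0.0.0.tar.gz
```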

### broad implementation

topic | name | status
-- | -- | --
political | pol | ok
labor | labor | ok
criminal | crim | ok
civil | civ | ok
remedial | rem | -
commercial | com | -
ethics | eths | -

Example use on command line (note `.`)[^2]:

```sh
weasel run all . \
    --vars.topic_dir criminal \
    --vars.name crim \
    --vars.total_segments 5000
```

[^2]: The override of weasel project variables `vars` on the command line [requires tinkering](https://github.com/explosion/spaCy/issues/8818)

### granular implementation

topic | name | status
-- | -- | --
political/review | pol_rev | ok
political/sovereignty | pol_sov | ok
political/bill_of_rights | pol_bill | ok
political/administrative | pol_adm | ok

Example use on command line (note `.`):

```sh
weasel run all . \
    --vars.topic_dir political/administrative \
    --vars.name pol_adm \
    --vars.total_segments 250
```

## lexcat-multi

- Results in a model trained on all **concept** categories
- Each category's example files are found in assets
- Running `weasel run all` produces packages/en_`lexcat`-0.0.0/dist
- The output is based on _q.txt_ and _patterns.json_ files sourced from e.g. ../`patterns` (the parent directory)

## Models

There are two models to consider. Both will be created under `/training`.

### rule-based, weak supervision via keywords

1. The first model is a _rule-based_ temporary model.
2. Basic pipeline makes use of a tokenizer and SpanRuler to make adjustments to _doc.spans_.
3. The pipeline is applied to segments fetched from the database.
4. The model is built via the `scripts/build.py`.
5. The config of this model can be seen in `/training/{name}_ruler/config.cfg`
6. This model is used by `weasel run bin` to output `corpus/train.spacy`.

### lexcat, generate model to test on prodigy

1. The second model is the _statistical_ "training" `lexcat`.
2. It uses a separate `lexcat_proj/config.cfg` and is trained on `corpus/train.spacy` (the output of the _rule-based_ model).
3. The purpose of this model, found in `training/lexcat/model-best` after `weasel run train`, is to package it for later use.
4. This model becomes a weak supervision model that can be checked by human annotators later.

## Packaged models

Install via filepath, e.g.

```sh
pip install ../lexcat-proj/packages/en_lex_labor_5000-0.0.0/dist/en_lex_labor_5000-0.0.0.tar.gz # poetry add
```

This will enable:

```py
nlp = spacy.load('en_lex_labor_5000')
```

## Gotchas

### spacy

1. It is not possible to override [pre-defined models' tokenization](https://stackoverflow.com/a/66441287/9081369); custom tokenization can only be created with an empty model.

### weasel

1. See CLI overrides in [weasel, previously spacy projects](https://github.com/explosion/spaCy/issues/8818)
2. Too many warnings are emitted, hence the `-W ignore` option used when running Python command-line scripts.
3. Do not name a script or function file `tokenize.py`. It shadows the standard-library `tokenize` module (which `inspect` imports) and results in `AttributeError: partially initialized module 'inspect' has no attribute 'getmro' (most likely due to a circular import)`
4. Although weasel generates Markdown output from `project.yml`, it does not respect full Markdown formatting (e.g. headers, tables, enumerations) within `project.yml` fields like description, hence the need for this NOTES.md file as a supplement.
5. In creating Language.factories configs, using `Path` as a type results in `Fatal Python error: Segmentation fault`.

