readabs.splice

Priority splicing of mixed-frequency time series.

This module has two layers:

splice
    The core primitive. Deliberately source-agnostic: it takes pandas Series you have already fetched (by description, by ID, however you like) and splices them into one series. It knows nothing about the ABS, ships no static lookup table, and makes no guesses about which series belong together; that judgement stays with the caller.

select / select_one / select_and_splice
    A thin ABS-aware convenience layer over splice. Each resolves (data, meta, selector) sources to Series via readabs.find_abs_id (carrying each series' ABS unit on .attrs["unit"]), so the common case of splicing a few ABS series selected by description/frequency is one call, while select stays exposed for when you need a transform between selecting and splicing.

Splice design

Given an ordered list of segments (highest priority / most authoritative first), splice():

  1. align: put every segment on one common PeriodIndex. By default the grid is the finest frequency present, which dissolves anchor clashes (Q-NOV vs Q-DEC, A-JUN vs A-DEC) because every coarse period maps cleanly onto a finer one. Coarser segments are placed at their period-end; finer segments are aggregated down with agg. If the clashing anchors are themselves the finest frequency present, splice() raises and asks for a finer target (e.g. target="M").
  2. rebase: (opt-in; off by default) for each junction, multiplicatively scale the lower-priority segment so its level matches the running result over the overlapping date window (phase-agnostic; works even when two series never share an exact period). Falls back to a single junction point if there is no overlap, and flags it. Off by default because it transforms your data: nothing is silently rescaled unless you ask.

     Rebasing assumes ratio-scale inputs: series whose zero is meaningful and whose discrepancy between segments is proportional. Indexes (CPI, price/volume indices on different base periods) are the canonical case; a proportional benchmark revision of a count works too. It is wrong for series that cross zero (rates of change, balances, net flows) or whose segments differ by an additive offset rather than a scale factor; a non-positive or non-finite factor is caught and raises. With rebase=False (the default) the raw levels are coalesced as-is: if two same-unit segments already agree, rebasing only invents a discrepancy to "correct".
  3. coalesce: apply combine_first down the priority chain (take segment 1, fill gaps from segment 2, then 3, ...). The result keeps only the periods that actually carry data, so a coarse back-history stays sparse on a finer grid rather than being NaN-filled, and nothing is interpolated (pass fill= to densify).
  4. resample: (optional) resample the spliced result to a chosen output frequency/anchor.
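
The align, rebase and coalesce steps can be sketched in plain pandas. This is a minimal illustration under stated assumptions, not the module's implementation: the series names and all numbers are made up, and only standard pandas calls (`PeriodIndex.asfreq`, `Series.combine_first`) are used.

```python
import pandas as pd

# Illustrative inputs (made-up numbers): a quarterly index on an old base and
# a monthly index on a new base that overlaps the last quarter.
old_q = pd.Series(
    [50.0, 51.0, 52.0, 52.5],
    index=pd.period_range("2023Q1", periods=4, freq="Q-DEC"),
    name="old_q",
)
new_m = pd.Series(
    [104.0, 106.0, 108.0, 110.0],
    index=pd.period_range("2023-11", periods=4, freq="M"),
    name="new_m",
)

# 1. align: place each quarter at its period-end month on the monthly grid.
old_on_m = pd.Series(
    old_q.to_numpy(), index=old_q.index.asfreq("M", how="E"), name="old_q"
)

# 2. rebase (opt-in): ratio of mean levels over the overlapping date window.
lo = max(new_m.index.min(), old_on_m.index.min())
hi = min(new_m.index.max(), old_on_m.index.max())
factor = new_m.loc[lo:hi].mean() / old_on_m.loc[lo:hi].mean()

# 3. coalesce: the priority series wins; the rebased history fills the rest.
#    No reindexing onto a dense grid, so the old quarters stay sparse.
spliced = new_m.combine_first(old_on_m * factor).dropna().sort_index()
print(spliced)
```

The quarterly history contributes only its period-end months (March, June, September), so the result has seven observations rather than a dense twelve-month run.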

The returned join report makes every rebase factor and overlap visible, so a splice can be audited rather than trusted blindly.
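
One way to use the report for auditing is to filter it for junctions that deserve a second look. The sketch below builds a stand-in report by hand: the column names follow the rows `splice` assembles, but every value is invented for illustration, and the 2x tolerance is an arbitrary assumption rather than anything the module enforces.

```python
import pandas as pd

# Illustrative report only: column names match the splice() join report,
# the values are made up for this example.
report = pd.DataFrame(
    [
        {"segment": 1, "name": "indicator", "freq_in": "M", "method": "window",
         "overlap_n": 12, "window_start": "2024-01", "window_end": "2024-12",
         "factor": 1.08, "fills_from": "2022-07"},
        {"segment": 2, "name": "old_q", "freq_in": "Q-DEC", "method": "junction",
         "overlap_n": 0, "window_start": "", "window_end": "",
         "factor": 2.5, "fills_from": "1995-03"},
    ]
)

# Junctions rebased from a single point (no overlapping window) are the
# weakest joins, so list them first.
weak = report[report["method"] == "junction"]

# Hypothetical tolerance: flag factors more than 2x away from 1 either way.
big = report[(report["factor"] > 2.0) | (report["factor"] < 0.5)]

print(weak[["segment", "name", "factor"]])
print(big[["segment", "name", "factor"]])
```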

"""Priority splicing of mixed-frequency time series.

This module has two layers:

``splice``
    The core primitive.  Deliberately *source-agnostic*: it takes pandas Series
    you have already fetched (by description, by ID, however you like) and
    splices them into one series.  It knows nothing about the ABS, ships no
    static lookup table, and makes no guesses about which series belong together
    — that judgement stays with the caller.

``select`` / ``select_one`` / ``select_and_splice``
    A thin ABS-aware convenience layer over ``splice``.  Each resolves
    ``(data, meta, selector)`` sources to Series via ``readabs.find_abs_id``
    (carrying each series' ABS unit on ``.attrs["unit"]``), so the common case —
    splice a few ABS series selected by description/frequency — is one call,
    while ``select`` stays exposed for when you need a transform between
    selecting and splicing.

Splice design
-------------
Given an ordered list of segments (highest priority / most authoritative
first), :func:`splice`:

1. **align**   — put every segment on one common ``PeriodIndex``.  By default
                 the grid is the *finest* frequency present, which dissolves
                 anchor clashes (Q-NOV vs Q-DEC, A-JUN vs A-DEC) because every
                 coarse period maps cleanly onto a finer one.  Coarser segments
                 are placed at their period-*end*; finer segments are
                 aggregated down with ``agg``.  If the clashing anchors are
                 themselves the finest frequency present, ``splice`` raises and
                 asks for a finer ``target``.
2. **rebase**  — *(opt-in; off by default)* for each junction,
                 *multiplicatively* scale the lower-priority segment so its level
                 matches the running result over the *overlapping date window*
                 (phase-agnostic; works even when two series never share an exact
                 period).  Falls back to a single junction point if there is no
                 overlap, and flags it.  Off by default because it transforms
                 your data — nothing is silently rescaled unless you ask.

                 Rebasing assumes **ratio-scale** inputs — series whose zero is
                 meaningful and whose discrepancy between segments is
                 *proportional*.  Indexes (CPI, price/volume indices on different
                 base periods) are the canonical case; a proportional benchmark
                 revision of a count works too.  It is **wrong** for series that
                 cross zero (rates of change, balances, net flows) or whose
                 segments differ by an *additive* offset rather than a scale
                 factor — a non-positive or non-finite factor is caught and raises.
                 With ``rebase=False`` (the default) the raw levels are coalesced
                 as-is: if two same-unit segments already agree, rebasing only
                 invents a discrepancy to "correct".
3. **coalesce**— ``combine_first`` down the priority chain: take segment 1,
                 fill gaps from segment 2, then 3, ...  The result keeps only
                 the periods that actually carry data — a coarse back-history
                 stays sparse on a finer grid rather than being NaN-filled, and
                 nothing is interpolated (pass ``fill=`` to densify).
4. **resample**— (optional) resample the spliced result to a chosen output
                 frequency/anchor.

The returned join report makes every rebase factor and overlap visible, so a
splice can be audited rather than trusted blindly.
"""

from __future__ import annotations

import math
from collections.abc import Iterable, Sequence
from typing import Literal, cast

import pandas as pd
from pandas import DataFrame, PeriodIndex, Series

from readabs.search_abs_meta import find_abs_id  # used by the select() layer

# Frequency rank — higher number = finer frequency.
_FREQ_RANK: dict[str, int] = {"Y": 0, "A": 0, "Q": 1, "M": 2, "W": 3, "D": 4}


def _base(freqstr: str) -> str:
    """Return the base frequency character (``"Q-NOV"`` -> ``"Q"``, ``"A-JUN"`` -> ``"Y"``)."""
    char = freqstr.split("-", maxsplit=1)[0][0].upper()
    return "Y" if char == "A" else char


def _rank(freqstr: str) -> int:
    """Return the frequency rank for a PeriodIndex freq string."""
    return _FREQ_RANK[_base(freqstr)]


def _as_period_index(s: Series) -> Series:
    """Ensure *s* has a PeriodIndex; convert from DatetimeIndex if needed."""
    if isinstance(s.index, PeriodIndex):
        return s
    if isinstance(s.index, pd.DatetimeIndex):
        return s.set_axis(s.index.to_period())
    raise TypeError(f"Series '{s.name}' must have a PeriodIndex or DatetimeIndex, got {type(s.index).__name__}.")


def _pidx(s: Series) -> PeriodIndex:
    """Return *s*'s index as a (typed) PeriodIndex, converting if necessary."""
    return cast("PeriodIndex", _as_period_index(s).index)


def _pick_target(segments: Sequence[Series]) -> str:
    """Choose the default common-grid freq: the finest present.

    If two or more segments share the *finest* rank but with different anchors
    (e.g. ``Q-NOV`` and ``Q-DEC``) and there is nothing finer to splice them
    onto, raise — picking one anchor would silently reanchor the other and
    could assume wrong.  Resolve it by passing a finer ``target`` (e.g.
    ``"M"``), or by including a finer-frequency segment.
    """
    freqs = [str(_pidx(s).freqstr) for s in segments]
    ranks = [_rank(f) for f in freqs]
    top = max(ranks)
    top_freqs = {f for f, r in zip(freqs, ranks, strict=True) if r == top}
    if len(top_freqs) > 1:
        raise ValueError(
            f"Clashing anchors at the finest frequency: {sorted(top_freqs)}. "
            f"Pass a finer target (e.g. target='M') to splice them on a common grid."
        )
    return next(iter(top_freqs))


def _to_grid(s: Series, target: str, agg: str) -> Series:
    """Map *s* onto the *target* PeriodIndex frequency.

    Finer-than-target segments are aggregated down with *agg*; equal-or-coarser
    segments are placed at their period-end on the target grid.
    """
    s = _as_period_index(s).dropna()
    idx = cast("PeriodIndex", s.index)
    src = str(idx.freqstr)
    if _rank(src) > _rank(target):
        # finer -> coarser: aggregate the sub-periods that fall in each target period
        out = s.groupby(idx.asfreq(target)).agg(agg)
    elif _rank(src) == _rank(target) and _base(src) == _base(target) and src != target:
        # same frequency, different anchor (e.g. Q-NOV vs Q-DEC) — reanchoring
        # would silently shift every period, so refuse rather than assume.
        raise ValueError(
            f"Cannot place '{s.name}' ({src}) onto a {target} grid without reanchoring. "
            f"Use a finer target (e.g. target='M')."
        )
    else:
        # coarser (or identical) -> place each value at its period-end on the grid
        out = Series(s.to_numpy(), index=idx.asfreq(target, how="E"), name=s.name)
        out = out[~out.index.duplicated(keep="last")]
    return out.sort_index()


def _rebase_factor(
    result: Series, seg: Series
) -> tuple[float, str, int, pd.Period | None, pd.Period | None]:
    """Compute the factor to bring *seg* onto *result*'s level.

    Measured as the ratio of mean levels over the overlapping *date span*, so
    it is phase-agnostic — it works even when the two series share no exact
    period (e.g. Q-NOV vs Q-DEC mapped onto a monthly grid).  Falls back to a
    single junction point when the spans do not overlap at all.

    Returns ``(factor, method, overlap_n, window_start, window_end)``.
    """
    r, s = result.dropna(), seg.dropna()
    if len(r) and len(s):
        lo = max(r.index.min(), s.index.min())
        hi = min(r.index.max(), s.index.max())
        if lo <= hi:
            r_win, s_win = r.loc[lo:hi], s.loc[lo:hi]
            if len(r_win) and len(s_win) and s_win.mean():
                return float(r_win.mean() / s_win.mean()), "window", min(len(r_win), len(s_win)), lo, hi
    # No overlapping span — fall back to the nearest junction point.
    r0 = result.first_valid_index()
    if r0 is not None:
        before = s.loc[:r0]
        if len(before) and before.iloc[-1]:
            return float(result.loc[r0] / before.iloc[-1]), "junction", 0, None, None
    return 1.0, "none", 0, None, None


def splice(
    segments: Iterable[Series],
    *,
    target: str | None = None,
    rebase: bool = False,
    agg: str = "mean",
    output: str | None = None,
    fill: Literal["ffill", "interpolate"] | None = None,
    name: str | None = None,
) -> tuple[Series, DataFrame]:
    """Splice mixed-frequency *segments* into one series, highest priority first.

    Parameters
    ----------
    segments
        Ordered list of pandas Series (PeriodIndex or DatetimeIndex).  The
        first is highest priority: it wins where periods overlap and (when
        ``rebase`` is on) sets the level everything else is rebased to.
    target
        Common-grid frequency (e.g. ``"M"``, ``"Q-DEC"``).  Defaults to the
        finest frequency present; if segments at that finest frequency have
        clashing anchors, a ``ValueError`` is raised (pass a finer ``target``).
    rebase
        Off by default — segments are coalesced at their **raw** levels, with no
        silent transformation of your data.  Set ``True`` to *multiplicatively*
        rescale each lower-priority segment to the running result's level before
        coalescing.  Rebasing assumes **ratio-scale** inputs (meaningful zero,
        proportional discrepancy between segments) — splicing index series on
        different base periods (CPI, price/volume indices) is the case that
        needs it.  It is wrong for zero-crossing series (rates, balances) or
        additive level breaks, and it *invents* a correction when same-unit
        segments already agree — which is why it is opt-in.  A non-finite or
        non-positive factor raises.  See the module docstring's *rebase* step.
    agg
        Aggregator used when a segment is finer than the grid (or when
        downsampling to *output*).  ``"mean"`` for index levels; use ``"sum"``
        for flows.
    output
        Optional final frequency to resample the spliced result to.
    fill
        Optional gap fill.  By default (``None``) the result contains only the
        periods that actually have data — no NaN rows are inserted for the gaps
        a coarse segment leaves on a finer grid, and nothing is interpolated.
        ``"ffill"`` or ``"interpolate"`` densify the result onto the full grid
        first and then fill.
    name
        Name for the result series (defaults to the first segment's name).

    Returns
    -------
    tuple[Series, DataFrame]
        The spliced series and a one-row-per-junction report.

    """
    segments = list(segments)
    if not segments:
        raise ValueError("splice() needs at least one segment.")

    grid = target or _pick_target(segments)
    on_grid = [_to_grid(s, grid, agg) for s in segments]

    result = on_grid[0].copy()
    rows: list[dict[str, object]] = []
    for i, seg in enumerate(on_grid[1:], start=1):
        if rebase:
            factor, method, n, lo, hi = _rebase_factor(result, seg)
            # Multiplicative rebasing assumes ratio-scale inputs.  A non-finite
            # factor (near-zero denominator) or a non-positive one (the overlap
            # means have opposite signs, which would flip the back-history) means
            # the data is not ratio-scale — fail loud rather than ship it.  A
            # large *magnitude* is fine: a legitimate base-period difference can
            # need a 50x factor, so only sign and finiteness are guarded.
            if not (math.isfinite(factor) and factor > 0):
                raise ValueError(
                    f"splice: rebase factor for segment {i} ('{seg.name}') is {factor} over "
                    f"{lo}..{hi}. Multiplicative rebasing needs ratio-scale inputs (meaningful "
                    f"zero, proportional discrepancy); a non-finite or non-positive factor means "
                    f"the segments cross zero or differ additively. Pass rebase=False to coalesce "
                    f"raw levels instead."
                )
        else:
            factor, method, n, lo, hi = 1.0, "off", 0, None, None
        seg_rebased = seg * factor
        rows.append(
            {
                "segment": i,
                "name": str(seg.name),
                "freq_in": str(_pidx(segments[i]).freqstr),
                "method": method,
                "overlap_n": n,
                "window_start": str(lo) if lo is not None else "",
                "window_end": str(hi) if hi is not None else "",
                "factor": round(factor, 6),
                "fills_from": str(seg.dropna().index.min()),
            }
        )
        result = result.combine_first(seg_rebased)

    # By default keep only the periods that actually carry data: do NOT reindex
    # onto a dense grid (which would manufacture NaN for the gaps a coarse
    # back-history leaves on a finer grid) and do NOT interpolate.  A long-run
    # series therefore stays sparse where it is old and coarse, and plots as one
    # continuous line with no holes and no invented points.
    result = result.dropna().sort_index()

    if output and output != grid:
        result = _to_grid(result, output, agg).dropna().sort_index()
        grid = output

    if fill in ("ffill", "interpolate") and len(result):
        # Explicit opt-in: densify onto the full grid, then fill.
        full = pd.period_range(result.index.min(), result.index.max(), freq=grid)
        result = result.reindex(full)
        result = result.ffill() if fill == "ffill" else result.interpolate()

    result.name = name or str(segments[0].name)
    report = DataFrame(rows)
    return result, report


# A select_and_splice() source: the fetched data dict, its meta, and a
# {search_value: meta_column} selector (readabs' find_abs_id convention).
Source = tuple[dict[str, DataFrame], DataFrame, dict[str, str]]


def select_one(data: dict[str, DataFrame], meta: DataFrame, selector: dict[str, str]) -> Series:
    """Select the single Series for one ``(data, meta, selector)`` — the single-source wrapper.

    Convenience for the common one-selector case; equivalent to
    ``select([(data, meta, selector)])[0]``.  Returns the Series named by its
    Series ID, with its ABS unit on ``.attrs["unit"]``.
    """
    table, series_id, unit = find_abs_id(meta, selector, validate_unique=True)
    s = data[table][series_id].copy()
    s.name = series_id
    s.attrs["unit"] = str(unit)
    return s


def select(sources: Iterable[Source], *, require_same_units: bool = True) -> list[Series]:
    """Select a series for each ``(data, meta, selector)`` — the iterable in, iterable out.

    The composable selection primitive: takes the iterable of ``(data, meta,
    selector)`` sources and returns the matching list of Series, ready to hand to
    :func:`splice` (directly, or after a per-series transform).  Each selection
    goes through ``readabs.find_abs_id`` with ``validate_unique=True``, which
    de-duplicates on Series ID first — so a selector matching the same series in
    several tables resolves cleanly, while one matching two genuinely different
    series raises rather than guessing.

    Parameters
    ----------
    sources
        Iterable of ``(data, meta, selector)``:

        - ``data``   — ``dict[table_name, DataFrame]`` from ``read_abs_cat``.
        - ``meta``   — the matching metadata DataFrame.
        - ``selector`` — ``{search_value: meta_column}`` for ``find_abs_id``, e.g.
          ``{"Index Numbers ;  All groups CPI ;  Australia ;": mc.did,
          "Index Numbers": mc.unit, "Quarter": mc.freq}``.
    require_same_units
        If ``True`` (default) **raise** when the selected series do not all share
        the same ABS unit — units must cohere to be spliced.  Set ``False`` when
        you deliberately select different-unit series together (e.g. two counts
        and a rate that you will combine yourself).

    Returns
    -------
    list[Series]
        One Series per source, each named by its Series ID with its ABS unit in
        ``series.attrs["unit"]``.  Unpack it (``a, b = select([...])``), map a
        transform over it, or pass it straight to :func:`splice`.  A later
        transform drops the unit attr — correctly, since the unit is then no
        longer the ABS one.

    Raises
    ------
    ValueError
        If ``require_same_units`` and the selected series carry mixed units.

    """
    segments = [select_one(data, meta, selector) for data, meta, selector in sources]
    if require_same_units:
        units = [str(s.attrs.get("unit", "")) for s in segments]
        if len(set(units)) > 1:
            detail = ", ".join(f"{s.name}={u!r}" for s, u in zip(segments, units, strict=True))
            raise ValueError(
                f"select: selected series have mismatched units ({detail}). Pass "
                f"require_same_units=False to select different-unit series together."
            )
    return segments


def select_and_splice(
    sources: Iterable[Source],
    *,
    target: str | None = None,
    rebase: bool = False,
    agg: str = "mean",
    output: str | None = None,
    fill: Literal["ffill", "interpolate"] | None = None,
    name: str | None = None,
    require_same_units: bool = True,
) -> tuple[Series, str, DataFrame]:
    """Select one series per source and :func:`splice` them — the no-transform case.

    Sugar for ``splice(select(sources))`` with a unit guard.  When
    you need a transform *between* selecting and splicing (e.g. a growth rate),
    compose :func:`select` and :func:`splice` directly instead — that is the whole
    reason :func:`select` is exposed separately.

    Parameters
    ----------
    sources
        Ordered iterable of ``(data, meta, selector)``, **highest priority
        first** (same priority rule as :func:`splice`):

        - ``data``   — ``dict[table_name, DataFrame]`` from ``read_abs_cat``.
        - ``meta``   — the matching metadata DataFrame.
        - ``selector`` — ``{search_value: meta_column}`` for ``find_abs_id``,
          e.g. ``{"Index Numbers ;  All groups CPI ;  Australia ;": mc.did,
          "Index Numbers": mc.unit, "Quarter": mc.freq}``.  In the common case
          the only thing differing between two sources is the frequency, so a
          shared *base* selector composes with ``base | {"Quarter": mc.freq}``.
    target, rebase, agg, output, fill, name
        Passed straight through to :func:`splice`.
    require_same_units
        Forwarded to :func:`select`: if ``True`` (default) raise when the
        selected segments carry mixed units; ``False`` overrides (the result is
        then labelled with the highest-priority segment's unit).

    Returns
    -------
    tuple[Series, str, DataFrame]
        The spliced series, its unit (the highest-priority segment's unit), and
        the :func:`splice` join report, augmented with ``series_id`` and
        ``unit`` columns recording what each segment resolved to.

    """
    segments = select(sources, require_same_units=require_same_units)
    units = [str(s.attrs.get("unit", "")) for s in segments]

    result, report = splice(
        segments, target=target, rebase=rebase, agg=agg, output=output, fill=fill, name=name
    )
    # Audit trail: which Series ID / unit did each reported (lower-priority) segment use?
    if len(report):
        seg = [int(i) for i in report["segment"]]
        report.insert(1, "series_id", [str(segments[i].name) for i in seg])
        report.insert(2, "unit", [units[i] for i in seg])
    return result, units[0], report


# ---------------------------------------------------------------------------
# Self-tests — `python splice.py`
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    import numpy as np

    def _show(title: str, s: Series, rep: DataFrame) -> None:
        print(f"\n{'=' * 70}\n{title}\n{'=' * 70}")
        print(
            f"freq={cast('PeriodIndex', s.index).freqstr}  n={len(s)}  non-null={s.notna().sum()}  "
            f"range={s.index.min()}..{s.index.max()}"
        )
        if len(rep):
            print(rep.to_string(index=False))

    # --- Case 1: monthly (new) + quarterly (old), level shift via index rebase
    q = Series(
        np.arange(100, 100 + 4 * 20, dtype=float),  # 20 years quarterly, base ~100
        index=pd.period_range("2000Q1", periods=80, freq="Q-DEC"),
        name="cpi",
    )
    m = Series(
        np.arange(50.0, 50.0 + 60) * 0.5 + 130,  # monthly on a *different* base
        index=pd.period_range("2018-01", periods=60, freq="M"),
        name="cpi",
    )
    out, rep = splice([m, q], rebase=True)  # monthly priority, quarterly fills the back-history
    _show("Case 1 — M (priority) spliced with Q-DEC, auto-grid", out, rep)
    # 2017-12 is the last period filled from the rebased quarterly history;
    # from 2018-01 onwards the monthly (priority) series wins.
    print(
        f"check: rebased Q value at 2017-12 = {out.loc['2017-12']:.3f} "
        f"(monthly 2018-01 = {m.iloc[0]:.3f})"
    )

    # --- Case 2: the anchor clash — Q-NOV vs Q-DEC, overlapping in time
    q_dec = Series(
        np.arange(200.0, 200 + 40),
        index=pd.period_range("2010Q1", periods=40, freq="Q-DEC"),
        name="x",
    )
    q_nov = Series(
        np.arange(80.0, 80 + 60),  # 2000Q1..2014Q4 — overlaps q_dec over 2010-2014
        index=pd.period_range("2000Q1", periods=60, freq="Q-NOV"),
        name="x",
    )
    print(f"\n{'=' * 70}\nCase 2 — Q-DEC + Q-NOV anchor clash\n{'=' * 70}")
    try:
        splice([q_dec, q_nov])  # no target -> must refuse rather than reanchor
    except ValueError as exc:
        print(f"default (no target) correctly raised:\n  {exc}")
    out2, rep2 = splice([q_dec, q_nov], target="M", rebase=True)  # resolve on a common finer grid
    _show("Case 2b — same, resolved with target='M' (window rebase across anchors)", out2, rep2)

    # --- Case 3: daily + monthly.  Default grid is the finest present = D.
    d = Series(
        np.linspace(10, 12, 365),
        index=pd.period_range("2023-01-01", periods=365, freq="D"),
        name="rate",
    )
    mth = Series(
        np.linspace(12, 13, 18),  # 2023-07..2024-12 — overlaps the daily over 2023-H2
        index=pd.period_range("2023-07", periods=18, freq="M"),
        name="rate",
    )
    out3, rep3 = splice([d, mth])  # daily priority -> finest grid = D, monthly placed sparsely
    _show("Case 3 — D (priority) + M, default finest grid = D", out3, rep3)
    out3b, rep3b = splice([mth, d], target="M", agg="mean")  # explicitly ask for a monthly result
    _show("Case 3b — same data, target='M' so daily is aggregated down", out3b, rep3b)

    # --- Case 4: CPI-style 3-way chain (new monthly + indicator + quarterly)
    new_m = Series(np.arange(135.0, 135 + 12), index=pd.period_range("2024-01", periods=12, freq="M"), name="cpi")
    indic = Series(np.arange(120.0, 120 + 30), index=pd.period_range("2022-07", periods=30, freq="M"), name="cpi")
    old_q_index = pd.period_range("1995Q1", periods=120, freq="Q-DEC")
    old_q = Series(np.arange(40.0, 40 + 120), index=old_q_index, name="cpi")
    out4, rep4 = splice([new_m, indic, old_q], name="cpi_long", rebase=True)
    _show("Case 4 — 3-way: new monthly + indicator + quarterly", out4, rep4)
    print(
        f"\nfull series spans {out4.index.min()} .. {out4.index.max()}, "
        f"{out4.notna().sum()} observations present"
    )

    # --- Case 5: same, but ask for a clean quarterly output (downsample)
    out5, rep5 = splice([new_m, indic, old_q], output="Q-DEC", name="cpi_long_q", rebase=True)
    _show("Case 5 — same 3-way, resampled to a clean Q-DEC output", out5, rep5)

    print("\nAll cases ran.")
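
Case 2 of the self-tests relies on splice refusing to reanchor Q-NOV onto Q-DEC. The sketch below, in plain pandas and not part of the module (the variable names are illustrative), shows why the two anchors cannot share a quarterly grid yet sit cleanly on a monthly one: their period-end months never coincide.

```python
import pandas as pd

# Four quarters under each anchor. "2010Q1" is read relative to the anchor,
# so under Q-NOV the first quarter of fiscal 2010 ends in February 2010.
q_dec = pd.period_range("2010Q1", periods=4, freq="Q-DEC")
q_nov = pd.period_range("2010Q1", periods=4, freq="Q-NOV")

# Place every quarter at its period-end month on a monthly grid.
dec_months = q_dec.asfreq("M", how="E")
nov_months = q_nov.asfreq("M", how="E")

# The two sets of months are disjoint, so no value has to be shifted.
shared = dec_months.intersection(nov_months)

print([str(p) for p in dec_months])
print([str(p) for p in nov_months])
print(len(shared))
```

Because the months interleave rather than collide, a window-based rebase over the shared date span is the only way to compare levels, which is why the module measures the factor over the overlapping window instead of matching exact periods.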
def splice( segments: Iterable[pandas.Series], *, target: str | None = None, rebase: bool = False, agg: str = 'mean', output: str | None = None, fill: Literal['ffill', 'interpolate'] | None = None, name: str | None = None) -> tuple[pandas.Series, pandas.DataFrame]:
290        result = result.reindex(full)
291        result = result.ffill() if fill == "ffill" else result.interpolate()
292
293    result.name = name or str(segments[0].name)
294    report = DataFrame(rows)
295    return result, report
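
The junction loop in the listing can be sketched on toy data with plain pandas. This is a minimal sketch, not the module's implementation: the two series and their names are invented, both are assumed to be on one quarterly grid already, and the factor is simply the ratio of the two segments' means over the periods they share. That ratio is an assumed stand-in for the private `_rebase_factor` helper, whose windowing logic is not shown in this listing.

```python
import math

import pandas as pd

# Hypothetical segments, already on one quarterly grid (values are made up).
new_base = pd.Series(
    [100.0, 102.0, 104.0],
    index=pd.period_range("2021Q1", periods=3, freq="Q-DEC"),
    name="new_base",
)
old_base = pd.Series(
    [40.0, 45.0, 48.0, 50.0],
    index=pd.period_range("2020Q2", periods=4, freq="Q-DEC"),
    name="old_base",
)

# rebase=False: coalesce raw levels; the first (highest-priority) series
# wins wherever both have a value for the same period.
raw = new_base.combine_first(old_base).dropna().sort_index()

# rebase=True, simplified: scale the lower-priority segment by the ratio of
# means over the shared periods (an assumed stand-in for _rebase_factor).
shared = new_base.index.intersection(old_base.index)
factor = float(new_base.loc[shared].mean() / old_base.loc[shared].mean())

# Same guard as the listing: only sign and finiteness are checked, not size.
if not (math.isfinite(factor) and factor > 0):
    raise ValueError(f"rebase factor is {factor}; inputs are not ratio-scale")

rebased = new_base.combine_first(old_base * factor).dropna().sort_index()

print(raw.tolist())      # raw levels: a visible break at the junction
print(rebased.tolist())  # back-history lifted onto the new base
```

In this toy case the only shared period is 2021Q1, so the factor is 100 / 50 = 2 and the older values are doubled before they fill in the periods the newer series lacks.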

Splice mixed-frequency segments into one series, highest priority first.

Parameters

segments Ordered list of pandas Series (PeriodIndex or DatetimeIndex). The first is highest priority: it wins where periods overlap and (when rebase is on) sets the level everything else is rebased to.

target Common-grid frequency (e.g. "M", "Q-DEC"). Defaults to the finest frequency present (anchor clashes step one rank finer).

rebase Off by default: segments are coalesced at their raw levels, with no silent transformation of your data. Set True to multiplicatively rescale each lower-priority segment to the running result's level before coalescing. Rebasing assumes ratio-scale inputs (meaningful zero, proportional discrepancy between segments); splicing index series on different base periods (CPI, price/volume indices) is the case that needs it. It is wrong for zero-crossing series (rates, balances) or additive level breaks, and it invents a correction when same-unit segments already agree, which is why it is opt-in. A non-finite or non-positive factor raises. See the module docstring's rebase step.

agg Aggregator used when a segment is finer than the grid (or when downsampling to output). "mean" for index levels; use "sum" for flows.

output Optional final frequency to resample the spliced result to.

fill Optional gap fill. By default (None) the result contains only the periods that actually have data: no NaN rows are inserted for the gaps a coarse segment leaves on a finer grid, and nothing is interpolated. "ffill" or "interpolate" densify the result onto the full grid first and then fill.

name Name for the result series (defaults to the first segment's name).

Returns

tuple[Series, DataFrame] The spliced series and a one-row-per-junction report.
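
The effect of the fill options can be illustrated with plain pandas. The sketch below is an illustration under stated assumptions rather than a call into splice(): the sparse series is invented, and the densify step mirrors the `period_range` / `reindex` lines from the source listing.

```python
import pandas as pd

# Hypothetical spliced result on a quarterly grid: one old, coarse point
# followed by recent quarterly data (values are made up).
idx = pd.PeriodIndex(["2019Q4", "2020Q4", "2021Q1", "2021Q2"], freq="Q-DEC")
sparse = pd.Series([10.0, 14.0, 15.0, 16.0], index=idx, name="toy")

# fill=None: the result stays sparse, with no NaN rows for 2020Q1..2020Q3.
print(len(sparse))

# fill="ffill" / fill="interpolate": densify onto the full grid first ...
full = pd.period_range(sparse.index.min(), sparse.index.max(), freq="Q-DEC")
dense = sparse.reindex(full)

# ... then fill the gaps that densifying created.
forward_filled = dense.ffill()
interpolated = dense.interpolate()  # linear, treating periods as equally spaced

print(forward_filled.tolist())
print(interpolated.tolist())
```

Forward fill carries the 2019Q4 value across the three missing quarters, while linear interpolation draws a straight line from 10 to 14. Both invent values that were never observed, which is consistent with the documented default of leaving the series sparse.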

Source = tuple[dict[str, pandas.DataFrame], pandas.DataFrame, dict[str, str]]
def select_one( data: dict[str, pandas.DataFrame], meta: pandas.DataFrame, selector: dict[str, str]) -> pandas.Series:
def select_one(data: dict[str, DataFrame], meta: DataFrame, selector: dict[str, str]) -> Series:
    """Select the single Series for one ``(data, meta, selector)`` (the single-source wrapper).

    Convenience for the common one-selector case; equivalent to
    ``select([(data, meta, selector)])[0]``.  Returns the Series named by its
    Series ID, with its ABS unit on ``.attrs["unit"]``.
    """
    table, series_id, unit = find_abs_id(meta, selector, validate_unique=True)
    s = data[table][series_id].copy()
    s.name = series_id
    s.attrs["unit"] = str(unit)
    return s

Select the single Series for one (data, meta, selector) — the single-source wrapper.

Convenience for the common one-selector case; equivalent to select([(data, meta, selector)])[0]. Returns the Series named by its Series ID, with its ABS unit on .attrs["unit"].
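
What select_one returns can be sketched without the ABS lookup. In this minimal sketch the table name, series ID, unit and values are all invented, and the `(table, series_id, unit)` triple that `find_abs_id` resolves in the listing is hard-coded instead of looked up.

```python
import pandas as pd

# Hypothetical stand-in for the dict of tables that read_abs_cat returns.
data = {
    "table_1": pd.DataFrame(
        {"A0001": [1.0, 2.0, 3.0]},
        index=pd.period_range("2024Q1", periods=3, freq="Q-DEC"),
    )
}

# Hard-coded stand-in for the triple that find_abs_id resolves from
# (meta, selector) in the listing.
table, series_id, unit = "table_1", "A0001", "Index Numbers"

# The same three steps as the listing: copy the column, name it by its
# Series ID, and carry the unit on .attrs.
s = data[table][series_id].copy()
s.name = series_id
s.attrs["unit"] = str(unit)

# Because the column was copied, editing the result leaves `data` untouched.
s.iloc[0] = 999.0
print(s.attrs["unit"], data[table][series_id].iloc[0])
```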

def select( sources: Iterable[tuple[dict[str, pandas.DataFrame], pandas.DataFrame, dict[str, str]]], *, require_same_units: bool = True) -> list[pandas.Series]:
def select(sources: Iterable[Source], *, require_same_units: bool = True) -> list[Series]:
    """Select a series for each ``(data, meta, selector)``: iterable in, list out.

    The composable selection primitive: takes the iterable of ``(data, meta,
    selector)`` sources and returns the matching list of Series, ready to hand to
    :func:`splice` (directly, or after a per-series transform).  Each selection
    goes through ``readabs.find_abs_id`` with ``validate_unique=True``, which
    de-duplicates on Series ID first, so a selector matching the same series in
    several tables resolves cleanly, while one matching two genuinely different
    series raises rather than guessing.

    Parameters
    ----------
    sources
        Iterable of ``(data, meta, selector)``:

        - ``data``: ``dict[table_name, DataFrame]`` from ``read_abs_cat``.
        - ``meta``: the matching metadata DataFrame.
        - ``selector``: ``{search_value: meta_column}`` for ``find_abs_id``, e.g.
          ``{"Index Numbers ;  All groups CPI ;  Australia ;": mc.did,
          "Index Numbers": mc.unit, "Quarter": mc.freq}``.
    require_same_units
        If ``True`` (default) **raise** when the selected series do not all share
        the same ABS unit; units must cohere to be spliced.  Set ``False`` when
        you deliberately select different-unit series together (e.g. two counts
        and a rate that you will combine yourself).

    Returns
    -------
    list[Series]
        One Series per source, each named by its Series ID with its ABS unit in
        ``series.attrs["unit"]``.  Unpack it (``a, b = select([...])``), map a
        transform over it, or pass it straight to :func:`splice`.  A later
        transform drops the unit attr (correctly, since the unit is then no
        longer the ABS one).

    Raises
    ------
    ValueError
        If ``require_same_units`` and the selected series carry mixed units.

    """
    segments = [select_one(data, meta, selector) for data, meta, selector in sources]
    if require_same_units:
        units = [str(s.attrs.get("unit", "")) for s in segments]
        if len(set(units)) > 1:
            detail = ", ".join(f"{s.name}={u!r}" for s, u in zip(segments, units, strict=True))
            raise ValueError(
                f"select: selected series have mismatched units ({detail}). Pass "
                f"require_same_units=False to select different-unit series together."
            )
    return segments

Select a series for each (data, meta, selector): iterable in, list out.

The composable selection primitive: takes the iterable of (data, meta, selector) sources and returns the matching list of Series, ready to hand to splice() (directly, or after a per-series transform). Each selection goes through readabs.find_abs_id with validate_unique=True, which de-duplicates on Series ID first — so a selector matching the same series in several tables resolves cleanly, while one matching two genuinely different series raises rather than guessing.

Parameters

sources Iterable of (data, meta, selector):

- ``data``   — ``dict[table_name, DataFrame]`` from ``read_abs_cat``.
- ``meta``   — the matching metadata DataFrame.
- ``selector`` — ``{search_value: meta_column}`` for ``find_abs_id``, e.g.
  ``{"Index Numbers ;  All groups CPI ;  Australia ;": mc.did,
  "Index Numbers": mc.unit, "Quarter": mc.freq}``.

require_same_units If True (default) raise when the selected series do not all share the same ABS unit — units must cohere to be spliced. Set False when you deliberately select different-unit series together (e.g. two counts and a rate that you will combine yourself).

Returns

list[Series] One Series per source, each named by its Series ID with its ABS unit in series.attrs["unit"]. Unpack it (a, b = select([...])), map a transform over it, or pass it straight to splice(). A later transform drops the unit attr — correctly, since the unit is then no longer the ABS one.

Raises

ValueError If require_same_units and the selected series carry mixed units.
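
The unit guard can be sketched in isolation. The helper names `check_same_units` and `toy`, the series IDs and the unit strings below are all hypothetical; only the comparison of `.attrs["unit"]` values follows the source listing.

```python
import pandas as pd


def toy(name: str, unit: str) -> pd.Series:
    """Build a made-up series carrying a unit on .attrs, as select_one does."""
    s = pd.Series([1.0, 2.0], name=name)
    s.attrs["unit"] = unit
    return s


def check_same_units(segments: list[pd.Series]) -> str:
    """Return the shared unit, or raise ValueError if the units differ."""
    units = [str(s.attrs.get("unit", "")) for s in segments]
    if len(set(units)) > 1:
        detail = ", ".join(f"{s.name}={u!r}" for s, u in zip(segments, units))
        raise ValueError(f"mismatched units ({detail})")
    return units[0] if units else ""


count_a = toy("A1", "000")
count_b = toy("A2", "000")
rate = toy("A3", "Percent")

shared_unit = check_same_units([count_a, count_b])  # both are counts: passes

try:
    check_same_units([count_a, rate])  # a count and a rate: rejected
    mixed_error = None
except ValueError as exc:
    mixed_error = str(exc)

print(shared_unit)
print(mixed_error)
```

Raising by default is the conservative choice here: mixing a count with a rate is almost always a selection mistake, and the caller who really wants it can opt out explicitly with `require_same_units=False`.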

def select_and_splice( sources: Iterable[tuple[dict[str, pandas.DataFrame], pandas.DataFrame, dict[str, str]]], *, target: str | None = None, rebase: bool = False, agg: str = 'mean', output: str | None = None, fill: Literal['ffill', 'interpolate'] | None = None, name: str | None = None, require_same_units: bool = True) -> tuple[pandas.Series, str, pandas.DataFrame]:
def select_and_splice(
    sources: Iterable[Source],
    *,
    target: str | None = None,
    rebase: bool = False,
    agg: str = "mean",
    output: str | None = None,
    fill: Literal["ffill", "interpolate"] | None = None,
    name: str | None = None,
    require_same_units: bool = True,
) -> tuple[Series, str, DataFrame]:
    """Select one series per source and :func:`splice` them (the no-transform case).

    Sugar for ``splice(select(sources), ...)``: :func:`select` applies the unit
    guard, and the highest-priority segment's unit is returned with the result.
    When you need a transform *between* selecting and splicing (e.g. a growth
    rate), compose :func:`select` and :func:`splice` directly instead; that is
    the whole reason :func:`select` is exposed separately.

    Parameters
    ----------
    sources
        Ordered iterable of ``(data, meta, selector)``, **highest priority
        first** (same priority rule as :func:`splice`):

        - ``data``: ``dict[table_name, DataFrame]`` from ``read_abs_cat``.
        - ``meta``: the matching metadata DataFrame.
        - ``selector``: ``{search_value: meta_column}`` for ``find_abs_id``,
          e.g. ``{"Index Numbers ;  All groups CPI ;  Australia ;": mc.did,
          "Index Numbers": mc.unit, "Quarter": mc.freq}``.  In the common case
          the only thing differing between two sources is the frequency, so a
          shared *base* selector composes with ``base | {"Quarter": mc.freq}``.
    target, rebase, agg, output, fill, name
        Passed straight through to :func:`splice`.
    require_same_units
        Forwarded to :func:`select`: if ``True`` (default) raise when the
        selected segments carry mixed units; ``False`` overrides (the result is
        then labelled with the highest-priority segment's unit).

    Returns
    -------
    tuple[Series, str, DataFrame]
        The spliced series, its unit (the highest-priority segment's unit), and
        the :func:`splice` join report, augmented with ``series_id`` and
        ``unit`` columns recording what each segment resolved to.

    """
    segments = select(sources, require_same_units=require_same_units)
    units = [str(s.attrs.get("unit", "")) for s in segments]

    result, report = splice(
        segments, target=target, rebase=rebase, agg=agg, output=output, fill=fill, name=name
    )
    # Audit trail: which Series ID / unit did each reported (lower-priority) segment use?
    if len(report):
        seg = [int(i) for i in report["segment"]]
        report.insert(1, "series_id", [str(segments[i].name) for i in seg])
        report.insert(2, "unit", [units[i] for i in seg])
    return result, units[0], report
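
The audit-trail step at the end of the listing can be sketched on a hand-built report. The report rows, series IDs and units here are invented; the two `DataFrame.insert` calls mirror the listing and place the new columns directly after `segment`.

```python
import pandas as pd

# Hypothetical resolved segments, highest priority first (index 0 is never
# reported, because the report has one row per junction).
series_ids = ["A0001", "A0002", "A0003"]
units = ["Index Numbers", "Index Numbers", "Index Numbers"]

# Hand-built stand-in for the report that splice() returns.
report = pd.DataFrame(
    [
        {"segment": 1, "name": "A0002", "factor": 1.0},
        {"segment": 2, "name": "A0003", "factor": 1.0},
    ]
)

# Same augmentation as the listing: look up each reported segment by its
# position and insert the columns at fixed locations.
if len(report):
    seg = [int(i) for i in report["segment"]]
    report.insert(1, "series_id", [series_ids[i] for i in seg])
    report.insert(2, "unit", [units[i] for i in seg])

print(list(report.columns))
```

The `if len(report)` check matters when only one source is passed: there are no junctions, so the report is empty and has no `segment` column to index.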

Select one series per source and splice() them — the no-transform case.

Sugar for splice(select(sources), ...): select() applies the unit guard, and the highest-priority segment's unit is returned with the result. When you need a transform between selecting and splicing (e.g. a growth rate), compose select() and splice() directly instead; that is the whole reason select() is exposed separately.

Parameters

sources Ordered iterable of (data, meta, selector), highest priority first (same priority rule as splice()):

- ``data``   — ``dict[table_name, DataFrame]`` from ``read_abs_cat``.
- ``meta``   — the matching metadata DataFrame.
- ``selector`` — ``{search_value: meta_column}`` for ``find_abs_id``,
  e.g. ``{"Index Numbers ;  All groups CPI ;  Australia ;": mc.did,
  "Index Numbers": mc.unit, "Quarter": mc.freq}``.  In the common case
  the only thing differing between two sources is the frequency, so a
  shared *base* selector composes with ``base | {"Quarter": mc.freq}``.

target, rebase, agg, output, fill, name Passed straight through to splice().

require_same_units Forwarded to select(): if True (default) raise when the selected segments carry mixed units; False overrides (the result is then labelled with the highest-priority segment's unit).

Returns

tuple[Series, str, DataFrame] The spliced series, its unit (the highest-priority segment's unit), and the splice() join report, augmented with series_id and unit columns recording what each segment resolved to.
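
The `base | {...}` composition mentioned for `sources` relies on Python's dict union operator (Python 3.9+). A minimal sketch follows, with plain placeholder strings standing in for the `mc.did`, `mc.unit` and `mc.freq` column constants (their real values are not shown here) and `"Month"` assumed as a second frequency label.

```python
# Placeholder strings standing in for readabs' mc.did / mc.unit / mc.freq.
DID, UNIT, FREQ = "did", "unit", "freq"

# Everything the two sources have in common.
base = {
    "Index Numbers ;  All groups CPI ;  Australia ;": DID,
    "Index Numbers": UNIT,
}

# The union operator returns a new dict each time; `base` is not mutated.
monthly = base | {"Month": FREQ}  # "Month" is an assumed frequency label
quarterly = base | {"Quarter": FREQ}

# With data and meta from read_abs_cat, the ordered sources would then be
# [(data, meta, monthly), (data, meta, quarterly)], highest priority first.
selectors_in_priority_order = [monthly, quarterly]

print(len(base), len(monthly), len(quarterly))
```

Keeping the shared part in one dict means the two selectors cannot drift apart in their description or unit, which is the property the unit guard would otherwise have to catch at run time.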