readabs.splice
Priority splicing of mixed-frequency time series.
This module has two layers:
splice
The core primitive. Deliberately source-agnostic: it takes pandas Series
you have already fetched (by description, by ID, however you like) and
splices them into one series. It knows nothing about the ABS, ships no
static lookup table, and makes no guesses about which series belong together
— that judgement stays with the caller.
select / select_one / select_and_splice
A thin ABS-aware convenience layer over splice. Each resolves
(data, meta, selector) sources to Series via readabs.find_abs_id
(carrying each series' ABS unit on .attrs["unit"]), so the common case —
splice a few ABS series selected by description/frequency — is one call,
while select stays exposed for when you need a transform between
selecting and splicing.
Splice design
Given an ordered list of segments (highest priority / most authoritative
first), splice():
1. align: put every segment on one common PeriodIndex. By default the grid is the finest frequency present, which dissolves anchor clashes (Q-NOV vs Q-DEC, A-JUN vs A-DEC) because every coarse period maps cleanly onto a finer one. Coarser segments are placed at their period-end; finer segments are aggregated down with agg.
2. rebase: (opt-in; off by default) for each junction, multiplicatively scale the lower-priority segment so its level matches the running result over the overlapping date window (phase-agnostic; works even when two series never share an exact period). Falls back to a single junction point if there is no overlap, and flags it. Off by default because it transforms your data: nothing is silently rescaled unless you ask.
   Rebasing assumes ratio-scale inputs, that is, series whose zero is meaningful and whose discrepancy between segments is proportional. Indexes (CPI, price/volume indices on different base periods) are the canonical case; a proportional benchmark revision of a count works too. It is wrong for series that cross zero (rates of change, balances, net flows) or whose segments differ by an additive offset rather than a scale factor; a non-positive or non-finite factor is caught and raises. With rebase=False (the default) the raw levels are coalesced as-is: if two same-unit segments already agree, rebasing only invents a discrepancy to "correct".
3. coalesce: combine_first down the priority chain: take segment 1, fill gaps from segment 2, then 3, ... The result keeps only the periods that actually carry data. A coarse back-history stays sparse on a finer grid rather than being NaN-filled, and nothing is interpolated (pass fill= to densify).
4. resample: (optional) resample the spliced result to a chosen output frequency/anchor.
The returned join report makes every rebase factor and overlap visible, so a splice can be audited rather than trusted blindly.
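The align and coalesce steps above can be sketched with pandas alone. This is a minimal illustration under stated assumptions, not the module's implementation: the series names `q_dec` and `q_nov` and their values are made up, and rebasing is left off, as it is by default.

```python
import pandas as pd

# Two quarterly series with clashing anchors (illustrative values only).
q_dec = pd.Series(
    [200.0, 201.0],
    index=pd.period_range("2024Q1", periods=2, freq="Q-DEC"),
)
q_nov = pd.Series(
    [80.0, 81.0, 82.0],
    index=pd.period_range("2024Q1", periods=3, freq="Q-NOV"),
)

# align: place each quarterly value at its period-end month on a monthly grid.
dec_on_m = pd.Series(q_dec.to_numpy(), index=q_dec.index.asfreq("M", how="E"))
nov_on_m = pd.Series(q_nov.to_numpy(), index=q_nov.index.asfreq("M", how="E"))

# coalesce: the higher-priority series wins, the other only fills gaps.
spliced = dec_on_m.combine_first(nov_on_m)

# The result is sparse: only months that carry data appear, none are NaN.
print(spliced)
```

On the monthly grid the two series never land on the same month, so nothing is overwritten and the result simply interleaves the five observations.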
"""Priority splicing of mixed-frequency time series.

This module has two layers:

``splice``
    The core primitive. Deliberately *source-agnostic*: it takes pandas Series
    you have already fetched (by description, by ID, however you like) and
    splices them into one series. It knows nothing about the ABS, ships no
    static lookup table, and makes no guesses about which series belong together
    — that judgement stays with the caller.

``select`` / ``select_one`` / ``select_and_splice``
    A thin ABS-aware convenience layer over ``splice``. Each resolves
    ``(data, meta, selector)`` sources to Series via ``readabs.find_abs_id``
    (carrying each series' ABS unit on ``.attrs["unit"]``), so the common case —
    splice a few ABS series selected by description/frequency — is one call,
    while ``select`` stays exposed for when you need a transform between
    selecting and splicing.

Splice design
-------------
Given an ordered list of segments (highest priority / most authoritative
first), :func:`splice`:

1. **align** — put every segment on one common ``PeriodIndex``. By default
   the grid is the *finest* frequency present, which dissolves
   anchor clashes (Q-NOV vs Q-DEC, A-JUN vs A-DEC) because every
   coarse period maps cleanly onto a finer one. Coarser segments
   are placed at their period-*end*; finer segments are
   aggregated down with ``agg``.
2. **rebase** — *(opt-in; off by default)* for each junction,
   *multiplicatively* scale the lower-priority segment so its level
   matches the running result over the *overlapping date window*
   (phase-agnostic; works even when two series never share an exact
   period). Falls back to a single junction point if there is no
   overlap, and flags it. Off by default because it transforms
   your data — nothing is silently rescaled unless you ask.

   Rebasing assumes **ratio-scale** inputs — series whose zero is
   meaningful and whose discrepancy between segments is
   *proportional*. Indexes (CPI, price/volume indices on different
   base periods) are the canonical case; a proportional benchmark
   revision of a count works too. It is **wrong** for series that
   cross zero (rates of change, balances, net flows) or whose
   segments differ by an *additive* offset rather than a scale
   factor — a non-positive or non-finite factor is caught and raises.
   With ``rebase=False`` (the default) the raw levels are coalesced
   as-is: if two same-unit segments already agree, rebasing only
   invents a discrepancy to "correct".
3. **coalesce** — ``combine_first`` down the priority chain: take segment 1,
   fill gaps from segment 2, then 3, ... The result keeps only
   the periods that actually carry data — a coarse back-history
   stays sparse on a finer grid rather than being NaN-filled, and
   nothing is interpolated (pass ``fill=`` to densify).
4. **resample** — (optional) resample the spliced result to a chosen output
   frequency/anchor.

The returned join report makes every rebase factor and overlap visible, so a
splice can be audited rather than trusted blindly.
"""

from __future__ import annotations

import math
from collections.abc import Iterable, Sequence
from typing import Literal, cast

import pandas as pd
from pandas import DataFrame, PeriodIndex, Series

from readabs.search_abs_meta import find_abs_id  # used by the select() layer

# Frequency rank — higher number = finer frequency.
_FREQ_RANK: dict[str, int] = {"Y": 0, "A": 0, "Q": 1, "M": 2, "W": 3, "D": 4}


def _base(freqstr: str) -> str:
    """Return the base frequency character (``"Q-NOV"`` -> ``"Q"``, ``"A-JUN"`` -> ``"Y"``)."""
    char = freqstr.split("-", maxsplit=1)[0][0].upper()
    return "Y" if char == "A" else char


def _rank(freqstr: str) -> int:
    """Return the frequency rank for a PeriodIndex freq string."""
    return _FREQ_RANK[_base(freqstr)]


def _as_period_index(s: Series) -> Series:
    """Ensure *s* has a PeriodIndex; convert from DatetimeIndex if needed."""
    if isinstance(s.index, PeriodIndex):
        return s
    if isinstance(s.index, pd.DatetimeIndex):
        return s.set_axis(s.index.to_period())
    raise TypeError(f"Series '{s.name}' must have a PeriodIndex or DatetimeIndex, got {type(s.index).__name__}.")


def _pidx(s: Series) -> PeriodIndex:
    """Return *s*'s index as a (typed) PeriodIndex, converting if necessary."""
    return cast("PeriodIndex", _as_period_index(s).index)


def _pick_target(segments: Sequence[Series]) -> str:
    """Choose the default common-grid freq: the finest present.

    If two or more segments share the *finest* rank but with different anchors
    (e.g. ``Q-NOV`` and ``Q-DEC``) and there is nothing finer to splice them
    onto, raise — picking one anchor would silently reanchor the other and
    could assume wrong. Resolve it by passing a finer ``target`` (e.g.
    ``"M"``), or by including a finer-frequency segment.
    """
    freqs = [str(_pidx(s).freqstr) for s in segments]
    ranks = [_rank(f) for f in freqs]
    top = max(ranks)
    top_freqs = {f for f, r in zip(freqs, ranks, strict=True) if r == top}
    if len(top_freqs) > 1:
        raise ValueError(
            f"Clashing anchors at the finest frequency: {sorted(top_freqs)}. "
            f"Pass a finer target (e.g. target='M') to splice them on a common grid."
        )
    return next(iter(top_freqs))


def _to_grid(s: Series, target: str, agg: str) -> Series:
    """Map *s* onto the *target* PeriodIndex frequency.

    Finer-than-target segments are aggregated down with *agg*; equal-or-coarser
    segments are placed at their period-end on the target grid.
    """
    s = _as_period_index(s).dropna()
    idx = cast("PeriodIndex", s.index)
    src = str(idx.freqstr)
    if _rank(src) > _rank(target):
        # finer -> coarser: aggregate the sub-periods that fall in each target period
        out = s.groupby(idx.asfreq(target)).agg(agg)
    elif _rank(src) == _rank(target) and _base(src) == _base(target) and src != target:
        # same frequency, different anchor (e.g. Q-NOV vs Q-DEC) — reanchoring
        # would silently shift every period, so refuse rather than assume.
        raise ValueError(
            f"Cannot place '{s.name}' ({src}) onto a {target} grid without reanchoring. "
            f"Use a finer target (e.g. target='M')."
        )
    else:
        # coarser (or identical) -> place each value at its period-end on the grid
        out = Series(s.to_numpy(), index=idx.asfreq(target, how="E"), name=s.name)
    out = out[~out.index.duplicated(keep="last")]
    return out.sort_index()


def _rebase_factor(
    result: Series, seg: Series
) -> tuple[float, str, int, pd.Period | None, pd.Period | None]:
    """Compute the factor to bring *seg* onto *result*'s level.

    Measured as the ratio of mean levels over the overlapping *date span*, so
    it is phase-agnostic — it works even when the two series share no exact
    period (e.g. Q-NOV vs Q-DEC mapped onto a monthly grid). Falls back to a
    single junction point when the spans do not overlap at all.

    Returns ``(factor, method, overlap_n, window_start, window_end)``.
    """
    r, s = result.dropna(), seg.dropna()
    if len(r) and len(s):
        lo = max(r.index.min(), s.index.min())
        hi = min(r.index.max(), s.index.max())
        if lo <= hi:
            r_win, s_win = r.loc[lo:hi], s.loc[lo:hi]
            if len(r_win) and len(s_win) and s_win.mean():
                return float(r_win.mean() / s_win.mean()), "window", min(len(r_win), len(s_win)), lo, hi
    # No overlapping span — fall back to the nearest junction point.
    r0 = result.first_valid_index()
    if r0 is not None:
        before = s.loc[:r0]
        if len(before) and before.iloc[-1]:
            return float(result.loc[r0] / before.iloc[-1]), "junction", 0, None, None
    return 1.0, "none", 0, None, None


def splice(
    segments: Iterable[Series],
    *,
    target: str | None = None,
    rebase: bool = False,
    agg: str = "mean",
    output: str | None = None,
    fill: Literal["ffill", "interpolate"] | None = None,
    name: str | None = None,
) -> tuple[Series, DataFrame]:
    """Splice mixed-frequency *segments* into one series, highest priority first.

    Parameters
    ----------
    segments
        Ordered list of pandas Series (PeriodIndex or DatetimeIndex). The
        first is highest priority: it wins where periods overlap and (when
        ``rebase`` is on) sets the level everything else is rebased to.
    target
        Common-grid frequency (e.g. ``"M"``, ``"Q-DEC"``). Defaults to the
        finest frequency present. If segments at that finest frequency have
        clashing anchors (e.g. ``Q-NOV`` vs ``Q-DEC``), a ``ValueError`` is
        raised; pass a finer ``target`` such as ``"M"``.
    rebase
        Off by default — segments are coalesced at their **raw** levels, with no
        silent transformation of your data. Set ``True`` to *multiplicatively*
        rescale each lower-priority segment to the running result's level before
        coalescing. Rebasing assumes **ratio-scale** inputs (meaningful zero,
        proportional discrepancy between segments) — splicing index series on
        different base periods (CPI, price/volume indices) is the case that
        needs it. It is wrong for zero-crossing series (rates, balances) or
        additive level breaks, and it *invents* a correction when same-unit
        segments already agree — which is why it is opt-in. A non-finite or
        non-positive factor raises. See the module docstring's *rebase* step.
    agg
        Aggregator used when a segment is finer than the grid (or when
        downsampling to *output*). ``"mean"`` for index levels; use ``"sum"``
        for flows.
    output
        Optional final frequency to resample the spliced result to.
    fill
        Optional gap fill. By default (``None``) the result contains only the
        periods that actually have data — no NaN rows are inserted for the gaps
        a coarse segment leaves on a finer grid, and nothing is interpolated.
        ``"ffill"`` or ``"interpolate"`` densify the result onto the full grid
        first and then fill.
    name
        Name for the result series (defaults to the first segment's name).

    Returns
    -------
    tuple[Series, DataFrame]
        The spliced series and a one-row-per-junction report.

    """
    segments = list(segments)
    if not segments:
        raise ValueError("splice() needs at least one segment.")

    grid = target or _pick_target(segments)
    on_grid = [_to_grid(s, grid, agg) for s in segments]

    result = on_grid[0].copy()
    rows: list[dict[str, object]] = []
    for i, seg in enumerate(on_grid[1:], start=1):
        if rebase:
            factor, method, n, lo, hi = _rebase_factor(result, seg)
            # Multiplicative rebasing assumes ratio-scale inputs. A non-finite
            # factor (near-zero denominator) or a non-positive one (the overlap
            # means have opposite signs, which would flip the back-history) means
            # the data is not ratio-scale — fail loud rather than ship it. A
            # large *magnitude* is fine: a legitimate base-period difference can
            # need a 50x factor, so only sign and finiteness are guarded.
            if not (math.isfinite(factor) and factor > 0):
                raise ValueError(
                    f"splice: rebase factor for segment {i} ('{seg.name}') is {factor} over "
                    f"{lo}..{hi}. Multiplicative rebasing needs ratio-scale inputs (meaningful "
                    f"zero, proportional discrepancy); a non-finite or non-positive factor means "
                    f"the segments cross zero or differ additively. Pass rebase=False to coalesce "
                    f"raw levels instead."
                )
        else:
            factor, method, n, lo, hi = 1.0, "off", 0, None, None
        seg_rebased = seg * factor
        rows.append(
            {
                "segment": i,
                "name": str(seg.name),
                "freq_in": str(_pidx(segments[i]).freqstr),
                "method": method,
                "overlap_n": n,
                "window_start": str(lo) if lo is not None else "",
                "window_end": str(hi) if hi is not None else "",
                "factor": round(factor, 6),
                "fills_from": str(seg.dropna().index.min()),
            }
        )
        result = result.combine_first(seg_rebased)

    # By default keep only the periods that actually carry data: do NOT reindex
    # onto a dense grid (which would manufacture NaN for the gaps a coarse
    # back-history leaves on a finer grid) and do NOT interpolate. A long-run
    # series therefore stays sparse where it is old and coarse, and plots as one
    # continuous line with no holes and no invented points.
    result = result.dropna().sort_index()

    if output and output != grid:
        result = _to_grid(result, output, agg).dropna().sort_index()
        grid = output

    if fill in ("ffill", "interpolate") and len(result):
        # Explicit opt-in: densify onto the full grid, then fill.
        full = pd.period_range(result.index.min(), result.index.max(), freq=grid)
        result = result.reindex(full)
        result = result.ffill() if fill == "ffill" else result.interpolate()

    result.name = name or str(segments[0].name)
    report = DataFrame(rows)
    return result, report


# A select_and_splice() source: the fetched data dict, its meta, and a
# {search_value: meta_column} selector (readabs' find_abs_id convention).
Source = tuple[dict[str, DataFrame], DataFrame, dict[str, str]]


def select_one(data: dict[str, DataFrame], meta: DataFrame, selector: dict[str, str]) -> Series:
    """Select the single Series for one ``(data, meta, selector)`` — the single-source wrapper.

    Convenience for the common one-selector case; equivalent to
    ``select([(data, meta, selector)])[0]``. Returns the Series named by its
    Series ID, with its ABS unit on ``.attrs["unit"]``.
    """
    table, series_id, unit = find_abs_id(meta, selector, validate_unique=True)
    s = data[table][series_id].copy()
    s.name = series_id
    s.attrs["unit"] = str(unit)
    return s


def select(sources: Iterable[Source], *, require_same_units: bool = True) -> list[Series]:
    """Select a series for each ``(data, meta, selector)`` — the iterable in, iterable out.

    The composable selection primitive: takes the iterable of ``(data, meta,
    selector)`` sources and returns the matching list of Series, ready to hand to
    :func:`splice` (directly, or after a per-series transform). Each selection
    goes through ``readabs.find_abs_id`` with ``validate_unique=True``, which
    de-duplicates on Series ID first — so a selector matching the same series in
    several tables resolves cleanly, while one matching two genuinely different
    series raises rather than guessing.

    Parameters
    ----------
    sources
        Iterable of ``(data, meta, selector)``:

        - ``data`` — ``dict[table_name, DataFrame]`` from ``read_abs_cat``.
        - ``meta`` — the matching metadata DataFrame.
        - ``selector`` — ``{search_value: meta_column}`` for ``find_abs_id``, e.g.
          ``{"Index Numbers ; All groups CPI ; Australia ;": mc.did,
          "Index Numbers": mc.unit, "Quarter": mc.freq}``.
    require_same_units
        If ``True`` (default) **raise** when the selected series do not all share
        the same ABS unit — units must cohere to be spliced. Set ``False`` when
        you deliberately select different-unit series together (e.g. two counts
        and a rate that you will combine yourself).

    Returns
    -------
    list[Series]
        One Series per source, each named by its Series ID with its ABS unit in
        ``series.attrs["unit"]``. Unpack it (``a, b = select([...])``), map a
        transform over it, or pass it straight to :func:`splice`. A later
        transform drops the unit attr — correctly, since the unit is then no
        longer the ABS one.

    Raises
    ------
    ValueError
        If ``require_same_units`` and the selected series carry mixed units.

    """
    segments = [select_one(data, meta, selector) for data, meta, selector in sources]
    if require_same_units:
        units = [str(s.attrs.get("unit", "")) for s in segments]
        if len(set(units)) > 1:
            detail = ", ".join(f"{s.name}={u!r}" for s, u in zip(segments, units, strict=True))
            raise ValueError(
                f"select: selected series have mismatched units ({detail}). Pass "
                f"require_same_units=False to select different-unit series together."
            )
    return segments


def select_and_splice(
    sources: Iterable[Source],
    *,
    target: str | None = None,
    rebase: bool = False,
    agg: str = "mean",
    output: str | None = None,
    fill: Literal["ffill", "interpolate"] | None = None,
    name: str | None = None,
    require_same_units: bool = True,
) -> tuple[Series, str, DataFrame]:
    """Select one series per source and :func:`splice` them — the no-transform case.

    Sugar for ``splice(select(sources))`` with a unit guard. When
    you need a transform *between* selecting and splicing (e.g. a growth rate),
    compose :func:`select` and :func:`splice` directly instead — that is the whole
    reason :func:`select` is exposed separately.

    Parameters
    ----------
    sources
        Ordered iterable of ``(data, meta, selector)``, **highest priority
        first** (same priority rule as :func:`splice`):

        - ``data`` — ``dict[table_name, DataFrame]`` from ``read_abs_cat``.
        - ``meta`` — the matching metadata DataFrame.
        - ``selector`` — ``{search_value: meta_column}`` for ``find_abs_id``,
          e.g. ``{"Index Numbers ; All groups CPI ; Australia ;": mc.did,
          "Index Numbers": mc.unit, "Quarter": mc.freq}``. In the common case
          the only thing differing between two sources is the frequency, so a
          shared *base* selector composes with ``base | {"Quarter": mc.freq}``.
    target, rebase, agg, output, fill, name
        Passed straight through to :func:`splice`.
    require_same_units
        Forwarded to :func:`select`: if ``True`` (default) raise when the
        selected segments carry mixed units; ``False`` overrides (the result is
        then labelled with the highest-priority segment's unit).

    Returns
    -------
    tuple[Series, str, DataFrame]
        The spliced series, its unit (the highest-priority segment's unit), and
        the :func:`splice` join report, augmented with ``series_id`` and
        ``unit`` columns recording what each segment resolved to.

    """
    segments = select(sources, require_same_units=require_same_units)
    units = [str(s.attrs.get("unit", "")) for s in segments]

    result, report = splice(
        segments, target=target, rebase=rebase, agg=agg, output=output, fill=fill, name=name
    )
    # Audit trail: which Series ID / unit did each reported (lower-priority) segment use?
    if len(report):
        seg = [int(i) for i in report["segment"]]
        report.insert(1, "series_id", [str(segments[i].name) for i in seg])
        report.insert(2, "unit", [units[i] for i in seg])
    return result, units[0], report


# ---------------------------------------------------------------------------
# Self-tests — `python splice.py`
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    import numpy as np

    def _show(title: str, s: Series, rep: DataFrame) -> None:
        print(f"\n{'=' * 70}\n{title}\n{'=' * 70}")
        print(
            f"freq={cast('PeriodIndex', s.index).freqstr} n={len(s)} non-null={s.notna().sum()} "
            f"range={s.index.min()}..{s.index.max()}"
        )
        if len(rep):
            print(rep.to_string(index=False))

    # --- Case 1: monthly (new) + quarterly (old), level shift via index rebase
    q = Series(
        np.arange(100, 100 + 4 * 20, dtype=float),  # 20 years quarterly, base ~100
        index=pd.period_range("2000Q1", periods=80, freq="Q-DEC"),
        name="cpi",
    )
    m = Series(
        np.arange(50.0, 50.0 + 60) * 0.5 + 130,  # monthly on a *different* base
        index=pd.period_range("2018-01", periods=60, freq="M"),
        name="cpi",
    )
    out, rep = splice([m, q], rebase=True)  # monthly priority, quarterly fills the back-history
    _show("Case 1 — M (priority) spliced with Q-DEC, auto-grid", out, rep)
    print(
        f"check: rebased Q value at 2018-03 = {out.loc['2018-03']:.3f} "
        f"(monthly 2018-01 = {m.iloc[0]:.3f})"
    )

    # --- Case 2: the anchor clash — Q-NOV vs Q-DEC, overlapping in time
    q_dec = Series(
        np.arange(200.0, 200 + 40),
        index=pd.period_range("2010Q1", periods=40, freq="Q-DEC"),
        name="x",
    )
    q_nov = Series(
        np.arange(80.0, 80 + 60),  # 2000Q1..2014Q4 — overlaps q_dec over 2010-2014
        index=pd.period_range("2000Q1", periods=60, freq="Q-NOV"),
        name="x",
    )
    print(f"\n{'=' * 70}\nCase 2 — Q-DEC + Q-NOV anchor clash\n{'=' * 70}")
    try:
        splice([q_dec, q_nov])  # no target -> must refuse rather than reanchor
    except ValueError as exc:
        print(f"default (no target) correctly raised:\n {exc}")
    out2, rep2 = splice([q_dec, q_nov], target="M", rebase=True)  # resolve on a common finer grid
    _show("Case 2b — same, resolved with target='M' (window rebase across anchors)", out2, rep2)

    # --- Case 3: daily + monthly. Default grid is the finest present = D.
    d = Series(
        np.linspace(10, 12, 365),
        index=pd.period_range("2023-01-01", periods=365, freq="D"),
        name="rate",
    )
    mth = Series(
        np.linspace(12, 13, 18),  # 2023-07..2024-12 — overlaps the daily over 2023-H2
        index=pd.period_range("2023-07", periods=18, freq="M"),
        name="rate",
    )
    out3, rep3 = splice([d, mth])  # daily priority -> finest grid = D, monthly placed sparsely
    _show("Case 3 — D (priority) + M, default finest grid = D", out3, rep3)
    out3b, rep3b = splice([mth, d], target="M", agg="mean")  # explicitly ask for a monthly result
    _show("Case 3b — same data, target='M' so daily is aggregated down", out3b, rep3b)

    # --- Case 4: CPI-style 3-way chain (new monthly + indicator + quarterly)
    new_m = Series(np.arange(135.0, 135 + 12), index=pd.period_range("2024-01", periods=12, freq="M"), name="cpi")
    indic = Series(np.arange(120.0, 120 + 30), index=pd.period_range("2022-07", periods=30, freq="M"), name="cpi")
    old_q_index = pd.period_range("1995Q1", periods=120, freq="Q-DEC")
    old_q = Series(np.arange(40.0, 40 + 120), index=old_q_index, name="cpi")
    out4, rep4 = splice([new_m, indic, old_q], name="cpi_long", rebase=True)
    _show("Case 4 — 3-way: new monthly + indicator + quarterly", out4, rep4)
    print(
        f"\nfull series spans {out4.index.min()} .. {out4.index.max()}, "
        f"{out4.notna().sum()} observations present"
    )

    # --- Case 5: same, but ask for a clean quarterly output (downsample)
    out5, rep5 = splice([new_m, indic, old_q], output="Q-DEC", name="cpi_long_q", rebase=True)
    _show("Case 5 — same 3-way, resampled to a clean Q-DEC output", out5, rep5)

    print("\nAll cases ran.")
def splice(
    segments: Iterable[Series],
    *,
    target: str | None = None,
    rebase: bool = False,
    agg: str = "mean",
    output: str | None = None,
    fill: Literal["ffill", "interpolate"] | None = None,
    name: str | None = None,
) -> tuple[Series, DataFrame]:
    """Splice mixed-frequency *segments* into one series, highest priority first.

    Parameters
    ----------
    segments
        Ordered list of pandas Series (PeriodIndex or DatetimeIndex). The
        first is highest priority: it wins where periods overlap and (when
        ``rebase`` is on) sets the level everything else is rebased to.
    target
        Common-grid frequency (e.g. ``"M"``, ``"Q-DEC"``). Defaults to the
        finest frequency present. If segments at that finest frequency have
        clashing anchors (e.g. ``Q-NOV`` vs ``Q-DEC``), a ``ValueError`` is
        raised; pass a finer ``target`` such as ``"M"``.
    rebase
        Off by default — segments are coalesced at their **raw** levels, with no
        silent transformation of your data. Set ``True`` to *multiplicatively*
        rescale each lower-priority segment to the running result's level before
        coalescing. Rebasing assumes **ratio-scale** inputs (meaningful zero,
        proportional discrepancy between segments) — splicing index series on
        different base periods (CPI, price/volume indices) is the case that
        needs it. It is wrong for zero-crossing series (rates, balances) or
        additive level breaks, and it *invents* a correction when same-unit
        segments already agree — which is why it is opt-in. A non-finite or
        non-positive factor raises. See the module docstring's *rebase* step.
    agg
        Aggregator used when a segment is finer than the grid (or when
        downsampling to *output*). ``"mean"`` for index levels; use ``"sum"``
        for flows.
    output
        Optional final frequency to resample the spliced result to.
    fill
        Optional gap fill. By default (``None``) the result contains only the
        periods that actually have data — no NaN rows are inserted for the gaps
        a coarse segment leaves on a finer grid, and nothing is interpolated.
        ``"ffill"`` or ``"interpolate"`` densify the result onto the full grid
        first and then fill.
    name
        Name for the result series (defaults to the first segment's name).

    Returns
    -------
    tuple[Series, DataFrame]
        The spliced series and a one-row-per-junction report.

    """
    segments = list(segments)
    if not segments:
        raise ValueError("splice() needs at least one segment.")

    grid = target or _pick_target(segments)
    on_grid = [_to_grid(s, grid, agg) for s in segments]

    result = on_grid[0].copy()
    rows: list[dict[str, object]] = []
    for i, seg in enumerate(on_grid[1:], start=1):
        if rebase:
            factor, method, n, lo, hi = _rebase_factor(result, seg)
            # Multiplicative rebasing assumes ratio-scale inputs. A non-finite
            # factor (near-zero denominator) or a non-positive one (the overlap
            # means have opposite signs, which would flip the back-history) means
            # the data is not ratio-scale — fail loud rather than ship it. A
            # large *magnitude* is fine: a legitimate base-period difference can
            # need a 50x factor, so only sign and finiteness are guarded.
            if not (math.isfinite(factor) and factor > 0):
                raise ValueError(
                    f"splice: rebase factor for segment {i} ('{seg.name}') is {factor} over "
                    f"{lo}..{hi}. Multiplicative rebasing needs ratio-scale inputs (meaningful "
                    f"zero, proportional discrepancy); a non-finite or non-positive factor means "
                    f"the segments cross zero or differ additively. Pass rebase=False to coalesce "
                    f"raw levels instead."
                )
        else:
            factor, method, n, lo, hi = 1.0, "off", 0, None, None
        seg_rebased = seg * factor
        rows.append(
            {
                "segment": i,
                "name": str(seg.name),
                "freq_in": str(_pidx(segments[i]).freqstr),
                "method": method,
                "overlap_n": n,
                "window_start": str(lo) if lo is not None else "",
                "window_end": str(hi) if hi is not None else "",
                "factor": round(factor, 6),
                "fills_from": str(seg.dropna().index.min()),
            }
        )
        result = result.combine_first(seg_rebased)

    # By default keep only the periods that actually carry data: do NOT reindex
    # onto a dense grid (which would manufacture NaN for the gaps a coarse
    # back-history leaves on a finer grid) and do NOT interpolate. A long-run
    # series therefore stays sparse where it is old and coarse, and plots as one
    # continuous line with no holes and no invented points.
    result = result.dropna().sort_index()

    if output and output != grid:
        result = _to_grid(result, output, agg).dropna().sort_index()
        grid = output

    if fill in ("ffill", "interpolate") and len(result):
        # Explicit opt-in: densify onto the full grid, then fill.
        full = pd.period_range(result.index.min(), result.index.max(), freq=grid)
        result = result.reindex(full)
        result = result.ffill() if fill == "ffill" else result.interpolate()

    result.name = name or str(segments[0].name)
    report = DataFrame(rows)
    return result, report
Splice mixed-frequency segments into one series, highest priority first.
Parameters
segments
Ordered list of pandas Series (PeriodIndex or DatetimeIndex). The
first is highest priority: it wins where periods overlap and (when
rebase is on) sets the level everything else is rebased to.
target
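Common-grid frequency (e.g. "M", "Q-DEC"). Defaults to the
finest frequency present. If segments at that finest frequency have
clashing anchors (e.g. Q-NOV vs Q-DEC), a ValueError is raised; pass a
finer target such as "M".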
rebase
Off by default — segments are coalesced at their raw levels, with no
silent transformation of your data. Set True to multiplicatively
rescale each lower-priority segment to the running result's level before
coalescing. Rebasing assumes ratio-scale inputs (meaningful zero,
proportional discrepancy between segments) — splicing index series on
different base periods (CPI, price/volume indices) is the case that
needs it. It is wrong for zero-crossing series (rates, balances) or
additive level breaks, and it invents a correction when same-unit
segments already agree — which is why it is opt-in. A non-finite or
non-positive factor raises. See the module docstring's rebase step.
agg
Aggregator used when a segment is finer than the grid (or when
downsampling to output). "mean" for index levels; use "sum"
for flows.
output
Optional final frequency to resample the spliced result to.
fill
Optional gap fill. By default (None) the result contains only the
periods that actually have data — no NaN rows are inserted for the gaps
a coarse segment leaves on a finer grid, and nothing is interpolated.
"ffill" or "interpolate" densify the result onto the full grid
first and then fill.
name
Name for the result series (defaults to the first segment's name).
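As a rough illustration of what rebase=True asks for, the sketch below rescales a lower-priority segment multiplicatively so that its mean over the shared periods matches the higher-priority series. It is not the module's implementation: the helper name rebase_to is hypothetical, the values are made up, and taking the ratio of means over exactly shared periods is a simplifying assumption (the module describes a date-window overlap that also copes with series sharing no exact period).

```python
import math

import pandas as pd


def rebase_to(reference: pd.Series, segment: pd.Series) -> tuple[pd.Series, float]:
    """Scale `segment` so its mean over the shared periods matches `reference`."""
    shared = reference.dropna().index.intersection(segment.dropna().index)
    if len(shared) == 0:
        raise ValueError("no overlapping periods to rebase on")
    factor = float(reference.loc[shared].mean() / segment.loc[shared].mean())
    # A ratio-scale series should never yield a non-positive or non-finite factor.
    if not math.isfinite(factor) or factor <= 0:
        raise ValueError(f"invalid rebase factor: {factor}")
    return segment * factor, factor


# Two index series on different base periods, overlapping in 2020Q1 and 2020Q2.
new = pd.Series([100.0, 102.0, 104.0], index=pd.period_range("2020Q1", periods=3, freq="Q"))
old = pd.Series([45.0, 50.0, 51.0], index=pd.period_range("2019Q4", periods=3, freq="Q"))

old_rebased, factor = rebase_to(new, old)
# The higher-priority series wins where both have data; the rebased one fills the rest.
spliced = new.combine_first(old_rebased)
```

Here the older series sits at half the level of the newer one, so the factor comes out at 2 and only 2019Q4 is taken from the rebased history. A series that crosses zero would make this ratio meaningless, which is why such a factor is rejected rather than applied.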
Returns
tuple[Series, DataFrame]
The spliced series and a one-row-per-junction report.
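The priority and fill rules can be seen in miniature with plain pandas. The sketch below assumes the coarse back-history has already been placed at period-end months on a monthly grid; the series names and values are invented for illustration, and only combine_first, dropna, reindex and ffill (the calls visible in the source listing) are used.

```python
import pandas as pd

# Higher-priority monthly series, plus a coarser back-history already placed at
# period-end months on the monthly grid (values are made up for illustration).
monthly = pd.Series([10.0, 11.0, 12.0], index=pd.period_range("2021-01", periods=3, freq="M"))
history = pd.Series(
    [7.0, 8.0, 9.5],
    index=pd.PeriodIndex(["2019-12", "2020-12", "2021-01"], freq="M"),
)

# Default behaviour: coalesce by priority and keep only periods that carry data.
sparse = monthly.combine_first(history).dropna().sort_index()

# Rough equivalent of fill="ffill": densify onto the full grid, then forward-fill.
full = pd.period_range(sparse.index.min(), sparse.index.max(), freq="M")
dense = sparse.reindex(full).ffill()
```

The sparse result has five rows and keeps the monthly value where both series report 2021-01; the dense result covers every month from 2019-12 to 2021-03, with the annual values carried forward into the gaps.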
def select_one(data: dict[str, DataFrame], meta: DataFrame, selector: dict[str, str]) -> Series:
    """Select the single Series for one ``(data, meta, selector)`` — the single-source wrapper.

    Convenience for the common one-selector case; equivalent to
    ``select([(data, meta, selector)])[0]``. Returns the Series named by its
    Series ID, with its ABS unit on ``.attrs["unit"]``.
    """
    table, series_id, unit = find_abs_id(meta, selector, validate_unique=True)
    s = data[table][series_id].copy()
    s.name = series_id
    s.attrs["unit"] = str(unit)
    return s
Select the single Series for one (data, meta, selector) — the single-source wrapper.
Convenience for the common one-selector case; equivalent to
select([(data, meta, selector)])[0]. Returns the Series named by its
Series ID, with its ABS unit on .attrs["unit"].
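To make the shape of the returned Series concrete, here is a small sketch of the copy, rename and tag steps on toy data. The metadata lookup is deliberately stubbed out: the table name, Series ID, unit and values below are all hypothetical stand-ins for whatever the real lookup would resolve to.

```python
import pandas as pd

# Toy stand-in for the dict of tables returned by read_abs_cat (made-up values).
idx = pd.period_range("2023Q1", periods=3, freq="Q")
data = {"TABLE_1": pd.DataFrame({"A0000001X": [130.0, 131.5, 133.0]}, index=idx)}

# Pretend the metadata lookup resolved to this (table, series_id, unit) triple.
table, series_id, unit = "TABLE_1", "A0000001X", "Index Numbers"

s = data[table][series_id].copy()  # work on a copy rather than the table's column
s.name = series_id
s.attrs["unit"] = str(unit)
```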
def select(sources: Iterable[Source], *, require_same_units: bool = True) -> list[Series]:
    """Select a series for each ``(data, meta, selector)`` — the iterable in, iterable out.

    The composable selection primitive: takes the iterable of ``(data, meta,
    selector)`` sources and returns the matching list of Series, ready to hand to
    :func:`splice` (directly, or after a per-series transform). Each selection
    goes through ``readabs.find_abs_id`` with ``validate_unique=True``, which
    de-duplicates on Series ID first — so a selector matching the same series in
    several tables resolves cleanly, while one matching two genuinely different
    series raises rather than guessing.

    Parameters
    ----------
    sources
        Iterable of ``(data, meta, selector)``:

        - ``data`` — ``dict[table_name, DataFrame]`` from ``read_abs_cat``.
        - ``meta`` — the matching metadata DataFrame.
        - ``selector`` — ``{search_value: meta_column}`` for ``find_abs_id``, e.g.
          ``{"Index Numbers ; All groups CPI ; Australia ;": mc.did,
          "Index Numbers": mc.unit, "Quarter": mc.freq}``.
    require_same_units
        If ``True`` (default) **raise** when the selected series do not all share
        the same ABS unit — units must cohere to be spliced. Set ``False`` when
        you deliberately select different-unit series together (e.g. two counts
        and a rate that you will combine yourself).

    Returns
    -------
    list[Series]
        One Series per source, each named by its Series ID with its ABS unit in
        ``series.attrs["unit"]``. Unpack it (``a, b = select([...])``), map a
        transform over it, or pass it straight to :func:`splice`. A later
        transform drops the unit attr — correctly, since the unit is then no
        longer the ABS one.

    Raises
    ------
    ValueError
        If ``require_same_units`` and the selected series carry mixed units.

    """
    segments = [select_one(data, meta, selector) for data, meta, selector in sources]
    if require_same_units:
        units = [str(s.attrs.get("unit", "")) for s in segments]
        if len(set(units)) > 1:
            detail = ", ".join(f"{s.name}={u!r}" for s, u in zip(segments, units, strict=True))
            raise ValueError(
                f"select: selected series have mismatched units ({detail}). Pass "
                f"require_same_units=False to select different-unit series together."
            )
    return segments
Select a series for each (data, meta, selector) — the iterable in, iterable out.
The composable selection primitive: takes the iterable of (data, meta,
selector) sources and returns the matching list of Series, ready to hand to
splice() (directly, or after a per-series transform). Each selection
goes through readabs.find_abs_id with validate_unique=True, which
de-duplicates on Series ID first — so a selector matching the same series in
several tables resolves cleanly, while one matching two genuinely different
series raises rather than guessing.
Parameters
sources
Iterable of (data, meta, selector):
- ``data`` — ``dict[table_name, DataFrame]`` from ``read_abs_cat``.
- ``meta`` — the matching metadata DataFrame.
- ``selector`` — ``{search_value: meta_column}`` for ``find_abs_id``, e.g.
``{"Index Numbers ; All groups CPI ; Australia ;": mc.did,
"Index Numbers": mc.unit, "Quarter": mc.freq}``.
require_same_units
If True (default) raise when the selected series do not all share
the same ABS unit — units must cohere to be spliced. Set False when
you deliberately select different-unit series together (e.g. two counts
and a rate that you will combine yourself).
Returns
list[Series]
One Series per source, each named by its Series ID with its ABS unit in
series.attrs["unit"]. Unpack it (a, b = select([...])), map a
transform over it, or pass it straight to splice(). A later
transform drops the unit attr — correctly, since the unit is then no
longer the ABS one.
Raises
ValueError
If require_same_units and the selected series carry mixed units.
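The unit guard amounts to comparing the attrs["unit"] values of the selected series. The following standalone sketch mirrors that check on hand-built Series; the helper names check_same_units and make, the Series names and the unit strings are all hypothetical and chosen only for illustration.

```python
import pandas as pd


def check_same_units(segments: list[pd.Series]) -> None:
    """Raise ValueError if the segments' attrs["unit"] values differ."""
    units = [str(s.attrs.get("unit", "")) for s in segments]
    if len(set(units)) > 1:
        detail = ", ".join(f"{s.name}={u!r}" for s, u in zip(segments, units))
        raise ValueError(f"mismatched units ({detail})")


def make(name: str, unit: str) -> pd.Series:
    """Build a tiny Series tagged with a unit, standing in for a selected series."""
    s = pd.Series([1.0, 2.0], name=name)
    s.attrs["unit"] = unit
    return s


count_a = make("A1", "000")
count_b = make("A2", "000")
rate = make("A3", "Percent")

check_same_units([count_a, count_b])  # same unit: passes silently

try:
    check_same_units([count_a, rate])
    mixed_error = None
except ValueError as exc:
    mixed_error = str(exc)  # names each series alongside its unit
```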
def select_and_splice(
    sources: Iterable[Source],
    *,
    target: str | None = None,
    rebase: bool = False,
    agg: str = "mean",
    output: str | None = None,
    fill: Literal["ffill", "interpolate"] | None = None,
    name: str | None = None,
    require_same_units: bool = True,
) -> tuple[Series, str, DataFrame]:
    """Select one series per source and :func:`splice` them — the no-transform case.

    Sugar for ``splice([select_one(*src) for src in sources])`` with a unit guard. When
    you need a transform *between* selecting and splicing (e.g. a growth rate),
    compose :func:`select` and :func:`splice` directly instead — that is the whole
    reason :func:`select` is exposed separately.

    Parameters
    ----------
    sources
        Ordered iterable of ``(data, meta, selector)``, **highest priority
        first** (same priority rule as :func:`splice`):

        - ``data`` — ``dict[table_name, DataFrame]`` from ``read_abs_cat``.
        - ``meta`` — the matching metadata DataFrame.
        - ``selector`` — ``{search_value: meta_column}`` for ``find_abs_id``,
          e.g. ``{"Index Numbers ; All groups CPI ; Australia ;": mc.did,
          "Index Numbers": mc.unit, "Quarter": mc.freq}``. In the common case
          the only thing differing between two sources is the frequency, so a
          shared *base* selector composes with ``base | {"Quarter": mc.freq}``.
    target, rebase, agg, output, fill, name
        Passed straight through to :func:`splice`.
    require_same_units
        Forwarded to :func:`select`: if ``True`` (default) raise when the
        selected segments carry mixed units; ``False`` overrides (the result is
        then labelled with the highest-priority segment's unit).

    Returns
    -------
    tuple[Series, str, DataFrame]
        The spliced series, its unit (the highest-priority segment's unit), and
        the :func:`splice` join report, augmented with ``series_id`` and
        ``unit`` columns recording what each segment resolved to.

    """
    segments = select(sources, require_same_units=require_same_units)
    units = [str(s.attrs.get("unit", "")) for s in segments]

    result, report = splice(
        segments, target=target, rebase=rebase, agg=agg, output=output, fill=fill, name=name
    )
    # Audit trail: which Series ID / unit did each reported (lower-priority) segment use?
    if len(report):
        seg = [int(i) for i in report["segment"]]
        report.insert(1, "series_id", [str(segments[i].name) for i in seg])
        report.insert(2, "unit", [units[i] for i in seg])
    return result, units[0], report
Select one series per source and splice() them — the no-transform case.
Sugar for splice([select_one(*src) for src in sources]) with a unit guard. When
you need a transform between selecting and splicing (e.g. a growth rate),
compose select() and splice() directly instead — that is the whole
reason select() is exposed separately.
Parameters
sources
Ordered iterable of (data, meta, selector), highest priority
first (same priority rule as splice()):
- ``data`` — ``dict[table_name, DataFrame]`` from ``read_abs_cat``.
- ``meta`` — the matching metadata DataFrame.
- ``selector`` — ``{search_value: meta_column}`` for ``find_abs_id``,
e.g. ``{"Index Numbers ; All groups CPI ; Australia ;": mc.did,
"Index Numbers": mc.unit, "Quarter": mc.freq}``. In the common case
the only thing differing between two sources is the frequency, so a
shared *base* selector composes with ``base | {"Quarter": mc.freq}``.
target, rebase, agg, output, fill, name
Passed straight through to splice().
require_same_units
Forwarded to select(): if True (default) raise when the
selected segments carry mixed units; False overrides (the result is
then labelled with the highest-priority segment's unit).
Returns
tuple[Series, str, DataFrame]
The spliced series, its unit (the highest-priority segment's unit), and
the splice() join report, augmented with series_id and
unit columns recording what each segment resolved to.
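The shared-base selector pattern mentioned under sources relies on the dict union operator (Python 3.9+). A minimal sketch follows; the three column-name constants are hypothetical stand-ins for the mc.did, mc.unit and mc.freq metadata columns, and "Month" as a frequency value is an assumption made for the example.

```python
# Hypothetical stand-ins for the metadata column names (mc.did, mc.unit, mc.freq).
DID, UNIT, FREQ = "Data Item Description", "Unit", "Freq."

# Everything the two sources have in common goes in the base selector.
base = {
    "Index Numbers ; All groups CPI ; Australia ;": DID,
    "Index Numbers": UNIT,
}

# Each source then differs only in frequency; `|` builds a new dict each time
# and leaves `base` unchanged.
monthly_selector = base | {"Month": FREQ}
quarterly_selector = base | {"Quarter": FREQ}
```

Each selector would then be paired with its data and meta to form one (data, meta, selector) source, listed highest priority first.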