readabs.read_abs_cat

Download timeseries data from the Australian Bureau of Statistics.

Download timeseries data from the Australian Bureau of Statistics (ABS) for a specified ABS catalogue identifier.

  1"""Download *timeseries* data from the Australian Bureau of Statistics.
  2
  3Download timeseries data from the Australian Bureau of Statistics (ABS)
  4for a specified ABS catalogue identifier.
  5"""
  6
  7import calendar
  8from functools import cache
  9from typing import Any, Unpack
 10
 11import pandas as pd
 12from pandas import DataFrame
 13
 14from readabs.abs_meta_data import metacol
 15from readabs.grab_abs_url import grab_abs_url, grab_abs_zip
 16from readabs.read_support import HYPHEN, ReadArgs
 17
 18# Constants
 19MAX_DATETIME_CHARS = 20
 20TABLE_DESC_ROW = 4
 21TABLE_DESC_COL = 1
 22
 23
 24# --- functions ---
 25# - public -
 26@cache  # minimise slowness for any repeat business
 27def read_abs_cat(
 28    cat: str,
 29    **kwargs: Unpack[ReadArgs],
 30) -> tuple[dict[str, DataFrame], DataFrame]:
 31    """For a specific catalogue identifier, return the complete ABS Catalogue information as DataFrames.
 32
 33    This function returns the complete ABS Catalogue information as a
 34    python dictionary of pandas DataFrames, as well as the associated metadata
 35    in a separate DataFrame. The function automates the collection of zip and
 36    excel files from the ABS website. If necessary, these files are downloaded,
 37    and saved into a cache directory. The files are then parsed to extract time
 38    series data, and the associated metadata.
 39
 40    By default, the cache directory is `./.readabs_cache/`. You can change the
 41    default directory name by setting the shell environment variable
 42    `READABS_CACHE_DIR` with the name of the preferred directory.
 43
 44    Parameters
 45    ----------
 46    cat : str
 47        The ABS Catalogue Number for the data to be downloaded and made
 48        available by this function. This argument must be specified in the
 49        function call.
 50
 51    **kwargs : Unpack[ReadArgs]
 52        The following parameters may be passed as optional keyword arguments.
 53
 54    url : str = ""
 55        The URL of an ABS landing page. Use this for discontinued series
 56        that are no longer in the ABS Time Series Directory. If provided,
 57        data will be retrieved from this URL instead of looking up the
 58        catalogue number. Example:
 59        `read_abs_cat(cat="8501.0", url="https://www.abs.gov.au/.../jun-2025")`
 60
 61    keep_non_ts : bool = False
 62        A flag for whether to keep the non-time-series tables
 63        that might form part of an ABS catalogue item. Normally, the
 64        non-time-series information is ignored, and not made available to
 65        the user.
 66
 67    history : str = ""
 68        Provide a month-year string to extract historical ABS data.
  69        For example, you can set history="dec-2023" to get the ABS data
 70        for a catalogue identifier that was originally published in respect
 71        of Q4 of 2023. Note: not all ABS data sources are structured so that
 72        this technique works in every case; but most are.
 73
 74    verbose : bool = False
 75        Setting this to true may help diagnose why something
 76        might be going wrong with the data retrieval process.
 77
 78    ignore_errors : bool = False
 79        Normally, this function will cease downloading when
  80        an error is encountered. However, sometimes the ABS website has
  81        malformed links, and changing this setting is necessary. (Note:
  82        if you drop a message to the ABS, they will usually fix broken
  83        links within a business day).
 84
 85    get_zip : bool = True
 86        Download the excel files in .zip files.
 87
 88    get_excel_if_no_zip : bool = True
 89        Only try to download .xlsx files if there are no zip
 90        files available to be downloaded. Only downloading individual excel
 91        files when there are no zip files to download can speed up the
 92        download process.
 93
 94    get_excel : bool = False
 95        The default value means that excel files are not
  96        automatically downloaded. Note: at least one of `get_zip`,
 97        `get_excel_if_no_zip`, or `get_excel` must be true. For most ABS
 98        catalogue items, it is sufficient to just download the one zip
 99        file. But note, some catalogue items do not have a zip file.
100        Others have quite a number of zip files.
101
102    single_excel_only : str = ""
103        If this argument is set to a table name (without the
104        .xlsx extension), only that excel file will be downloaded. If
105        set, and only a limited subset of available data is needed,
106        this can speed up download times significantly. Note: overrides
107        `get_zip`, `get_excel_if_no_zip`, `get_excel` and `single_zip_only`.
108
109    selected_excel : tuple[str, ...] = ()
110        If set to a tuple of table names (without the .xlsx extension),
111        only those excel files will be downloaded. Useful when several
112        specific tables are needed and downloading the full zip would
113        be wasteful. Example:
114        `selected_excel=("62020001", "62020017", "62020X28")`.
115        Must be a tuple (not a list) because `read_abs_cat` uses an
116        internal cache that requires hashable arguments. Note: overrides
117        `get_zip`, `get_excel_if_no_zip`, `get_excel` and `single_zip_only`
118        when at least one matching file is found.
119
120    single_zip_only : str = ""
121        If this argument is set to a zip file name (without
122        the .zip extension), only that zip file will be downloaded.
123        If set, and only a limited subset of available data is needed,
124        this can speed up download times significantly. Note: overrides
125        `get_zip`, `get_excel_if_no_zip`, and `get_excel`.
126
127    cache_only : bool = False
128        If set to True, this function will only access
129        data that has been previously cached. Normally, the function
130        checks the date of the cache data against the date of the data
131        on the ABS website, before deciding whether the ABS has fresher
132        data that needs to be downloaded to the cache.
133
 134    zip_file : str | Path = ""
135        If set to a specific zip file name (with or without the .zip
136        extension), this function will only extract data from that zip file
137        on the local file system. This may be useful for debugging purposes.
138
139    Returns
140    -------
141    tuple[dict[str, DataFrame], DataFrame]
142        The function returns a tuple of two items. The first item is a
143        python dictionary of pandas DataFrames (which is the primary data
144        associated with the ABS catalogue item). The second item is a
145        DataFrame of ABS metadata for the ABS collection.
146
147        Note:
148        You can retrieve non-timeseries data using the grab_abs_url()
149        function. That takes the URL for the ABS landing page for the ABS
150        collection you are interested in. The read_abs_cat function is for
151        ABS catalogue identifiers which are timeseries data, for which the
152        metadata can be extracted.
153
154    Example
155    -------
156
157    ```python
158    import readabs as ra
159    from pandas import DataFrame
160    cat_num = "6202.0"  # The ABS labour force survey
161    data: tuple[dict[str, DataFrame], DataFrame] = ra.read_abs_cat(cat=cat_num)
162    abs_dict, meta = data
163    ```
164
165    """
166    # --- get the time series data ---
167    if kwargs.get("zip_file"):
168        raw_abs_dict = grab_abs_zip(kwargs["zip_file"], **kwargs)
169    else:
170        raw_abs_dict = grab_abs_url(cat=cat, **kwargs)
171    response = _get_time_series_data(cat, raw_abs_dict, **kwargs)
172
173    if not response:
174        response = {}, DataFrame()
175
176    return response  # dictionary of DataFrames, and a DataFrame of metadata
177
178
179# - private -
180def _get_time_series_data(
181    cat: str,
182    abs_dict: dict[str, DataFrame],
183    **kwargs: Any,  # keep_non_ts, verbose, ignore_errors
184) -> tuple[dict[str, DataFrame], DataFrame]:
185    """Extract the time series data for a specific ABS catalogue identifier."""
186    # --- set up ---
187    cat = "<catalogue number missing>" if not cat.strip() else cat.strip()
188    new_dict: dict[str, DataFrame] = {}
189    meta_data = DataFrame()
190
191    # --- group the sheets and iterate over these groups
192    long_groups = _group_sheets(abs_dict)
193    for table, sheets in long_groups.items():
194        args = {
195            "cat": cat,
196            "from_dict": abs_dict,
197            "table": table,
198            "long_sheets": sheets,
199        }
200        new_dict, meta_data = _capture(new_dict, meta_data, args, **kwargs)
201    return new_dict, meta_data
202
203
204def _copy_raw_sheets(
205    from_dict: dict[str, DataFrame],
206    long_sheets: list[str],
207    to_dict: dict[str, DataFrame],
208    *,
209    keep_non_ts: bool,
210) -> dict[str, DataFrame]:
211    """Copy the raw sheets across to the final dictionary.
212
213    Used if the data is not in a timeseries format, and keep_non_ts
214    flag is set to True. Returns an updated final dictionary.
215    """
216    if not keep_non_ts:
217        return to_dict
218
219    for sheet in long_sheets:
220        if sheet in from_dict:
221            to_dict[sheet] = from_dict[sheet]
222        else:
223            # should not happen
224            raise ValueError(f"Glitch: Sheet {sheet} not found in the data.")
225    return to_dict
226
227
228def _capture(
229    to_dict: dict[str, DataFrame],
230    meta_data: DataFrame,
231    args: dict[str, Any],
232    **kwargs: Any,  # keep_non_ts, ignore_errors
233) -> tuple[dict[str, DataFrame], DataFrame]:
234    """Capture the time series data and meta data from an Excel file.
235
236    For a specific Excel file, capture *both* the time series data
237    from the ABS data files as well as the meta data. These data are
238    added to the input 'to_dict' and 'meta_data' respectively, and
239    the combined results are returned as a tuple.
240    """
241    # --- step 0: set up ---
242    keep_non_ts: bool = kwargs.get("keep_non_ts", False)
243    ignore_errors: bool = kwargs.get("ignore_errors", False)
244
245    # --- step 1: capture the meta data ---
246    short_names = [x.split(HYPHEN, 1)[1] for x in args["long_sheets"]]
247    if "Index" not in short_names:
248        print(f"Table {args['table']} has no 'Index' sheet.")
249        to_dict = _copy_raw_sheets(args["from_dict"], args["long_sheets"], to_dict, keep_non_ts=keep_non_ts)
250        return to_dict, meta_data
251    index = short_names.index("Index")
252
253    index_sheet = args["long_sheets"][index]
254    this_meta = _capture_meta(args["cat"], args["from_dict"], index_sheet)
255    if this_meta.empty:
256        to_dict = _copy_raw_sheets(args["from_dict"], args["long_sheets"], to_dict, keep_non_ts=keep_non_ts)
257        return to_dict, meta_data
258
259    meta_data = pd.concat([meta_data, this_meta], axis=0)
260
261    # --- step 2: capture the actual time series data ---
262    data = _capture_data(meta_data, args["from_dict"], args["long_sheets"], **kwargs)
263    if len(data):
264        to_dict[args["table"]] = data
265    else:
266        # a glitch: we have the metadata but not the actual data
267        error = f"Unexpected: {args['table']} has no actual data."
268        if not ignore_errors:
269            raise ValueError(error)
270        print(error)
271        to_dict = _copy_raw_sheets(args["from_dict"], args["long_sheets"], to_dict, keep_non_ts=keep_non_ts)
272
273    return to_dict, meta_data
274
275
276def _capture_data(
277    abs_meta: DataFrame,
278    from_dict: dict[str, DataFrame],
279    long_sheets: list[str],
280    **kwargs: Any,  # verbose
281) -> DataFrame:
282    """Take a list of ABS data sheets and stitch them into a DataFrame.
283
284    Find the DataFrames for those sheets in the from_dict, and stitch them
285    into a single DataFrame with an appropriate PeriodIndex.
286    """
287    # --- step 0: set up ---
288    verbose: bool = kwargs.get("verbose", False)
289    merged_data = DataFrame()
290    header_row: int = 8
291
292    # --- step 1: capture the time series data ---
293    # identify the data sheets in the list of all sheets from Excel file
294    data_sheets = [x for x in long_sheets if x.split(HYPHEN, 1)[1].startswith("Data")]
295
296    for sheet_name in data_sheets:
297        if verbose:
 298            print(f"About to capture data from {sheet_name=}")
299
300        # --- capture just the data, nothing else
301        sheet_data = from_dict[sheet_name].copy()
302
303        # get the columns
304        header = sheet_data.iloc[header_row]
305        sheet_data.columns = pd.Index(header)
306        sheet_data = sheet_data[(header_row + 1) :]
307
308        # get the row indexes
309        sheet_data = _index_to_period(sheet_data, sheet_name, abs_meta, verbose=verbose)
310
311        # --- merge data into a single dataframe
312        if len(merged_data) == 0:
313            merged_data = sheet_data
314        else:
315            merged_data = merged_data.merge(
316                right=sheet_data,
317                how="outer",
318                left_index=True,
319                right_index=True,
320                suffixes=("", ""),
321            )
322
323    # --- step 2 - final tidy-ups
324    # remove NA rows
325    merged_data = merged_data.dropna(how="all")
326    # check for NA columns - rarely happens
327    # Note: these empty columns are not removed,
328    # but it is useful to know they are there
329    if merged_data.isna().all().any() and verbose:
330        na_cols = merged_data.columns[merged_data.isna().all()]
331        print(f"Caution: These columns are all NA: {list(na_cols)}")
332
333    # check for duplicate columns - should not happen
334    # Note: these duplicate columns are removed
335    duplicates = merged_data.columns.duplicated()
336    if duplicates.any():
337        if verbose:
338            dup_table = abs_meta[metacol.table].iloc[0]
339            print(f"Note: duplicates removed from {dup_table}: " + f"{merged_data.columns[duplicates]}")
340        merged_data = merged_data.loc[:, ~duplicates].copy()
341
342    # make the data all floats.
343    return merged_data.astype(float).sort_index()
344
345
346def _index_to_period(sheet_data: DataFrame, sheet_name: str, abs_meta: DataFrame, *, verbose: bool) -> DataFrame:
347    """Convert the index of a DataFrame to a PeriodIndex."""
348    index_column = sheet_data[sheet_data.columns[0]].astype(str)
349    sheet_data = sheet_data.drop(sheet_data.columns[0], axis=1)
350    long_row_names = index_column.str.len() > MAX_DATETIME_CHARS  # 19 chars in datetime str
351    if verbose and long_row_names.any():
352        print(f"You may need to check index column for {sheet_name}")
353    index_column = index_column.loc[~long_row_names]
354    sheet_data = sheet_data.loc[~long_row_names]
355
356    proposed_index = pd.to_datetime(index_column)
357
358    # get the correct period index
359    short_name = sheet_name.split(HYPHEN, 1)[0]
360    series_id = sheet_data.columns[0]
361    freq_value = abs_meta[abs_meta[metacol.table] == short_name].loc[series_id, metacol.freq]
362    freq = str(freq_value).upper().strip()[0]
363    freq = "Y" if freq == "A" else freq  # pandas prefers yearly
364    freq = "Q" if freq == "B" else freq  # treat Biannual as quarterly
365    if freq not in ("Y", "Q", "M", "D"):
366        print(f"Check the frequency of the data in sheet: {sheet_name}")
367
368    # create an appropriate period index
369    if freq:
370        if freq in ("Q", "Y"):
371            month = str(calendar.month_abbr[proposed_index.dt.month.max()]).upper()
372            freq = f"{freq}-{month}"
373        sheet_data.index = pd.PeriodIndex(proposed_index, freq=freq)
374    else:
 375        raise ValueError(f"With sheet {sheet_name} could not determine PeriodIndex")
376
377    return sheet_data
378
379
380def _capture_meta(
381    cat: str,
382    from_dict: dict[str, DataFrame],
383    index_sheet: str,
384) -> DataFrame:
385    """Capture the metadata from the Index sheet of an ABS excel file.
386
387    Returns a DataFrame specific to the current excel file.
388    Returning an empty DataFrame, means that the meta data could not
389    be identified. Meta data for each ABS data item is organised by row.
390    """
391    # --- step 0: set up ---
392    frame = from_dict[index_sheet]
393
394    # --- step 1: check if the metadata is present in the right place ---
395    # Unfortunately, the header for some of the 3401.0
396    #                spreadsheets starts on row 10
397    starting_rows = 8, 9, 10
398    required = metacol.did, metacol.id, metacol.stype, metacol.unit
399    required_set = set(required)
400
401    header_row = None
402    header_columns = None
403    for row in starting_rows:
404        columns = frame.iloc[row]
405        if required_set.issubset(set(columns)):
406            header_row = row
407            header_columns = columns
408            break
409
410    if header_row is None or header_columns is None:
411        print(f"Table has no metadata in sheet {index_sheet}.")
412        return DataFrame()
413
414    # --- step 2: capture the metadata ---
415    file_meta = frame.iloc[header_row + 1 :].copy()
416    file_meta.columns = pd.Index(header_columns)
417
418    # make damn sure there are no rogue white spaces
419    for col in required:
420        file_meta[col] = file_meta[col].str.strip()
421
422    # remove empty columns and rows
423    file_meta = file_meta.dropna(how="all", axis=1).dropna(how="all", axis=0)
424
425    # populate the metadata
426    file_meta[metacol.table] = index_sheet.split(HYPHEN, 1)[0]
427    tab_desc_value = frame.iloc[TABLE_DESC_ROW, TABLE_DESC_COL]
428    tab_desc = str(tab_desc_value).split(".", 1)[-1].strip()
429    file_meta[metacol.tdesc] = tab_desc
430    file_meta[metacol.cat] = cat
431
432    # drop last row - should just be copyright statement
433    file_meta = file_meta.iloc[:-1]
434
435    # set the index to the series_id
436    file_meta.index = pd.Index(file_meta[metacol.id])
437
438    return file_meta
439
440
441def _group_sheets(
442    abs_dict: dict[str, DataFrame],
443) -> dict[str, list[str]]:
444    """Group the sheets from an Excel file."""
445    keys = list(abs_dict.keys())
446    long_pairs = [(x.split(HYPHEN, 1)[0], x) for x in keys]
447
448    def group(p_list: list[tuple[str, str]]) -> dict[str, list[str]]:
449        groups: dict[str, list[str]] = {}
450        for x, y in p_list:
451            if x not in groups:
452                groups[x] = []
453            groups[x].append(y)
454        return groups
455
456    return group(long_pairs)
457
458
459# --- initial testing ---
460if __name__ == "__main__":
461
462    def simple_test() -> None:
463        """Test the read_abs_cat function."""
464        # ABS Catalogue ID 8731.0 has a mix of time
465        # series and non-time series data. Also,
466        # it has unusually structured Excel files. So, a good test.
467
468        print("Starting test.")
469
470        d, _m = read_abs_cat("8731.0", keep_non_ts=False, verbose=False)
471        print(f"--- {len(d)=} ---")
472        print(f"--- {d.keys()=} ---")
473        for table in d:
474            freq_str = getattr(d[table].index, "freqstr", "Unknown")
475            print(f"{table=} {d[table].shape=} {freq_str=}")
476
 477        print("=" * 20)
478
479        d, _m = read_abs_cat("", zip_file=".test-data/Qrtly-CPI-Time-series-spreadsheets-all.zip", verbose=False)
480        print(f"--- {len(d)=} ---")
481        print(f"--- {d.keys()=} ---")
482        for table in d:
483            freq_str = getattr(d[table].index, "freqstr", "Unknown")
484            print(f"{table=} {d[table].shape=} {freq_str=}")
485
486        print("Test complete.")
487
488    simple_test()
MAX_DATETIME_CHARS = 20
TABLE_DESC_ROW = 4
TABLE_DESC_COL = 1
@cache
def read_abs_cat( cat: str, **kwargs: Unpack[readabs.ReadArgs]) -> tuple[dict[str, pandas.DataFrame], pandas.DataFrame]:

For a specific catalogue identifier, return the complete ABS Catalogue information as DataFrames.

This function returns the complete ABS Catalogue information as a python dictionary of pandas DataFrames, as well as the associated metadata in a separate DataFrame. The function automates the collection of zip and excel files from the ABS website. If necessary, these files are downloaded, and saved into a cache directory. The files are then parsed to extract time series data, and the associated metadata.

By default, the cache directory is ./.readabs_cache/. You can change the default directory name by setting the shell environment variable READABS_CACHE_DIR with the name of the preferred directory.
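As a minimal sketch, the cache directory could also be redirected from within Python rather than from the shell. This assumes the variable is consulted when data is fetched rather than when the package is imported, which the text above does not confirm; the directory name `my_abs_cache` is purely illustrative.

```python
import os
from pathlib import Path

# Hypothetical directory name, chosen only for illustration.
cache_dir = Path("my_abs_cache")

# Set the variable for this process (and any child processes).
# Assumption: readabs reads READABS_CACHE_DIR when it fetches data,
# so set it before calling read_abs_cat().
os.environ["READABS_CACHE_DIR"] = str(cache_dir)

print(os.environ["READABS_CACHE_DIR"])
```

Setting the variable in the shell profile instead has the same effect and applies to every session.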

Parameters

cat : str The ABS Catalogue Number for the data to be downloaded and made available by this function. This argument must be specified in the function call.

**kwargs : Unpack[ReadArgs] The following parameters may be passed as optional keyword arguments.

url : str = "" The URL of an ABS landing page. Use this for discontinued series that are no longer in the ABS Time Series Directory. If provided, data will be retrieved from this URL instead of looking up the catalogue number. Example: read_abs_cat(cat="8501.0", url="https://www.abs.gov.au/.../jun-2025")

keep_non_ts : bool = False A flag for whether to keep the non-time-series tables that might form part of an ABS catalogue item. Normally, the non-time-series information is ignored, and not made available to the user.

history : str = "" Provide a month-year string to extract historical ABS data. For example, you can set history="dec-2023" to get the ABS data for a catalogue identifier that was originally published in respect of Q4 of 2023. Note: not all ABS data sources are structured so that this technique works in every case; but most are.

verbose : bool = False Setting this to true may help diagnose why something might be going wrong with the data retrieval process.

ignore_errors : bool = False Normally, this function will cease downloading when an error is encountered. However, sometimes the ABS website has malformed links, and changing this setting is necessary. (Note: if you drop a message to the ABS, they will usually fix broken links within a business day).

get_zip : bool = True Download the excel files in .zip files.

get_excel_if_no_zip : bool = True Only try to download .xlsx files if there are no zip files available to be downloaded. Only downloading individual excel files when there are no zip files to download can speed up the download process.

get_excel : bool = False The default value means that excel files are not automatically downloaded. Note: at least one of get_zip, get_excel_if_no_zip, or get_excel must be true. For most ABS catalogue items, it is sufficient to just download the one zip file. But note, some catalogue items do not have a zip file. Others have quite a number of zip files.

single_excel_only : str = "" If this argument is set to a table name (without the .xlsx extension), only that excel file will be downloaded. If set, and only a limited subset of available data is needed, this can speed up download times significantly. Note: overrides get_zip, get_excel_if_no_zip, get_excel and single_zip_only.

selected_excel : tuple[str, ...] = () If set to a tuple of table names (without the .xlsx extension), only those excel files will be downloaded. Useful when several specific tables are needed and downloading the full zip would be wasteful. Example: selected_excel=("62020001", "62020017", "62020X28"). Must be a tuple (not a list) because read_abs_cat uses an internal cache that requires hashable arguments. Note: overrides get_zip, get_excel_if_no_zip, get_excel and single_zip_only when at least one matching file is found.
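The tuple requirement follows from how `functools.cache` builds its lookup key: every argument value is hashed. The sketch below illustrates this with a stand-in function named `fetch`, which is hypothetical and not part of readabs; it only echoes its arguments.

```python
from functools import cache


@cache
def fetch(cat: str, **kwargs: object) -> str:
    """Stand-in for a cached reader; it only echoes its arguments."""
    return f"{cat}: {sorted(kwargs)}"


# A tuple is hashable, so the call can be cached.
ok = fetch("6202.0", selected_excel=("62020001", "62020017"))

# A list is not hashable, so building the cache key raises TypeError
# before the function body runs.
try:
    fetch("6202.0", selected_excel=["62020001", "62020017"])
    list_failed = False
except TypeError:
    list_failed = True

print(ok, list_failed)
```

The same constraint applies to any other keyword argument passed to a cached function, so sequences should be converted with `tuple(...)` before the call.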

single_zip_only : str = "" If this argument is set to a zip file name (without the .zip extension), only that zip file will be downloaded. If set, and only a limited subset of available data is needed, this can speed up download times significantly. Note: overrides get_zip, get_excel_if_no_zip, and get_excel.

cache_only : bool = False If set to True, this function will only access data that has been previously cached. Normally, the function checks the date of the cache data against the date of the data on the ABS website, before deciding whether the ABS has fresher data that needs to be downloaded to the cache.

zip_file : str | Path = "" If set to a specific zip file name (with or without the .zip extension), this function will only extract data from that zip file on the local file system. This may be useful for debugging purposes.

Returns

tuple[dict[str, DataFrame], DataFrame] The function returns a tuple of two items. The first item is a python dictionary of pandas DataFrames (which is the primary data associated with the ABS catalogue item). The second item is a DataFrame of ABS metadata for the ABS collection.

Note:
You can retrieve non-timeseries data using the grab_abs_url()
function. That takes the URL for the ABS landing page for the ABS
collection you are interested in. The read_abs_cat function is for
ABS catalogue identifiers which are timeseries data, for which the
metadata can be extracted.
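One way to navigate the returned pair is to find a row in the metadata and use it to pull the matching series out of the dictionary. The sketch below uses small synthetic stand-ins rather than a real download. The column labels `"Table"`, `"Series ID"` and `"Data Item Description"`, the table name and the series identifiers are all assumptions made for illustration; check the actual metadata columns before relying on them.

```python
import pandas as pd

# Synthetic stand-ins for the two returned items (not real ABS data).
index = pd.period_range("2024-01", periods=3, freq="M")
abs_dict = {
    "6202001": pd.DataFrame(
        {"A1": [1.0, 2.0, 3.0], "A2": [4.0, 5.0, 6.0]}, index=index
    ),
}
# Assumed column labels; the metadata is indexed by series identifier.
meta = pd.DataFrame(
    {
        "Table": ["6202001", "6202001"],
        "Series ID": ["A1", "A2"],
        "Data Item Description": ["Employed", "Unemployed"],
    }
).set_index("Series ID", drop=False)

# Find the metadata row for the wanted item, then pull its series
# from the matching table in the dictionary.
row = meta[meta["Data Item Description"] == "Unemployed"].iloc[0]
series = abs_dict[row["Table"]][row["Series ID"]]
print(series)
```

Each metadata row names the table that holds its series, so a single lookup is enough to move from a description to the data.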

Example

import readabs as ra
from pandas import DataFrame
cat_num = "6202.0"  # The ABS labour force survey
data: tuple[dict[str, DataFrame], DataFrame] = ra.read_abs_cat(cat=cat_num)
abs_dict, meta = data