readabs.read_abs_cat
Download timeseries data from the Australian Bureau of Statistics.
Download timeseries data from the Australian Bureau of Statistics (ABS) for a specified ABS catalogue identifier.
````python
"""Download *timeseries* data from the Australian Bureau of Statistics.

Download timeseries data from the Australian Bureau of Statistics (ABS)
for a specified ABS catalogue identifier.
"""

import calendar
from functools import cache
from typing import Any, Unpack

import pandas as pd
from pandas import DataFrame

from readabs.abs_meta_data import metacol
from readabs.grab_abs_url import grab_abs_url, grab_abs_zip
from readabs.read_support import HYPHEN, ReadArgs

# Constants
MAX_DATETIME_CHARS = 20
TABLE_DESC_ROW = 4
TABLE_DESC_COL = 1


# --- functions ---
# - public -
@cache  # minimise slowness for any repeat business
def read_abs_cat(
    cat: str,
    **kwargs: Unpack[ReadArgs],
) -> tuple[dict[str, DataFrame], DataFrame]:
    """For a specific catalogue identifier, return the complete ABS Catalogue information as DataFrames.

    This function returns the complete ABS Catalogue information as a
    python dictionary of pandas DataFrames, as well as the associated metadata
    in a separate DataFrame. The function automates the collection of zip and
    excel files from the ABS website. If necessary, these files are downloaded,
    and saved into a cache directory. The files are then parsed to extract time
    series data, and the associated metadata.

    By default, the cache directory is `./.readabs_cache/`. You can change the
    default directory name by setting the shell environment variable
    `READABS_CACHE_DIR` with the name of the preferred directory.

    Parameters
    ----------
    cat : str
        The ABS Catalogue Number for the data to be downloaded and made
        available by this function. This argument must be specified in the
        function call.

    **kwargs : Unpack[ReadArgs]
        The following parameters may be passed as optional keyword arguments.

    url : str = ""
        The URL of an ABS landing page. Use this for discontinued series
        that are no longer in the ABS Time Series Directory. If provided,
        data will be retrieved from this URL instead of looking up the
        catalogue number. Example:
        `read_abs_cat(cat="8501.0", url="https://www.abs.gov.au/.../jun-2025")`

    keep_non_ts : bool = False
        A flag for whether to keep the non-time-series tables
        that might form part of an ABS catalogue item. Normally, the
        non-time-series information is ignored, and not made available to
        the user.

    history : str = ""
        Provide a month-year string to extract historical ABS data.
        For example, you can set history="dec-2023" to get the ABS data
        for a catalogue identifier that was originally published in respect
        of Q4 of 2023. Note: not all ABS data sources are structured so that
        this technique works in every case; but most are.

    verbose : bool = False
        Setting this to true may help diagnose why something
        might be going wrong with the data retrieval process.

    ignore_errors : bool = False
        Normally, this function will cease downloading when
        an error is encountered. However, sometimes the ABS website has
        malformed links, and changing this setting is necessary. (Note:
        if you drop a message to the ABS, they will usually fix broken
        links within a business day).

    get_zip : bool = True
        Download the excel files in .zip files.

    get_excel_if_no_zip : bool = True
        Only try to download .xlsx files if there are no zip
        files available to be downloaded. Only downloading individual excel
        files when there are no zip files to download can speed up the
        download process.

    get_excel : bool = False
        The default value means that excel files are not
        automatically downloaded. Note: at least one of `get_zip`,
        `get_excel_if_no_zip`, or `get_excel` must be true. For most ABS
        catalogue items, it is sufficient to just download the one zip
        file. But note, some catalogue items do not have a zip file.
        Others have quite a number of zip files.

    single_excel_only : str = ""
        If this argument is set to a table name (without the
        .xlsx extension), only that excel file will be downloaded. If
        set, and only a limited subset of available data is needed,
        this can speed up download times significantly. Note: overrides
        `get_zip`, `get_excel_if_no_zip`, `get_excel` and `single_zip_only`.

    selected_excel : tuple[str, ...] = ()
        If set to a tuple of table names (without the .xlsx extension),
        only those excel files will be downloaded. Useful when several
        specific tables are needed and downloading the full zip would
        be wasteful. Example:
        `selected_excel=("62020001", "62020017", "62020X28")`.
        Must be a tuple (not a list) because `read_abs_cat` uses an
        internal cache that requires hashable arguments. Note: overrides
        `get_zip`, `get_excel_if_no_zip`, `get_excel` and `single_zip_only`
        when at least one matching file is found.

    single_zip_only : str = ""
        If this argument is set to a zip file name (without
        the .zip extension), only that zip file will be downloaded.
        If set, and only a limited subset of available data is needed,
        this can speed up download times significantly. Note: overrides
        `get_zip`, `get_excel_if_no_zip`, and `get_excel`.

    cache_only : bool = False
        If set to True, this function will only access
        data that has been previously cached. Normally, the function
        checks the date of the cache data against the date of the data
        on the ABS website, before deciding whether the ABS has fresher
        data that needs to be downloaded to the cache.

    zip_file: str | Path = ""
        If set to a specific zip file name (with or without the .zip
        extension), this function will only extract data from that zip file
        on the local file system. This may be useful for debugging purposes.

    Returns
    -------
    tuple[dict[str, DataFrame], DataFrame]
        The function returns a tuple of two items. The first item is a
        python dictionary of pandas DataFrames (which is the primary data
        associated with the ABS catalogue item). The second item is a
        DataFrame of ABS metadata for the ABS collection.

    Note:
        You can retrieve non-timeseries data using the grab_abs_url()
        function. That takes the URL for the ABS landing page for the ABS
        collection you are interested in. The read_abs_cat function is for
        ABS catalogue identifiers which are timeseries data, for which the
        metadata can be extracted.

    Example
    -------

    ```python
    import readabs as ra
    from pandas import DataFrame
    cat_num = "6202.0"  # The ABS labour force survey
    data: tuple[dict[str, DataFrame], DataFrame] = ra.read_abs_cat(cat=cat_num)
    abs_dict, meta = data
    ```

    """
    # --- get the time series data ---
    if kwargs.get("zip_file"):
        raw_abs_dict = grab_abs_zip(kwargs["zip_file"], **kwargs)
    else:
        raw_abs_dict = grab_abs_url(cat=cat, **kwargs)
    response = _get_time_series_data(cat, raw_abs_dict, **kwargs)

    if not response:
        response = {}, DataFrame()

    return response  # dictionary of DataFrames, and a DataFrame of metadata


# - private -
def _get_time_series_data(
    cat: str,
    abs_dict: dict[str, DataFrame],
    **kwargs: Any,  # keep_non_ts, verbose, ignore_errors
) -> tuple[dict[str, DataFrame], DataFrame]:
    """Extract the time series data for a specific ABS catalogue identifier."""
    # --- set up ---
    cat = "<catalogue number missing>" if not cat.strip() else cat.strip()
    new_dict: dict[str, DataFrame] = {}
    meta_data = DataFrame()

    # --- group the sheets and iterate over these groups
    long_groups = _group_sheets(abs_dict)
    for table, sheets in long_groups.items():
        args = {
            "cat": cat,
            "from_dict": abs_dict,
            "table": table,
            "long_sheets": sheets,
        }
        new_dict, meta_data = _capture(new_dict, meta_data, args, **kwargs)

    return new_dict, meta_data


def _copy_raw_sheets(
    from_dict: dict[str, DataFrame],
    long_sheets: list[str],
    to_dict: dict[str, DataFrame],
    *,
    keep_non_ts: bool,
) -> dict[str, DataFrame]:
    """Copy the raw sheets across to the final dictionary.

    Used if the data is not in a timeseries format, and keep_non_ts
    flag is set to True. Returns an updated final dictionary.
    """
    if not keep_non_ts:
        return to_dict

    for sheet in long_sheets:
        if sheet in from_dict:
            to_dict[sheet] = from_dict[sheet]
        else:
            # should not happen
            raise ValueError(f"Glitch: Sheet {sheet} not found in the data.")
    return to_dict


def _capture(
    to_dict: dict[str, DataFrame],
    meta_data: DataFrame,
    args: dict[str, Any],
    **kwargs: Any,  # keep_non_ts, ignore_errors
) -> tuple[dict[str, DataFrame], DataFrame]:
    """Capture the time series data and meta data from an Excel file.

    For a specific Excel file, capture *both* the time series data
    from the ABS data files as well as the meta data. These data are
    added to the input 'to_dict' and 'meta_data' respectively, and
    the combined results are returned as a tuple.
    """
    # --- step 0: set up ---
    keep_non_ts: bool = kwargs.get("keep_non_ts", False)
    ignore_errors: bool = kwargs.get("ignore_errors", False)

    # --- step 1: capture the meta data ---
    short_names = [x.split(HYPHEN, 1)[1] for x in args["long_sheets"]]
    if "Index" not in short_names:
        print(f"Table {args['table']} has no 'Index' sheet.")
        to_dict = _copy_raw_sheets(args["from_dict"], args["long_sheets"], to_dict, keep_non_ts=keep_non_ts)
        return to_dict, meta_data
    index = short_names.index("Index")

    index_sheet = args["long_sheets"][index]
    this_meta = _capture_meta(args["cat"], args["from_dict"], index_sheet)
    if this_meta.empty:
        to_dict = _copy_raw_sheets(args["from_dict"], args["long_sheets"], to_dict, keep_non_ts=keep_non_ts)
        return to_dict, meta_data

    meta_data = pd.concat([meta_data, this_meta], axis=0)

    # --- step 2: capture the actual time series data ---
    data = _capture_data(meta_data, args["from_dict"], args["long_sheets"], **kwargs)
    if len(data):
        to_dict[args["table"]] = data
    else:
        # a glitch: we have the metadata but not the actual data
        error = f"Unexpected: {args['table']} has no actual data."
        if not ignore_errors:
            raise ValueError(error)
        print(error)
        to_dict = _copy_raw_sheets(args["from_dict"], args["long_sheets"], to_dict, keep_non_ts=keep_non_ts)

    return to_dict, meta_data


def _capture_data(
    abs_meta: DataFrame,
    from_dict: dict[str, DataFrame],
    long_sheets: list[str],
    **kwargs: Any,  # verbose
) -> DataFrame:
    """Take a list of ABS data sheets and stitch them into a DataFrame.

    Find the DataFrames for those sheets in the from_dict, and stitch them
    into a single DataFrame with an appropriate PeriodIndex.
    """
    # --- step 0: set up ---
    verbose: bool = kwargs.get("verbose", False)
    merged_data = DataFrame()
    header_row: int = 8

    # --- step 1: capture the time series data ---
    # identify the data sheets in the list of all sheets from Excel file
    data_sheets = [x for x in long_sheets if x.split(HYPHEN, 1)[1].startswith("Data")]

    for sheet_name in data_sheets:
        if verbose:
            print(f"About to capture data from {sheet_name=}")

        # --- capture just the data, nothing else
        sheet_data = from_dict[sheet_name].copy()

        # get the columns
        header = sheet_data.iloc[header_row]
        sheet_data.columns = pd.Index(header)
        sheet_data = sheet_data[(header_row + 1) :]

        # get the row indexes
        sheet_data = _index_to_period(sheet_data, sheet_name, abs_meta, verbose=verbose)

        # --- merge data into a single dataframe
        if len(merged_data) == 0:
            merged_data = sheet_data
        else:
            merged_data = merged_data.merge(
                right=sheet_data,
                how="outer",
                left_index=True,
                right_index=True,
                suffixes=("", ""),
            )

    # --- step 2 - final tidy-ups
    # remove NA rows
    merged_data = merged_data.dropna(how="all")
    # check for NA columns - rarely happens
    # Note: these empty columns are not removed,
    # but it is useful to know they are there
    if merged_data.isna().all().any() and verbose:
        na_cols = merged_data.columns[merged_data.isna().all()]
        print(f"Caution: These columns are all NA: {list(na_cols)}")

    # check for duplicate columns - should not happen
    # Note: these duplicate columns are removed
    duplicates = merged_data.columns.duplicated()
    if duplicates.any():
        if verbose:
            dup_table = abs_meta[metacol.table].iloc[0]
            print(f"Note: duplicates removed from {dup_table}: " + f"{merged_data.columns[duplicates]}")
        merged_data = merged_data.loc[:, ~duplicates].copy()

    # make the data all floats.
    return merged_data.astype(float).sort_index()


def _index_to_period(sheet_data: DataFrame, sheet_name: str, abs_meta: DataFrame, *, verbose: bool) -> DataFrame:
    """Convert the index of a DataFrame to a PeriodIndex."""
    index_column = sheet_data[sheet_data.columns[0]].astype(str)
    sheet_data = sheet_data.drop(sheet_data.columns[0], axis=1)
    long_row_names = index_column.str.len() > MAX_DATETIME_CHARS  # 19 chars in datetime str
    if verbose and long_row_names.any():
        print(f"You may need to check index column for {sheet_name}")
    index_column = index_column.loc[~long_row_names]
    sheet_data = sheet_data.loc[~long_row_names]

    proposed_index = pd.to_datetime(index_column)

    # get the correct period index
    short_name = sheet_name.split(HYPHEN, 1)[0]
    series_id = sheet_data.columns[0]
    freq_value = abs_meta[abs_meta[metacol.table] == short_name].loc[series_id, metacol.freq]
    freq = str(freq_value).upper().strip()[0]
    freq = "Y" if freq == "A" else freq  # pandas prefers yearly
    freq = "Q" if freq == "B" else freq  # treat Biannual as quarterly
    if freq not in ("Y", "Q", "M", "D"):
        print(f"Check the frequency of the data in sheet: {sheet_name}")

    # create an appropriate period index
    if freq:
        if freq in ("Q", "Y"):
            month = str(calendar.month_abbr[proposed_index.dt.month.max()]).upper()
            freq = f"{freq}-{month}"
        sheet_data.index = pd.PeriodIndex(proposed_index, freq=freq)
    else:
        raise ValueError(f"With sheet {sheet_name} could not determine PeriodIndex")

    return sheet_data


def _capture_meta(
    cat: str,
    from_dict: dict[str, DataFrame],
    index_sheet: str,
) -> DataFrame:
    """Capture the metadata from the Index sheet of an ABS excel file.

    Returns a DataFrame specific to the current excel file.
    Returning an empty DataFrame, means that the meta data could not
    be identified. Meta data for each ABS data item is organised by row.
    """
    # --- step 0: set up ---
    frame = from_dict[index_sheet]

    # --- step 1: check if the metadata is present in the right place ---
    # Unfortunately, the header for some of the 3401.0
    # spreadsheets starts on row 10
    starting_rows = 8, 9, 10
    required = metacol.did, metacol.id, metacol.stype, metacol.unit
    required_set = set(required)

    header_row = None
    header_columns = None
    for row in starting_rows:
        columns = frame.iloc[row]
        if required_set.issubset(set(columns)):
            header_row = row
            header_columns = columns
            break

    if header_row is None or header_columns is None:
        print(f"Table has no metadata in sheet {index_sheet}.")
        return DataFrame()

    # --- step 2: capture the metadata ---
    file_meta = frame.iloc[header_row + 1 :].copy()
    file_meta.columns = pd.Index(header_columns)

    # make damn sure there are no rogue white spaces
    for col in required:
        file_meta[col] = file_meta[col].str.strip()

    # remove empty columns and rows
    file_meta = file_meta.dropna(how="all", axis=1).dropna(how="all", axis=0)

    # populate the metadata
    file_meta[metacol.table] = index_sheet.split(HYPHEN, 1)[0]
    tab_desc_value = frame.iloc[TABLE_DESC_ROW, TABLE_DESC_COL]
    tab_desc = str(tab_desc_value).split(".", 1)[-1].strip()
    file_meta[metacol.tdesc] = tab_desc
    file_meta[metacol.cat] = cat

    # drop last row - should just be copyright statement
    file_meta = file_meta.iloc[:-1]

    # set the index to the series_id
    file_meta.index = pd.Index(file_meta[metacol.id])

    return file_meta


def _group_sheets(
    abs_dict: dict[str, DataFrame],
) -> dict[str, list[str]]:
    """Group the sheets from an Excel file."""
    keys = list(abs_dict.keys())
    long_pairs = [(x.split(HYPHEN, 1)[0], x) for x in keys]

    def group(p_list: list[tuple[str, str]]) -> dict[str, list[str]]:
        groups: dict[str, list[str]] = {}
        for x, y in p_list:
            if x not in groups:
                groups[x] = []
            groups[x].append(y)
        return groups

    return group(long_pairs)


# --- initial testing ---
if __name__ == "__main__":

    def simple_test() -> None:
        """Test the read_abs_cat function."""
        # ABS Catalogue ID 8731.0 has a mix of time
        # series and non-time series data. Also,
        # it has unusually structured Excel files. So, a good test.

        print("Starting test.")

        d, _m = read_abs_cat("8731.0", keep_non_ts=False, verbose=False)
        print(f"--- {len(d)=} ---")
        print(f"--- {d.keys()=} ---")
        for table in d:
            freq_str = getattr(d[table].index, "freqstr", "Unknown")
            print(f"{table=} {d[table].shape=} {freq_str=}")

        print("=" * 20)

        d, _m = read_abs_cat("", zip_file=".test-data/Qrtly-CPI-Time-series-spreadsheets-all.zip", verbose=False)
        print(f"--- {len(d)=} ---")
        print(f"--- {d.keys()=} ---")
        for table in d:
            freq_str = getattr(d[table].index, "freqstr", "Unknown")
            print(f"{table=} {d[table].shape=} {freq_str=}")

        print("Test complete.")

    simple_test()
````
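The frequency handling in `_index_to_period` can be easier to follow in isolation. The sketch below mirrors that mapping using only the standard library. The helper name `period_freq` and its `last_month` argument are illustrative assumptions and are not part of readabs; in the module itself, the month comes from the largest month number found in the sheet's date column.

```python
import calendar


def period_freq(freq_label: str, last_month: int) -> str:
    """Map an ABS frequency label to a pandas-style period frequency string.

    Hypothetical helper: it restates the mapping in _index_to_period,
    with the anchoring month passed in as a plain integer (1-12).
    """
    freq = freq_label.upper().strip()[0]  # only the first letter matters
    freq = "Y" if freq == "A" else freq  # Annual becomes yearly
    freq = "Q" if freq == "B" else freq  # Biannual is treated as quarterly
    if freq in ("Q", "Y"):
        # quarterly and yearly periods are anchored to a closing month
        month = calendar.month_abbr[last_month].upper()
        return f"{freq}-{month}"
    return freq


print(period_freq("Quarter", 12))  # Q-DEC
print(period_freq("Annual", 6))  # Y-JUN
print(period_freq("Month", 12))  # M
```

The resulting string is what the module passes to `pd.PeriodIndex(..., freq=...)`, so a quarterly table whose dates include December ends up with periods anchored to December.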
For a specific catalogue identifier, return the complete ABS Catalogue information as DataFrames.
This function returns the complete ABS Catalogue information as a python dictionary of pandas DataFrames, as well as the associated metadata in a separate DataFrame. The function automates the collection of zip and excel files from the ABS website. If necessary, these files are downloaded, and saved into a cache directory. The files are then parsed to extract time series data, and the associated metadata.
By default, the cache directory is ./.readabs_cache/. You can change the
default directory name by setting the shell environment variable
READABS_CACHE_DIR with the name of the preferred directory.
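As a rough illustration of that lookup, the sketch below resolves a cache directory from the environment with a fallback to the documented default. The helper `resolve_cache_dir` is hypothetical; how and when readabs itself reads `READABS_CACHE_DIR` is not shown here, so setting the variable before importing the package is the cautious choice.

```python
import os
from pathlib import Path

DEFAULT_CACHE_DIR = "./.readabs_cache/"  # default named in the documentation


def resolve_cache_dir() -> Path:
    """Return the cache directory, preferring READABS_CACHE_DIR if it is set.

    Hypothetical helper for illustration; not the readabs implementation.
    """
    return Path(os.environ.get("READABS_CACHE_DIR", DEFAULT_CACHE_DIR))


# Equivalent to `export READABS_CACHE_DIR=/tmp/abs_cache` in the shell
os.environ["READABS_CACHE_DIR"] = "/tmp/abs_cache"
print(resolve_cache_dir())
```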
Parameters
cat : str The ABS Catalogue Number for the data to be downloaded and made available by this function. This argument must be specified in the function call.
**kwargs : Unpack[ReadArgs] The following parameters may be passed as optional keyword arguments.
url : str = ""
The URL of an ABS landing page. Use this for discontinued series
that are no longer in the ABS Time Series Directory. If provided,
data will be retrieved from this URL instead of looking up the
catalogue number. Example:
read_abs_cat(cat="8501.0", url="https://www.abs.gov.au/.../jun-2025")
keep_non_ts : bool = False A flag for whether to keep the non-time-series tables that might form part of an ABS catalogue item. Normally, the non-time-series information is ignored, and not made available to the user.
history : str = "" Provide a month-year string to extract historical ABS data. For example, you can set history="dec-2023" to get the ABS data for a catalogue identifier that was originally published in respect of Q4 of 2023. Note: not all ABS data sources are structured so that this technique works in every case, but most are.
verbose : bool = False Setting this to true may help diagnose why something might be going wrong with the data retrieval process.
ignore_errors : bool = False Normally, this function will cease downloading when an error is encountered. However, the ABS website sometimes has malformed links, in which case this setting needs to be changed. (Note: if you drop a message to the ABS, they will usually fix broken links within a business day).
get_zip : bool = True Download the excel files in .zip files.
get_excel_if_no_zip : bool = True Only try to download .xlsx files if there are no zip files available to be downloaded. Only downloading individual excel files when there are no zip files to download can speed up the download process.
get_excel : bool = False
The default value means that excel files are not
automatically downloaded. Note: at least one of get_zip,
get_excel_if_no_zip, or get_excel must be true. For most ABS
catalogue items, it is sufficient to just download the one zip
file. But note, some catalogue items do not have a zip file.
Others have quite a number of zip files.
single_excel_only : str = ""
If this argument is set to a table name (without the
.xlsx extension), only that excel file will be downloaded. If
set, and only a limited subset of available data is needed,
this can speed up download times significantly. Note: overrides
get_zip, get_excel_if_no_zip, get_excel and single_zip_only.
selected_excel : tuple[str, ...] = ()
If set to a tuple of table names (without the .xlsx extension),
only those excel files will be downloaded. Useful when several
specific tables are needed and downloading the full zip would
be wasteful. Example:
selected_excel=("62020001", "62020017", "62020X28").
Must be a tuple (not a list) because read_abs_cat uses an
internal cache that requires hashable arguments. Note: overrides
get_zip, get_excel_if_no_zip, get_excel and single_zip_only
when at least one matching file is found.
single_zip_only : str = ""
If this argument is set to a zip file name (without
the .zip extension), only that zip file will be downloaded.
If set, and only a limited subset of available data is needed,
this can speed up download times significantly. Note: overrides
get_zip, get_excel_if_no_zip, and get_excel.
cache_only : bool = False If set to True, this function will only access data that has been previously cached. Normally, the function checks the date of the cache data against the date of the data on the ABS website, before deciding whether the ABS has fresher data that needs to be downloaded to the cache.
zip_file: str | Path = "" If set to a specific zip file name (with or without the .zip extension), this function will only extract data from that zip file on the local file system. This may be useful for debugging purposes.
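The override notes for the download flags can be read as a precedence order. The sketch below is one possible reading, not the readabs implementation: `choose_download_mode` is a hypothetical helper, the ordering between `single_excel_only` and `selected_excel` is an assumption (the documentation does not state it), and the documented fallback when no `selected_excel` file matches is left out.

```python
def choose_download_mode(
    *,
    get_zip: bool = True,
    get_excel_if_no_zip: bool = True,
    get_excel: bool = False,
    single_excel_only: str = "",
    selected_excel: tuple[str, ...] = (),
    single_zip_only: str = "",
) -> str:
    """Describe which files would be fetched for a given set of flags.

    Hypothetical helper; the precedence below is an assumed reading
    of the parameter notes.
    """
    if single_excel_only:  # documented to override all the other flags
        return f"excel:{single_excel_only}"
    if selected_excel:  # assumed to rank just below single_excel_only
        return "excel:" + ",".join(selected_excel)
    if single_zip_only:  # documented to override the three general flags
        return f"zip:{single_zip_only}"
    if not (get_zip or get_excel_if_no_zip or get_excel):
        raise ValueError("at least one of get_zip, get_excel_if_no_zip or get_excel must be True")
    return "general"


print(choose_download_mode())  # general
print(choose_download_mode(single_zip_only="all", selected_excel=("62020001",)))  # excel:62020001
```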
Returns
tuple[dict[str, DataFrame], DataFrame] The function returns a tuple of two items. The first item is a python dictionary of pandas DataFrames (which is the primary data associated with the ABS catalogue item). The second item is a DataFrame of ABS metadata for the ABS collection.
Note:
You can retrieve non-timeseries data using the grab_abs_url()
function. That takes the URL for the ABS landing page for the ABS
collection you are interested in. The read_abs_cat function is for
ABS catalogue identifiers which are timeseries data, for which the
metadata can be extracted.
Example
```python
import readabs as ra
from pandas import DataFrame
cat_num = "6202.0"  # The ABS labour force survey
data: tuple[dict[str, DataFrame], DataFrame] = ra.read_abs_cat(cat=cat_num)
abs_dict, meta = data
```
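The requirement that `selected_excel` be a tuple follows from the way `functools.cache` builds its lookup key from the call arguments. The sketch below demonstrates this with a stand-in function, `fake_read`, which is purely illustrative and does not download anything.

```python
from functools import cache

calls: list[str] = []  # records each time the function body actually runs


@cache
def fake_read(cat: str, **kwargs: object) -> str:
    """Stand-in for a cached reader; returns a label instead of data."""
    calls.append(cat)
    return f"{cat}:{sorted(kwargs)}"


# Tuples are hashable, so the second identical call is served from the cache
fake_read("6202.0", selected_excel=("62020001", "62020017"))
fake_read("6202.0", selected_excel=("62020001", "62020017"))
print(len(calls))  # 1

# Lists are not hashable, so building the cache key fails
try:
    fake_read("6202.0", selected_excel=["62020001"])
except TypeError as err:
    print(f"TypeError: {err}")
```

A related consequence of `functools.cache` is that repeated calls with the same arguments return the same object, so it may be safer to copy a returned DataFrame before modifying it in place.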