Metadata-Version: 2.1
Name: arcae
Version: 0.2.1
Summary: Arrow bindings for casacore
Author-Email: Simon Perkins <simon.perkins@gmail.com>
License: BSD 3-Clause License
        
        Copyright (c) 2023, Rhodes University Centre for Radio Astronomy Techniques & Technologies (RATT)
        
        Redistribution and use in source and binary forms, with or without
        modification, are permitted provided that the following conditions are met:
        
        1. Redistributions of source code must retain the above copyright notice, this
           list of conditions and the following disclaimer.
        
        2. Redistributions in binary form must reproduce the above copyright notice,
           this list of conditions and the following disclaimer in the documentation
           and/or other materials provided with the distribution.
        
        3. Neither the name of the copyright holder nor the names of its
           contributors may be used to endorse or promote products derived from
           this software without specific prior written permission.
        
        THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
        AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
        IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
        DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
        FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
        DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
        SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
        CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
        OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
        OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 3
Project-URL: Repository, https://github.com/ratt-ru/arcae
Requires-Dist: appdirs
Requires-Dist: click
Requires-Dist: rich
Requires-Dist: pyarrow==13.0.0
Requires-Dist: black==22.1.0; extra == "dev"
Requires-Dist: flake8==4.0.1; extra == "dev"
Requires-Dist: tbump; extra == "dev"
Requires-Dist: duckdb; extra == "test"
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: python-casacore>=3.5.0; extra == "test"
Requires-Dist: requests; extra == "test"
Provides-Extra: dev
Provides-Extra: test
Description-Content-Type: text/x-rst

C++ and Python Arrow Bindings for casacore
==========================================


Rationale
---------

* The structure of Apache Arrow Tables is very similar to that of CASA Tables.
* Arrow Tables are easy to exchange between many different languages.
* Once in Apache Arrow format, data is easy to store in modern, cloud-native disk formats such as Parquet and ORC.
* Converting CASA Tables to Arrow in the C++ layer avoids the GIL.
* Access to non-thread-safe CASA Tables is constrained to a ThreadPool containing a single thread.
* Working in the C++ layer also allows us to write astrometric routines in C++, potentially side-stepping thread-safety
  and GIL issues with the CASA Measures server.
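
The single-thread constraint described above can be illustrated in pure Python.
This is a minimal sketch of the pattern only, not arcae's C++ implementation:
``SerialisedResource`` and ``FakeTable`` are hypothetical names introduced for the example.

.. code-block:: python

  import threading
  from concurrent.futures import ThreadPoolExecutor


  class SerialisedResource:
      """Hypothetical wrapper: all calls on the wrapped object run on one worker thread."""

      def __init__(self, factory):
          self._pool = ThreadPoolExecutor(max_workers=1)
          # Construct the resource on the worker thread as well
          self._resource = self._pool.submit(factory).result()

      def call(self, method, *args, **kwargs):
          # Look up the method here, but execute it on the single worker
          fn = getattr(self._resource, method)
          return self._pool.submit(fn, *args, **kwargs).result()

      def close(self):
          self._pool.shutdown(wait=True)


  class FakeTable:
      """Stand-in for a non-thread-safe table; records which threads touch it."""

      def __init__(self):
          self.thread_ids = set()

      def read(self, startrow, nrow):
          self.thread_ids.add(threading.get_ident())
          return list(range(startrow, startrow + nrow))


  resource = SerialisedResource(FakeTable)

  # Many caller threads submit reads concurrently
  with ThreadPoolExecutor(max_workers=4) as callers:
      futures = [callers.submit(resource.call, "read", i * 10, 3) for i in range(8)]
      results = [f.result() for f in futures]

  resource.close()

Callers may submit work from any thread, but ``FakeTable.read`` only ever executes on the pool's single worker.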


Build Wheel Locally
-------------------

In the user environment or, even better, a virtual environment:

.. code-block:: bash

  $ pip install -U pip cibuildwheel
  $ bash scripts/run_cibuildwheel.sh -p 3.10

.. warning::
  Only Linux wheels are currently supported.

Local Development
-----------------

In the directory containing the source, set up your development environment as follows:

.. code-block:: bash

  $ pip install -U pip virtualenv
  $ virtualenv -p python3.10 /venv/arcaedev
  $ . /venv/arcaedev/bin/activate
  (arcaedev) export VCPKG_TARGET_TRIPLET=x64-linux-dynamic-cxx17-abi1-dbg
  (arcaedev) pip install -e .[test]
  (arcaedev) export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(pwd)/vcpkg/installed/$VCPKG_TARGET_TRIPLET/lib
  (arcaedev) py.test -s -vvv --pyargs arcae

Usage
-----

Example usage:

.. code-block:: python

  import json
  from pprint import pprint

  import arcae
  import pandas as pd
  import pyarrow as pa
  import pyarrow.parquet as pq

  # Obtain (partial) Apache Arrow Table from a CASA Table
  casa_table = arcae.table("/path/to/measurementset.ms")
  arrow_table = casa_table.to_arrow()        # read entire table
  arrow_table = casa_table.to_arrow(10, 20)  # startrow, nrow
  assert isinstance(arrow_table, pa.Table)

  # Print JSON-encoded Table and Column keywords
  pprint(json.loads(arrow_table.schema.metadata[b"__arcae_metadata__"]))
  pprint(json.loads(arrow_table.schema.field("DATA").metadata[b"__arcae_metadata__"]))

  # Extract Arrow Table columns into numpy arrays
  time = arrow_table.column("TIME").to_numpy()
  # Nested columns currently convert to arrays of object arrays,
  # which is slow and memory hungry
  data = arrow_table.column("DATA").to_numpy()
  df = arrow_table.to_pandas()  # slow and memory hungry for the same reason

  # Write Arrow Table to parquet file
  pq.write_table(arrow_table, "measurementset.parquet")


See the test cases for further use cases.
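
Because ``to_arrow`` accepts a start row and a row count, a large table can be read in bounded pieces.
The helper below is a hypothetical sketch and not part of arcae. It only computes the
``(startrow, nrow)`` pairs, and the total row count is assumed to be known by some other means.

.. code-block:: python

  def row_chunks(total_rows, nrow):
      """Yield (startrow, nrow) pairs that cover the rows [0, total_rows)."""
      if nrow <= 0:
          raise ValueError("nrow must be positive")

      for startrow in range(0, total_rows, nrow):
          # The final chunk may be shorter than nrow
          yield startrow, min(nrow, total_rows - startrow)


  # 25 rows read 10 at a time
  chunks = list(row_chunks(25, 10))

  # Assumed usage with a table opened as in the example above:
  #   for startrow, n in row_chunks(total_rows, 50000):
  #       arrow_table = casa_table.to_arrow(startrow, n)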


Exporting Measurement Sets to Arrow Parquet Datasets
----------------------------------------------------

An export script is available:

.. code-block:: bash

  $ arcae export /path/to/the.ms --nrow 50000
  $ tree output.arrow/
  output.arrow/
  ├── ANTENNA
  │   └── data0.parquet
  ├── DATA_DESCRIPTION
  │   └── data0.parquet
  ├── FEED
  │   └── data0.parquet
  ├── FIELD
  │   └── data0.parquet
  ├── MAIN
  │   └── FIELD_ID=0
  │       └── PROCESSOR_ID=0
  │           ├── DATA_DESC_ID=0
  │           │   ├── data0.parquet
  │           │   ├── data1.parquet
  │           │   ├── data2.parquet
  │           │   └── data3.parquet
  │           ├── DATA_DESC_ID=1
  │           │   ├── data0.parquet
  │           │   ├── data1.parquet
  │           │   ├── data2.parquet
  │           │   └── data3.parquet
  │           ├── DATA_DESC_ID=2
  │           │   ├── data0.parquet
  │           │   ├── data1.parquet
  │           │   ├── data2.parquet
  │           │   └── data3.parquet
  │           └── DATA_DESC_ID=3
  │               ├── data0.parquet
  │               ├── data1.parquet
  │               ├── data2.parquet
  │               └── data3.parquet
  ├── OBSERVATION
  │   └── data0.parquet


This data can be loaded into an Arrow Dataset:

.. code-block:: python

    >>> import pyarrow as pa
    >>> import pyarrow.dataset as pad
    >>> main_ds = pad.dataset("output.arrow/MAIN")
    >>> spw_ds = pad.dataset("output.arrow/SPECTRAL_WINDOW")
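
The ``KEY=value`` directories under ``MAIN`` follow the hive partitioning convention.
As a sketch, assuming the partition keys should appear as columns, passing ``partitioning="hive"``
lets pyarrow recover them from the directory names and filter on them.
The example writes its own small dataset to a temporary directory, so the table contents
are made up for illustration and do not come from a real Measurement Set.

.. code-block:: python

  import tempfile

  import pyarrow as pa
  import pyarrow.dataset as pad

  # Illustrative data only: two partitions keyed on DATA_DESC_ID
  table = pa.table(
      {
          "DATA_DESC_ID": pa.array([0, 0, 1, 1], pa.int32()),
          "TIME": pa.array([1.0, 2.0, 3.0, 4.0], pa.float64()),
      }
  )

  with tempfile.TemporaryDirectory() as root:
      # Produces root/DATA_DESC_ID=0/... and root/DATA_DESC_ID=1/...
      pad.write_dataset(
          table,
          root,
          format="parquet",
          partitioning=["DATA_DESC_ID"],
          partitioning_flavor="hive",
      )

      # Recover DATA_DESC_ID from the directory names
      ds = pad.dataset(root, partitioning="hive")

      # Only files in the matching partition need to be read
      subset = ds.to_table(filter=pad.field("DATA_DESC_ID") == 1)
      times = subset.column("TIME").to_pylist()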

Limitations
-----------

Some edge cases have not yet been implemented, but could be with some thought.

* Not yet able to handle columns with unconstrained rank (ndim == -1). Probably simplest to convert these rows to JSON and store them as strings.
* Not yet able to handle TpRecord columns. Probably simplest to convert these rows to JSON and store them as strings.
* Not yet able to handle TpQuantity columns. These could be represented as a run-time parametric Arrow DataType.
* ``to_numpy()`` conversion of nested lists produces nested numpy arrays, instead of tensors.
  Producing tensors is `possible <daskms_ext_types_>`_ but requires some changes to how
  `C++ Extension Types are exposed in Python <arrow_python_expose_cpp_ext_types_>`_.
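
The JSON fallback suggested for record-like rows might look roughly like this.
It is a sketch of the idea only and is not implemented in arcae. The ``rows`` values are invented
for illustration, and each row is assumed to already be a JSON-serialisable Python object.

.. code-block:: python

  import json

  # Invented rows standing in for record-like cell values
  rows = [
      {"unit": "Hz", "value": 1.4e9},
      {"nested": {"flags": [True, False]}},
      {},
  ]

  # One JSON string per row; these could then be stored in a string column
  encoded = [json.dumps(row, sort_keys=True) for row in rows]

  # Readers decode each string back into a Python object
  decoded = [json.loads(s) for s in encoded]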



Etymology
---------

Noun: **arca** f (genitive **arcae**); first declension
A chest, box, coffer, safe (safe place for storing items, or anything of a similar shape)

Pronounced: `ar-ki <arcae_pronounce_>`_.


.. _daskms_ext_types: https://github.com/ratt-ru/dask-ms/blob/1ff73ce3a60ea6479e40fc8cf440fd8d077e3d26/daskms/experimental/arrow/extension_types.py#L120-L152
.. _arrow_python_expose_cpp_ext_types: https://github.com/apache/arrow/issues/33997
.. _arcae_pronounce: https://translate.google.com/?sl=la&tl=en&text=arcae%0A&op=translate
