Metadata-Version: 2.1
Name: alteruphono
Version: 0.4
Summary: Library for dealing with sound changes
Home-page: https://github.com/tresoldi/alteruphono
Author: Tiago Tresoldi
Author-email: tresoldi@shh.mpg.de
License: MIT
Description: # alteruphono
        
        [![Codacy Badge](https://api.codacy.com/project/badge/Grade/1c6218b0741d453c96c72e9504acd757)](https://app.codacy.com/manual/tresoldi/alteruphono?utm_source=github.com&utm_medium=referral&utm_content=tresoldi/alteruphono&utm_campaign=Badge_Grade_Dashboard)
        [![Build Status](https://travis-ci.org/tresoldi/alteruphono.svg?branch=master)](https://travis-ci.org/tresoldi/alteruphono)
        [![codecov](https://codecov.io/gh/tresoldi/alteruphono/branch/master/graph/badge.svg)](https://codecov.io/gh/tresoldi/alteruphono)
        
        `alteruphono` is a Python library for applying sound changes to phonetic and
        phonological representations, intended for use in simulations of language
        evolution.
        
        *Please remember that, while usable, `alteruphono` is a work-in-progress.
        The best documentation is currently to check the tests, and the
        library is not recommended for production usage.*
        
        ## Future improvements
        
        - Move from existing AST to a dictionary, mostly for speed and portability
          (even if it might make the code more verbose); should still be a frozen
          dictionary
        - Memoize `parser.__call__()` calls
        - Consider that, if a rule has alternatives, sound_classes, or other
          prolific rules in `context`, it might be necessary to
          perform a more complex merging and add back-references in
          `post` to what is matched in `ante`, which could potentially
          even mean different ASTs for forward and backward. This
          needs further and detailed investigation, or explicit
          exclusion of such rules (the user could always have the
          prolific rules in `ante` and `post`, manually doing what
          would be done here).
        - Use logging where appropriate
        - Allow different boundary symbols, including "^" and "$"
        - Add support for clusters/diphthongs
        - Add tone and other suprasegmentals
        - Add custom features
        - Research Kleene closures
        
        ## Installation
        
        In any standard Python environment, `alteruphono` can be installed with:
        
        ```
        pip install alteruphono
        ```
        
        ## How to use
        
        Detailed documentation can be found in the library source code and will
        be published along with the paper accompanying the library; a terser
        technical description is available at the end of this document.
        Consultation of the
        [sound changes provided for testing purposes](https://github.com/tresoldi/alteruphono/blob/master/resources/sound_changes.tsv)
        is also recommended.
        
        For basic usage as a library, the `.forward()` and `.backward()` functions
        can be used as wrappers for the most common circumstances. In the
        examples below, a rule `p > t / _ V` (that is, /p/ turns into /t/ when
        followed by a vowel) is applied both in forward and backward direction
        to the `/pate/` sound sequence; the `.backward()` function correctly
        returns the two possible proto-forms:
        
        ```python
        >>> import alteruphono
        >>> alteruphono.forward("# p a t e #", "p > t / _ V")
        ['#', 't', 'a', 't', 'e', '#']
        >>> alteruphono.backward("# p a t e #", "p > t / _ V")
        [['#', 'p', 'a', 't', 'e', '#'], ['#', 'p', 'a', 'p', 'e', '#']]
        ```
        
        A stand-alone command-line tool can be used to call these wrapper
        functions:
        
        ```bash
        $ alteruphono forward '# p a t e #' 'p > t / _ V'
        # t a t e #
        $ alteruphono backward '# p a t e #' 'p > t / _ V'
        ['# p a t e #', '# p a p e #']
        ```
        
        ## Elements
        
        We are not exploring every detail of the formal grammar for annotating
        sound changes, such as the flexibility with spaces and tabulations or
        equivalent symbols for the arrows; for full information, interested parties
        can consult the reference PEG grammar and the source code.
        
        AlteruPhono operates by applying ordered lists of sound changes to
        textual representations of sound sequences.
        
        Sound changes are annotated in the `A -> B / C` syntax, whose constituents
        are referred to as "source" (A), "target" (B), and "context" (C), with the
        first two being mandatory; the other elements are named "arrow" and
        "slash". When applied to segment sequences, we refer to the original
        one as "ante" and to the resulting one (which might have been modified
        or not) as "post". So, with a rule "p -> b / _ a" applied to "pape":
        
        - `p` is the "source"
        - `b` is the "target"
        - `_ a` is the "context"
        - "pape" is the "ante (sequence)"
        - "bape" is the "post (sequence)"
        
        Note that, if applied backwards, a rule will have a post sequence but
        potentially more than one ante sequence. If the rule above is applied
        backwards to the post sequence "bape", as explained in the backwards
        definition and given that we have no complementary information, the
        result is a set of ante sequences "pape" and "bape".
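        
        The relation between one post sequence and several ante sequences can be
        illustrated with a minimal sketch in plain Python. The functions
        `forward()` and `backward()` below are hypothetical helpers written for
        this example only; they handle a single rule of the form
        "source -> target / _ following" and are not the implementation used by
        the library.
        
        ```python
        from itertools import product
        
        
        def forward(sequence, source, target, following):
            """Replace `source` with `target` when the next token is `following`."""
            tokens = sequence.split()
            out = []
            for i, tok in enumerate(tokens):
                nxt = tokens[i + 1] if i + 1 < len(tokens) else None
                out.append(target if tok == source and nxt == following else tok)
            return " ".join(out)
        
        
        def backward(sequence, source, target, following):
            """Return every ante sequence that could have produced `sequence`."""
            tokens = sequence.split()
            options = []
            for i, tok in enumerate(tokens):
                nxt = tokens[i + 1] if i + 1 < len(tokens) else None
                if tok == target and nxt == following:
                    # This token may be the result of the rule, or original.
                    options.append((source, target))
                else:
                    options.append((tok,))
            candidates = {" ".join(combo) for combo in product(*options)}
            # Keep only the candidates that really yield the observed sequence.
            return sorted(
                cand
                for cand in candidates
                if forward(cand, source, target, following) == sequence
            )
        
        
        print(forward("p a p e", "p", "b", "a"))
        print(backward("b a p e", "p", "b", "a"))
        ```
        
        With no complementary information, the backward call returns both
        "b a p e" and "p a p e" as possible ante sequences.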
        
        AlteruPhono operates on sound sequences expressed in standard
        [CLDF](https://cldf.clld.org/)/[LingPy](http://lingpy.org) notation,
        derived from Cysouw's work,
        i.e., as a character string with tokens separated by single spaces.
        As such, a word like the English "chance" is represented not as
        `/tʃæns/` or `/t͡ʃæns/`, in proper IPA notation, but as `tʃ æ n s`.
        While the notation might at first seem strange, it has proven its
        advantages with extensive work on linguistic databases, as it not only
        facilitates data entry and inspection, but also makes no assumptions about
        what constitutes a segment, no matter how obvious the segmentation might
        look to a linguist. On one hand, being agnostic in terms of the segmentation
        allows the program to operate as a "dumb" machine, and on the other it allows
        researchers to operate on different kinds of segmentation if suitable for
        their research, including treating whole syllables as segments. In order
        to facilitate the potentially tedious and error-prone task of manual
        segmentation, orthographic profiles can be used as in Lexibank.
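        
        As a rough illustration of what profile-based segmentation does, the
        sketch below applies a greedy longest-match strategy over a small list of
        graphemes. The function `segment()` and the toy profile are assumptions
        made for this example; actual orthographic profiles, such as those used
        in Lexibank, are richer than a plain list.
        
        ```python
        def segment(text, graphemes):
            """Greedy longest-match segmentation of `text` into known graphemes."""
            ordered = sorted(graphemes, key=len, reverse=True)
            tokens = []
            pos = 0
            while pos < len(text):
                for grapheme in ordered:
                    if text.startswith(grapheme, pos):
                        tokens.append(grapheme)
                        pos += len(grapheme)
                        break
                else:
                    # Unknown character: keep it as a segment of its own.
                    tokens.append(text[pos])
                    pos += 1
            return " ".join(tokens)
        
        
        # Toy profile in which the affricate is a single segment.
        profile = ["tʃ", "t", "æ", "n", "s"]
        print(segment("tʃæns", profile))
        ```
        
        Removing `"tʃ"` from the profile would split the affricate into two
        segments, which shows how the choice of segmentation stays with the
        researcher.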
        
        
        ## Catalogs
        
        While they are not enforced and in some cases are not needed, such as
        when the system operates as a glorified search-and-replace, alteruphono is
        designed to operate with three main catalogs: graphemes, features, and
        segment classes.
        
        Graphemes are sequences of one or more textual characters, where most
        characters are accepted (the exceptions are listed below).
        While in most cases a grapheme will follow
        a common transcription system such as the IPA, and in most cases correspond
        to a single sound or phoneme, this is not enforced, and any sequence of
        characters (with the exception of white space, tabulation, forward
        slash, square and curly brackets, and arrow) can be used to represent
        anything defined as
        a segment in a corresponding catalog. Also note that the slash notation
        of Lexibank is supported. The default catalog distributed with alteruphono
        is based on the BIPA system of CLTS.
        
        Features are descriptors... The default is derived from BIPA descriptors,
        mostly articulatory, but we also include some distinctive feature
        systems.
        
        It is not necessary for a grapheme catalog to specify the features that
        compose each grapheme, but this severely limits the kind of operations
        possible, particularly when modelling observed or plausible sound
        changes.
        
        The default catalogs are derived from BIPA... such as in example
        
        Segment classes are just shorthands. The default distributed with AlteruPhono
        includes a number of shorthands common in the literature and mostly
        unambiguous.
        
        ## Types
        
        - A **grapheme** is a sequence of one or more textual characters representing
        a segment, such as "`a`", "`kʷʰ`".
        
        - A **bundle** is an explicit listing of features and values, as defined
        in a reference, enclosed in square brackets, such as
        "`[open,front,vowel]`" or "`[-vowel]`". Features are separated by commas,
        with optional spaces, and may carry a specific value in the format
        `feature=value` with `value` being either a logical boolean ("true" or
        "false") or a numeric value; shorthands for "true" and "false" are
        defined as the operators "+" and "-"; if no "value" is provided, it defaults
        to "true" (so that `[open,front,vowel]` is internally translated to
        `[open=true,front=true,vowel=true]`). Note on back-references here
        (experimental)
        
        - A **modifier** is a bundle of features used to modify a basic value;
        for example, if "V" defines a segment class (see item below) of vowels,
        "V[long]" would restrict the set of matched segments to long vowels.
        
        - A **segment-class** is a short-hand for a bundle of features, as defined
        in a reference, intended to match one or more segments. Segment classes
        are expressed with one or more upper-case characters, such as "C" or
        "VL" (for [consonant] and [long,vowel], respectively, in the
        default). A segment class can have a modifier.
        
        - A **marker** is a single character carrying non-segmental information.
        Defined markers are # for word boundary, . for syllable break, + for
        morpheme boundary, stress marks, and tone marks. Note that some markers,
        particularly suprasegmental features such as stress and tone, in most cases
        will not be referred to directly when writing rules, but through tiers.
        See the section on tiers.
        
        - A **focus** is a special marker, represented by an underscore, and used in
        a context to indicate the position of the source and target. See the
        discussion of contexts.
        
        - An **alternative** is a list of one or more segments (which type?)
        separated by a vertical bar, such as "b|p". While in almost all cases of
        actual usage alternatives could be expressed by bundles (such as
        "b|p" as "[plosive,bilabial]" in most inventories), using an alternative is
        in most cases preferable for legibility.
        
        - A **set** is a list of alternative segments where the order is
        significant, expressed between curly brackets and separated by commas,
        such as `{a,e}`. The order is significant in the sense that, in the
        case of a corresponding set, elements will be matched by their index:
        if `{a,e}` is matched with `{ɛ,i}`, all /a/ will become /ɛ/ and all
        /e/ will become /i/ (note how, with standard IPA descriptors, it would
        not be possible to express such raising in an unambiguous way).
        
        - A **back-reference** is a reference to a previously matched segment,
        expressed by the symbol @ and the numeric index of the segment
        (such as @2 for referring to the second element,
        the vowel /a/, in the segment sequence "b a"). As such, back-references
        allow identities to be carried: if "V s V" means any intervocalic "s" and
        "a s a" means only "s" between "a", "V s @1" means any "s" in
        intervocalic position where the two vowels are equal. Back-references
        can take a modifier.
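        
        The bundle notation described in the list above (comma-separated
        features, optional `feature=value` pairs, "+" and "-" shorthands, and a
        default of "true") can be made concrete with a small parser. This is
        only a sketch: `parse_bundle()` is a hypothetical helper, not a function
        exposed by the library, and it assumes that numeric values can be read
        as floats.
        
        ```python
        def parse_bundle(text):
            """Parse a bundle such as "[open,front,vowel]" into a dictionary."""
            text = text.strip()
            if not (text.startswith("[") and text.endswith("]")):
                raise ValueError(f"not a bundle: {text!r}")
            features = {}
            for item in text[1:-1].split(","):
                item = item.strip()
                if not item:
                    continue
                if "=" in item:
                    # Explicit value: boolean or numeric.
                    name, value = (part.strip() for part in item.split("=", 1))
                    if value in ("true", "false"):
                        features[name] = value == "true"
                    else:
                        features[name] = float(value)
                elif item[0] in "+-":
                    # "+" and "-" are shorthands for "true" and "false".
                    features[item[1:].strip()] = item[0] == "+"
                else:
                    # No value given: defaults to "true".
                    features[item] = True
            return features
        
        
        print(parse_bundle("[open,front,vowel]"))
        print(parse_bundle("[-vowel]"))
        ```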
        
        
        
        ## TODO
        
        For version 2.0:
            - Implement mapper support in the automata (also with test cases)
            - Implement parentheses support in the grammar and automata (also with
              test cases)
            - Consider moving to ANTLR
            - For the grammar, consider removing direct sound match in `segment`,
              only using `alternative` (potentially renamed to `expression` and dealt
              with in an appropriate way)
            - don't collect a `context`, but `left` and `right` already in the
              AST (i.e., remove the `position` symbol)
        
            - In Graphviz output
                - Accept a string with a description (could be the output of the
                  NLAutomata)
                - Draw borders around `source`, `target`, and `context`
                - Add indices to sequences, at least optionally
                - Accept definitions of sound classes and IPA, at least in English
        
        Old version
        
          - Use `logging` everywhere
          - Implement automatic, semi-automatic, and requested syllabification
            based on prosody strength
          - Implement both PEG grammars from separate repository
          - Add support for custom replacement functions (deciding on notation)
        
        ## Manual
        
        There are two basic elements: rules and sequences. A rule operates on
        a sequence, resulting in a single, potentially different, sequence in
        forwards direction, and in at least one, potentially different, sequence
        in backwards direction.
        
        Following the conventions and practices of CLTS, CLDF, LingPy,
        and orthographic profiles, the
        proposed notation operates on "strings", that is, text in Unicode
        characters representing a sequence of one or more segments separated
        by spaces. The most common segments are sounds as represented by Unicode
        glyphs, so that a transcription like /haʊs/ ("house" in English Received
        Pronunciation) is represented as `"h a ʊ s"`, that is, not considering
        spaces, U+0068 (LATIN SMALL LETTER H),
        U+0061 (LATIN SMALL LETTER A),
        U+028A (LATIN SMALL LETTER UPSILON), and U+0073
        (LATIN SMALL LETTER S). The usage of spaces might seem inconvenient and
        even odd at first, but the convention has proven useful over years of
        experience with phonological transcription for computer-assisted treatment,
        as it not only makes no automatic assumption of what constitutes a segment
        (for example, allowing users to work with fully atomic syllables), but
        also facilitates validation work.
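        
        The code points listed above can be checked with the standard library
        alone. The helper `describe()` is a name chosen for this example; it
        only relies on `unicodedata.name()`.
        
        ```python
        import unicodedata
        
        
        def describe(sequence):
            """List code point and Unicode name of each character, without spaces."""
            return [
                f"U+{ord(char):04X} ({unicodedata.name(char)})"
                for segment in sequence.split()
                for char in segment
            ]
        
        
        for line in describe("h a ʊ s"):
            print(line)
        ```
        
        The same kind of inspection is useful for validation, for instance to
        spot homoglyphs that look alike but have different code points.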
        
        A `rule` is a statement expressed in the `A > B / C _ D` notation, where
        C and D, both optional, express the preceding and following context.
        It is a shorthand for a common notation, internally mapped to
        `C A D > C B D`. While A and B might express something different from
        historical evolution, such as correspondence, they are respectively named
        `ante` and `post`, and the rule can be read as "the sequence of segments
        A changes into the sequence of segments B when preceded by C and followed by
        D".
        A, B, C, and D are referred to as "sequences", and are composed of one or
        more "segments". A "segment" is the basic, fundamental, atomic unit of a
        sequence.
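        
        How a rule in this notation decomposes can be sketched with plain string
        handling. The functions `split_rule()` and `expand()` are hypothetical
        and do not reflect the library's PEG grammar or its AST; they assume a
        single `>` or `->` arrow, at most one slash, and at most one focus.
        
        ```python
        import re
        
        
        def split_rule(rule):
            """Split "A > B / C _ D" into ante, post, left and right token lists."""
            change, _, context = rule.partition("/")
            ante, post = re.split(r"->|>", change, maxsplit=1)
            left, _, right = context.partition("_")
            return {
                "ante": ante.split(),
                "post": post.split(),
                "left": left.split(),
                "right": right.split(),
            }
        
        
        def expand(rule):
            """Write the rule with the context repeated on both sides of the arrow."""
            parts = split_rule(rule)
            before = parts["left"] + parts["ante"] + parts["right"]
            after = parts["left"] + parts["post"] + parts["right"]
            return " ".join(before) + " > " + " ".join(after)
        
        
        print(split_rule("p > t / _ V"))
        print(expand("s > z / V _ V"))
        ```
        
        In the expanded form the context tokens appear unchanged on both sides,
        and only the focus position differs between ante and post.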
        
        Segments can be of the following types:
        
          - Sound segments, such as phonemes (like `a` or `ʒ`) or whatever is
            defined as an atomic segment by the user (for example, full-length
            syllables such as `ba` or `ʈ͡ʂʰjou̯˨˩˦`). In most cases, a phonetic or
            phonological transcription system such as IPA or NAPA will be used; by
            default, the system operates on BIPA, which also facilitates
            normalization in terms of homoglyphs, etc.
          - A bundle of features, expressed as comma separated feature-values
            enclosed by square brackets, such as `[consonant]`, referring to all
            consonants, `[unrounded,open-mid,central,vowel]`, referring to all
            sounds matching this bundle of features (that is, `ɜ` and the same
            sound with modifiers), etc. Complex relationships and tiers can be
            expressed between segments, as described later. By default, the system
            of descriptors used by BIPA is used.
          - Sound-classes, which are common short-hands for bundles of features,
            such as `K` for `[consonant,velar]` or `R` for "resonants" (defined
            internally as `[consonant,-stop]`). A default system, expressed in
            table X, is provided, and can be replaced, modified, or extended by the
            user. Sound-classes are expressed in all upper-case.
          - Back-references, used to refer to other segments in a sequence,
            which are expressed by the at-symbol (`@`) and a
            numeric index, such as `@1` or `@3` (1-based). These are better
            explored in X.
          - Special segments related to sequences, which are
            - `_` (underscore) for the "focus" in a context (from the name by
              Hartman 2003), that is, the position where `ante` and `post` sequences
              are found
            - `#` (hash) for word boundaries
            - `.` (dot) for syllable breaks
        
        Sound segments, sound-classes, and back-references can carry a modifier,
        which is a following bundle of features that modifies the value expressed or
        referred to. For example `θ[voiced]` is equivalent to `ð`, `C[voiceless]`
        would match only voiceless consonants, `C[voiceless] ə @1[voiced]` would
        match sequences of voiceless consonants, followed by a schwa, followed by
        the corresponding voiced consonant (thus matching sequences like `p ə b`
        and `k ə g`, but not `p ə g`).
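        
        The pattern `C[voiceless] ə @1[voiced]` can be traced by hand with a toy
        feature table. Everything in the sketch below is an assumption made for
        illustration: the `FEATURES` table is a tiny hand-written stand-in for
        the BIPA descriptors, and `apply_modifier()` and `matches()` are
        hypothetical helpers that only handle this one pattern.
        
        ```python
        # Toy feature table; the real catalogs are far larger.
        FEATURES = {
            "p": {"consonant", "bilabial", "voiceless"},
            "b": {"consonant", "bilabial", "voiced"},
            "k": {"consonant", "velar", "voiceless"},
            "g": {"consonant", "velar", "voiced"},
            "ə": {"vowel"},
        }
        
        OPPOSITE = {"voiced": "voiceless", "voiceless": "voiced"}
        
        
        def apply_modifier(grapheme, feature):
            """Return the grapheme with `feature` set, or None if not in the table."""
            wanted = (FEATURES[grapheme] - {OPPOSITE[feature]}) | {feature}
            for candidate, feats in FEATURES.items():
                if feats == wanted:
                    return candidate
            return None
        
        
        def matches(sequence):
            """Check a three-segment sequence against `C[voiceless] ə @1[voiced]`."""
            first, second, third = sequence.split()
            # C[voiceless]: a consonant that is also voiceless.
            if not {"consonant", "voiceless"} <= FEATURES.get(first, set()):
                return False
            if second != "ə":
                return False
            # @1[voiced]: the first segment, modified to be voiced.
            return third == apply_modifier(first, "voiced")
        
        
        for seq in ("p ə b", "k ə g", "p ə g"):
            print(seq, matches(seq))
        ```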
        
        Other non-primitives include alternatives and sets.
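        
        For sets, the index-based correspondence (matching `{a,e}` with `{ɛ,i}`
        so that /a/ becomes /ɛ/ and /e/ becomes /i/) amounts to pairing elements
        by position. The following sketch uses two hypothetical helpers,
        `parse_set()` and `apply_sets()`, and ignores contexts entirely.
        
        ```python
        def parse_set(text):
            """Turn "{a,e}" into the ordered list ["a", "e"]."""
            return [item.strip() for item in text.strip()[1:-1].split(",")]
        
        
        def apply_sets(sequence, source_set, target_set):
            """Replace each element of the source set by its counterpart by index."""
            sources = parse_set(source_set)
            targets = parse_set(target_set)
            if len(sources) != len(targets):
                raise ValueError("corresponding sets must have the same length")
            mapping = dict(zip(sources, targets))
            return " ".join(mapping.get(tok, tok) for tok in sequence.split())
        
        
        print(apply_sets("p a t e", "{a,e}", "{ɛ,i}"))
        ```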
        
        ## How to cite
        
        If you use `alteruphono`, please cite it as:
        
        > Tresoldi, Tiago (2020). Alteruphono, a tool for simulating sound changes.
        Version 0.3. Jena. Available at: https://github.com/tresoldi/alteruphono
        
        In BibTeX:
        
        ```
        @misc{Tresoldi202alteruphono,
          author = {Tresoldi, Tiago},
          title = {Alteruphono, a tool for simulating sound changes. Version 0.3.},
          howpublished = {\url{https://github.com/tresoldi/alteruphono}},
          address = {Jena},
          year = {2020},
        }
        ```
        
        ## Author
        
        Tiago Tresoldi (tresoldi@shh.mpg.de)
        
        The author was supported during development by the
        [ERC Grant #715618](https://cordis.europa.eu/project/rcn/206320/factsheet/en)
        for the project [CALC](http://calc.digling.org)
        (Computer-Assisted Language Comparison: Reconciling Computational and Classical
        Approaches in Historical Linguistics), led by
        [Johann-Mattis List](http://www.lingulist.de).
        
Keywords: sound change,phonology,phonetics,Lautwandel
Platform: UNKNOWN
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Libraries
Description-Content-Type: text/markdown
