Metadata-Version: 2.1
Name: apple-peeler
Version: 0.1.1
Summary: Extract XML from the OS X dictionaries.
Home-page: https://github.com/solarmist/apple-peeler
License: MIT
Keywords: osx,dictionary,xml
Author: Joshua Olson
Author-email: joshua+github@solarmist.net
Maintainer: Joshua Olson
Maintainer-email: joshua+github@solarmist.net
Requires-Python: >=3.9,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: beautifulsoup4 (>=4.10.0,<5.0.0)
Requires-Dist: click (>=8.0.1,<9.0.0)
Requires-Dist: lxml (>=4.6.3,<5.0.0)
Description-Content-Type: text/markdown

# Before You Start

Apple-peeler was written using Python 3.9, though supporting earlier Python 3 versions (3.5+) should be trivial.

# Installation

    pip install apple-peeler

# Dependencies

[BeautifulSoup 4](https://beautiful-soup-4.readthedocs.io/en/latest/), [lxml](https://lxml.de), and [click](https://click.palletsprojects.com/en/8.0.x/)

# Usage

Apple tends to move the dictionaries' location from one macOS version to the next. If the dictionaries are no longer at the path below, you can tell `apple-peeler` where to look by exporting `DICT_BASE` in your environment or by passing the `--base` option described below.

    export DICT_BASE="/System/Library/AssetsV2/com_apple_MobileAsset_DictionaryServices_dictionaryOSX/"
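
To make the lookup order concrete, here is a minimal sketch of how a base directory could be resolved from an explicit value, the `DICT_BASE` variable, or the documented default. The helper name `resolve_dict_base` and the precedence (explicit value first, then environment, then default) are assumptions for illustration, not necessarily how `apple-peeler` does it internally.

```python
import os
from pathlib import Path
from typing import Optional

# Default path copied from the --base help text in this README.
DEFAULT_DICT_BASE = (
    "/System/Library/AssetsV2/"
    "com_apple_MobileAsset_DictionaryServices_dictionaryOSX/"
)


def resolve_dict_base(cli_base: Optional[str] = None) -> Path:
    """Hypothetical helper: choose the dictionary root directory.

    Assumed precedence: explicit value, then DICT_BASE, then the default.
    """
    chosen = cli_base or os.environ.get("DICT_BASE") or DEFAULT_DICT_BASE
    return Path(chosen)


if __name__ == "__main__":
    base = resolve_dict_base()
    # Report whether the directory exists on this machine.
    status = "found" if base.is_dir() else "missing"
    print(f"{base} ({status})")
```

Running the script on its own is a quick way to check whether the dictionaries are where you expect before extracting anything.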

After that, usage is straightforward.

    Usage: apple-peeler [OPTIONS]

    Extract XML from Apple Dictionary files.

    Options:
    --base DIRECTORY                The root directory of the OS X dictionaries.
                                    (Default: /System/Library/AssetsV2/com_apple
                                    _MobileAsset_DictionaryServices_dictionaryOS
                                    X/) [Env var DICT_BASE]
    --out DIRECTORY                 The path to place extracted XML files.
    -d, --dictionary [
        all|Arabic - English|Danish|Duden Dictionary Data Set I|Dutch|
        Dutch - English|French|French - English|French - German|German - English|
        Hebrew|Hindi|Hindi - English|Indonesian - English|Italian|
        Italian - English|Korean|Korean - English|New Oxford American Dictionary|
        Norwegian|Oxford American Writer's Thesaurus|
        Oxford Dictionary of English|Oxford Thesaurus of English|
        Polish - English|Portuguese|Portuguese - English|Russian|
        Russian - English|Sanseido Super Daijirin|
        Sanseido The WISDOM English-Japanese Japanese-English Dictionary|
        Simplified Chinese - English|Simplified Chinese - Japanese|Spanish|
        Spanish - English|Swedish|Thai|Thai - English|
        The Standard Dictionary of Contemporary Chinese|Traditional Chinese|
        Traditional Chinese - English|Turkish|Vietnamese - English]
                                    The dictionary to extract or 'all'.
                                    (Default: all) [Accepts multiple]
    --format-xml / --no-format-xml  Format the XML files using BeautifulSoup.
                                    (Default: False)
    --debug                         Output debug information to STDERR.
                                    (Default: False)
    --help                          Show this message and exit.
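
If you want to drive the tool from a script, the options above can be assembled into an argument list. The sketch below uses only the flags shown in the help text. The helper name `build_command` is hypothetical, and it assumes that "Accepts multiple" means `-d` is repeated once per dictionary, which is the usual Click convention.

```python
import shlex
from typing import List, Optional, Sequence


def build_command(
    out_dir: str,
    dictionaries: Sequence[str] = ("all",),
    base: Optional[str] = None,
    format_xml: bool = False,
) -> List[str]:
    """Hypothetical helper: assemble an apple-peeler argument list."""
    cmd = ["apple-peeler", "--out", out_dir]
    if base is not None:
        cmd += ["--base", base]
    for name in dictionaries:
        # Assumption: -d is passed once per dictionary name.
        cmd += ["-d", name]
    if format_xml:
        cmd.append("--format-xml")
    return cmd


if __name__ == "__main__":
    cmd = build_command("./xml", ["French", "French - English"], format_xml=True)
    # Print a shell-safe version of the command; names with spaces get quoted.
    print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where `apple-peeler` is installed. Passing a list rather than a single string avoids quoting problems with dictionary names that contain spaces.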

## Introduction

I need a ton of dictionary data for prototyping my language-learning tool, [Parsnip](https://solarmist.net/), and licensing 40 dictionaries is too expensive for a bootstrapper working on an MVP (I look forward to the day this is no longer true).

Parsnip uses natural language processing and dictionaries to decouple the word <-> sentence tug-of-war that has existed for as long as flashcards have been used for language learning. That is, should I make a word (concept) flashcard or a sentence (example) flashcard?

I care about which words I know for tracking purposes, but I want those words in context when I'm practicing. So the learning system breaks sentences down into lemmas (the dictionary forms of words) and keeps a database of example sentences in which those words appear. This resolves the conceptual tug-of-war for flashcards.

But because this removes reference data from the flashcards themselves, I need to integrate reference material directly into Parsnip's UI. [JMDict](https://www.edrdg.org/wiki/index.php/JMdict-EDICT_Dictionary_Project) is a great open-source project for this, but it covers only a single language. So I've been keeping my eyes open for people working on extracting the data from Apple's bundled dictionaries.

This has been a community effort that's spanned several years. My contribution is to collect the results, clear up some details about the file format, and package it into a general command-line tool.

## References

This is inspired by
[Reverse-Engineering Apple Dictionary](https://fmentzer.github.io/posts/2020/dictionary/)
and by the discussion on Hacker News,
[Hacker News: Reverse-Engineering Apple Dictionary (2020)](https://news.ycombinator.com/item?id=28505406). Special thanks to tim-- and enragedcacti, who introduced me to `binwalk`, and to dunham, who mentioned that the seemingly random bytes look like `int`s of payload sizes.

Additionally, I've found these posts informative:

- https://developer.apple.com/library/archive/documentation/UserExperience/Conceptual/DictionaryServicesProgGuide/prepare/prepare.html#//apple_ref/doc/uid/TP40006152-CH3-SW7
- https://jadedtuna.github.io/apple-dictionary/
- https://josephg.com/blog/reverse-engineering-apple-dictionaries/
- https://josephg.com/blog/apple-dictionaries-part-2/
- https://gist.github.com/josephg/5e134adf70760ee7e49d

