Metadata-Version: 1.1
Name: Gutenberg
Version: 0.1.1
Summary: Project Gutenberg corpus interface
Home-page: https://bitbucket.org/c-w/gutenberg/
Author: Clemens Wolff
Author-email: clemens.wolff+pypi@gmail.com
License: LICENSE.txt
Download-URL: http://pypi.python.org/pypi/Gutenberg
Description: *********
        Gutenberg
        *********
        
        
        Overview
        ========
        
        This package contains a variety of scripts to make working with the `Project
        Gutenberg <http://www.gutenberg.org>`_ body of public domain texts easier.
        
        The functionality provided by this package includes:
        
        * Downloading texts using the Project Gutenberg API.
        * Cleaning up the texts: removing headers and footers.
        * Making meta-data about the texts easily accessible through a database.
        
        
        Installation
        ============
        
        First of all, you should probably install the dependencies for the package and
        verify your install.
        
        * The recommended way of doing this is using the project's makefile. The
          command ``make virtualenv`` will install all the required dependencies for
          the package in a local directory called ``virtualenv``.
        * You might want to run the tests to see if everything installed correctly:
          ``make test``.
        * Now run ``source virtualenv/bin/activate`` and you're good to go.
        
        Another setup task you might want to run is ``make docs`` to automatically
        generate some API documentation for the project. After running the command, you
        can enjoy your documentation by pointing your browser at
        ``docs/_build/html/index.html``.
        
        
        Usage
        =====
        
        Now that you're all set-up, let's get started. There are two ways to use this
        project: from the command line and as a Python module.
        
        
        From the command line
        ---------------------
        
        The following functionality is available via the command line:
        
        * Download some texts: ``python -m gutenberg.download``.
        * Clean up a downloaded text: ``python -m gutenberg.beautify``.
        * Grab some meta-data for the texts: ``python -m gutenberg.metainfo``.
        
        For example, to download 10MB of texts to the directory `corpus`, you could run
        the following::
        
            python -m gutenberg.download ./corpus --limit=10MB
        
        You can find out more about how to run the scripts by appending ``--help`` to
        the commands listed above.
        
        
        As a module
        -----------
        
        You can also use the project as a Python package. The following snippet
        demonstrates how you'd download some texts from Project Gutenberg and then
        iterate over your freshly built corpus::
        
            from gutenberg import GutenbergCorpus
        
            # this will setup the corpus and download 10MB worth of text to ./corpus
            corpus = GutenbergCorpus()
            corpus.download(limit=10 * 10**6)
        
            # iterate over the corpus
            preview_chars = 25
            for author in corpus.authors():
                for work in author.works():
                    text = work.fulltext
                    print(u'The first {preview_chars} characters of '
                           '"{title}" by "{author}" are:\n\t"{preview}"\n'
                           .format(preview_chars=preview_chars,
                                   title=work.title,
                                   author=author.name,
                                   preview=text.replace('\n', ' ')[:preview_chars]))
        
        You can also easily drill down on specific texts and authors::
        
            shakespeare = corpus[u'Shakespeare, William']
        
            # list all the works for the author that we have currently available
            work_names = shakespeare.work_names()
            for work_num, title in enumerate(shakespeare.work_names(), start=1):
                print(u'Work {work_num} in the Shakespeare corpus: "{title}"'
                      .format(work_num=work_num,
                              title=title))
        
            # inspect a particular text
            hamlet = shakespeare[u'Hamlet'].fulltext
            to_be_or_not_to_be = u'To be, or not to be, that is the Question'
            print(u'The famous quote "{quote}" is in Hamlet at position {position}.'
                    .format(quote=to_be_or_not_to_be,
                            position=hamlet.find(to_be_or_not_to_be)))
        
        All the loading of the heavy stuff is done lazily so you can just iterate over
        authors and works at your heart's content without worrying about running out of
        memory.
        
        
        Advanced usage
        ==============
        
        You can influence how the corpus object behaves via specifying a configuration
        file when constructing the object: ``corpus =
        GutenbergCorpus.using_config('my-corpus.cfg')``. A configuration file can be
        generated from a corpus object by using
        ``corpus.write_config('path-to-config.cfg')``.
        
        The default configuration looks like this::
        
            [download]
            data_path = corpus/rawdata  # storage location of the raw Gutenberg texts
            offset = 0  # start downloading from this result page
        
            [database]
            database = corpus/gutenberg.db3  # storage location of the corpus DB
            drivername = sqlite  # the type of database to use for the corpus DB
        
            [metadata]
            metadata = corpus/metadata.json.gz  # storage location of the metadata DB
        
        More information on the different configuration options can be found in the API
        documentation of the ``gutenberg.gutenberg`` package.
        
        The corpus database stores information about the downloaded texts. The database
        has a single table, ``etexts``, with four columns: ``etextno``, ``title``,
        ``author`` and ``path``. The first column is the primary key of the table and
        represents the unique identifier of the work in the Project Gutenberg corpus.
        The remaining columns record meta-data about the work (in unicode) and a
        relative path to the raw text on disk.
        
        
        Limitations
        ===========
        
        This project *deliberately* does not include any natural language processing
        functionality. Consuming and processing the text is the responsibility of the
        client; this library merely focuses on offering a simple and easy to use
        interface to the works in the Project Gutenberg corpus.  Any linguistic
        processing can easily be done client-side e.g. using the `TextBlob
        <http://textblob.readthedocs.org>`_ library.
        
Platform: UNKNOWN
