Metadata-Version: 2.1
Name: RTFDE
Version: 0.0.2
Summary: A library for extracting HTML content from RTF encapsulated HTML as commonly found in the exchange MSG email format.
Home-page: https://github.com/seamustuohy/RTFDE
Author: seamus tuohy
Author-email: code@seamustuohy.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Intended Audience :: Developers
Classifier: Topic :: Text Processing :: Filters
Classifier: Topic :: Communications :: Email :: Filters
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: lark-parser (>=0.11)
Requires-Dist: oletools (>=0.56)
Provides-Extra: dev
Requires-Dist: lxml (>=4.6) ; extra == 'dev'
Provides-Extra: msg_parse
Requires-Dist: extract-msg (>=0.27) ; extra == 'msg_parse'

# RTFDE: RTF De-Encapsulator

A python3 library for extracting encapsulated `HTML` & `plain text` content from the `RTF` bodies of .msg files.

De-encapsulation enables previously encapsulated HTML and plain text content to be extracted and rendered as HTML and plain text instead of the encapsulating RTF content. After de-encapsulation, the HTML and plain text should differ only minimally from the original HTML or plain text content.

# Features

- De-encapsulate HTML from RTF encapsulated HTML.
- De-encapsulate plain text from RTF encapsulated text.

# Known Issues

- This library *fully* unquotes text it de-encapsulates because it does not know which text was quoted in the RTF conversion process and which text was quoted in the original html/text. So, for instance escaped [Quoted-Printable](https://en.wikipedia.org/wiki/Quoted-printable) text will be returned un-escaped.
- This library currently can't [combine attachments](https://docs.microsoft.com/en-us/openspecs/exchange_server_protocols/ms-oxrtfex/b518f0bc-468c-4218-87a7-8f8859bf5773) from a .MSG Message object with the de-encapsulated HTML. This is mostly because I could not get a good set of examples of encapsulated HTML which had attachment objects that needed to be integrated back into the body of the HTML.

# Anti-Features (I don't intend to have this library do this.)

- Extract plain text from RTF encapsulated HTML. If you want this, then you will have to parse the HTML using another library.

# Installation

**To install from the pip package.**

```
pip3 install RTFDE

```

# Usage

## De-encapsulating HTML or TEXT

```python
from RTFDE.deencapsulate import DeEncapsulator

with open('rtf_file', 'r') as fp:
    raw_rtf  = fp.read()
    rtf_obj = DeEncapsulator(raw_rtf)
    rtf_obj.deencapsulate()
    if rtf_obj.content_type == 'html':
        print(rtf_obj.html)
    else:
        print(rtf_obj.text)
```

# Contribute

Please check the [contributing guidelines](./CONTRIBUTING.md)

# License

Please see the [license file](./LICENSE) for license information on RTFDE. If you have further questions related to licensing PLEASE create an issue about it on github.


