Metadata-Version: 2.1
Name: bisheng-unstructured
Version: 0.0.1rc1
Summary: ETLs for LLMs
Home-page: https://github.com/dataelement/bisheng-unstructured
Author: DataElem
Author-email: contact@dataelem.com
License: Apache 2.0
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: PyMuPDF (==1.23.2)
Requires-Dist: beautifulsoup4 (==4.12.2)
Requires-Dist: certifi (==2023.7.22)
Requires-Dist: cffi (==1.15.1)
Requires-Dist: chardet (==5.1.0)
Requires-Dist: charset-normalizer (==3.2.0)
Requires-Dist: contourpy (==1.1.0)
Requires-Dist: cryptography (==41.0.3)
Requires-Dist: cycler (==0.11.0)
Requires-Dist: ebooklib (==0.18)
Requires-Dist: emoji (==2.8.0)
Requires-Dist: et-xmlfile (==1.1.0)
Requires-Dist: filetype (==1.2.0)
Requires-Dist: fonttools (==4.42.1)
Requires-Dist: idna (==3.4)
Requires-Dist: importlib-metadata (==6.8.0)
Requires-Dist: lxml (==4.9.3)
Requires-Dist: markdown (==3.4.4)
Requires-Dist: msg-parser (==1.2.0)
Requires-Dist: nltk (==3.8.1)
Requires-Dist: numpy (==1.24.4)
Requires-Dist: olefile (==0.46)
Requires-Dist: opencv-python (==4.8.0.76)
Requires-Dist: openpyxl (==3.1.2)
Requires-Dist: pandas (==2.0.3)
Requires-Dist: pdf2image (==1.16.3)
Requires-Dist: pdfminer-six (==20221105)
Requires-Dist: pdfplumber (==0.10.2)
Requires-Dist: pillow (==10.0.0)
Requires-Dist: pydantic (==1.10.12)
Requires-Dist: pypandoc (==1.11)
Requires-Dist: pypdfium2 (==4.18.0)
Requires-Dist: python-dateutil (==2.8.2)
Requires-Dist: python-docx (==0.8.11)
Requires-Dist: python-magic (==0.4.27)
Requires-Dist: python-pptx (==0.6.21)
Requires-Dist: pytz (==2023.3)
Requires-Dist: requests (==2.31.0)
Requires-Dist: scipy (==1.10.1)
Requires-Dist: shapely (==2.0.1)
Requires-Dist: six (==1.16.0)
Requires-Dist: tabulate (==0.9.0)
Requires-Dist: tzdata (==2023.3)
Requires-Dist: urllib3 (==1.26.16)
Requires-Dist: wheel (==0.41.0)
Requires-Dist: xlrd (==2.0.1)
Requires-Dist: xlsxwriter (==3.1.2)
Requires-Dist: zipp (==3.16.2)

## What is bisheng-unstructured?

Bisheng-unstructured is an open-source library for parsing unstructured data, built to
power LLM applications such as pretraining, fine-tuning, and prompt engineering.
It makes unstructured data processing easier and provides a consistent user experience regardless of file type.

The project is a sub-project of [bisheng](https://github.com/dataelement/bisheng).

## Key features

- High-precision PDF layout parsing
- High-precision table structure recovery
- High-precision OCR
- Token-friendly output for visual text elements such as tables and lists

## Quick start

### Start with Bisheng Platform

Use it as the chain node [ElemUnstructureLoader](https://m7a7tqsztt.feishu.cn/wiki/VpyNwTt7ZiypbdkoPuJcn5w2nxf).

### Start with DataElem Services

We provide an open cloud service for easy use. See the [free trial](https://m7a7tqsztt.feishu.cn/wiki/CTXNwpqGKiMs5FkKlPJcylfonuD).

### Install bisheng-unstructured

- Install from pip: `pip install bisheng-unstructured`
- [Quick Start Guide](https://m7a7tqsztt.feishu.cn/wiki/CTXNwpqGKiMs5FkKlPJcylfonuD)
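
After running the pip command above, a quick sanity check can confirm that the interpreter meets the package's `Requires-Python: >=3.8` constraint and that the distribution is visible to Python. The following is a minimal sketch that uses only the standard library (`importlib.metadata`); the helper names `installed_version` and `python_is_supported` are illustrative and are not part of bisheng-unstructured.

```python
import sys
from importlib import metadata
from typing import Optional, Sequence

# Taken from the package metadata: Requires-Python: >=3.8
MIN_PYTHON = (3, 8)


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def python_is_supported(version_info: Sequence[int] = sys.version_info) -> bool:
    """Check the (major, minor) interpreter version against MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    if not python_is_supported():
        print("This interpreter is older than Python 3.8.")
    version = installed_version("bisheng-unstructured")
    if version is None:
        print("bisheng-unstructured is not installed in this environment.")
    else:
        print(f"bisheng-unstructured {version} is installed.")
```

The check looks up the distribution name rather than importing a module, so it makes no assumption about the library's import path or API.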

### Use a prebuilt image

## Documentation

For guidance on installation, development, deployment, and administration, 
check out [bisheng-unstructured Docs](https://m7a7tqsztt.feishu.cn/wiki/CTXNwpqGKiMs5FkKlPJcylfonuD). 

## Issues

We appreciate any feedback, questions, or bug reports regarding this project.

To report a problem, post an [Issue](https://github.com/dataelement/bisheng/issues) and
follow the process outlined in the [Stack Overflow document](https://stackoverflow.com/help/mcve).

For questions, we recommend posting in our community GitHub [Discussions](https://github.com/dataelement/bisheng/discussions).


## Acknowledgments

bisheng-unstructured builds on the following project:

- Thanks to [unstructured](https://github.com/Unstructured-IO/unstructured) for the main framework.


