Metadata-Version: 2.1
Name: bisheng-unstructured
Version: 0.0.2.post1
Summary: ETLs for LLMs
Home-page: https://github.com/dataelement/bisheng-unstructured
Author: DataElem
Author-email: contact@dataelem.com
License: Apache 2.0
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: chardet ==5.1.0
Requires-Dist: filetype ==1.2.0
Requires-Dist: python-magic ==0.4.27
Requires-Dist: nltk ==3.8.1
Requires-Dist: tabulate ==0.9.0
Requires-Dist: requests ==2.31.0
Requires-Dist: urllib3 ==1.26.16
Requires-Dist: beautifulsoup4 ==4.12.2
Requires-Dist: emoji ==2.8.0
Requires-Dist: lxml ==4.9.3
Requires-Dist: python-docx ==0.8.11
Requires-Dist: numpy ==1.24.4
Requires-Dist: pandas ==2.0.3
Requires-Dist: python-dateutil ==2.8.2
Requires-Dist: pytz ==2023.3
Requires-Dist: six ==1.16.0
Requires-Dist: tzdata ==2023.3
Requires-Dist: ebooklib ==0.18
Requires-Dist: importlib-metadata ==6.8.0
Requires-Dist: markdown ==3.4.4
Requires-Dist: zipp ==3.16.2
Requires-Dist: msg-parser ==1.2.0
Requires-Dist: olefile ==0.46
Requires-Dist: pypandoc ==1.11
Requires-Dist: pdf2image ==1.16.3
Requires-Dist: pdfminer-six ==20221105
Requires-Dist: pdfplumber ==0.10.2
Requires-Dist: wheel ==0.41.0
Requires-Dist: pypdfium2 ==4.23.1
Requires-Dist: PyMuPDF ==1.23.2
Requires-Dist: opencv-python ==4.8.0.76
Requires-Dist: certifi ==2023.7.22
Requires-Dist: cffi ==1.15.1
Requires-Dist: charset-normalizer ==3.2.0
Requires-Dist: contourpy ==1.1.0
Requires-Dist: cryptography ==41.0.3
Requires-Dist: cycler ==0.11.0
Requires-Dist: fonttools ==4.42.1
Requires-Dist: idna ==3.4
Requires-Dist: scipy ==1.10.1
Requires-Dist: shapely ==2.0.1
Requires-Dist: pydantic ==1.10.12
Requires-Dist: pillow ==10.0.0
Requires-Dist: python-pptx ==0.6.21
Requires-Dist: xlsxwriter ==3.1.2
Requires-Dist: et-xmlfile ==1.1.0
Requires-Dist: openpyxl ==3.1.2
Requires-Dist: xlrd ==2.0.1
Requires-Dist: uvicorn
Requires-Dist: fastapi
Requires-Dist: orjson

## What is bisheng-unstructured?

bisheng-unstructured is an open-source library for parsing unstructured data, built to
power LLM workflows such as pretraining, fine-tuning, and prompt engineering.
It simplifies unstructured data processing and provides a consistent user experience across file types.

The project is a sub-project of [bisheng](https://github.com/dataelement/bisheng).

## Key features

- High-precision PDF layout parsing
- High-precision table structure recovery
- High-precision OCR
- Friendlier token processing for visual text elements such as tables and lists

## Quick start

### Start with the Bisheng platform

Use it as a chain node: [ElemUnstructureLoader](https://m7a7tqsztt.feishu.cn/wiki/VpyNwTt7ZiypbdkoPuJcn5w2nxf)

### Start with DataElem services

We provide an open cloud service for ease of use. See the [free trial](https://m7a7tqsztt.feishu.cn/wiki/CTXNwpqGKiMs5FkKlPJcylfonuD).

### Install bisheng-unstructured

- Install from pip: `pip install bisheng-unstructured`
- [Quick Start Guide](https://m7a7tqsztt.feishu.cn/wiki/CTXNwpqGKiMs5FkKlPJcylfonuD)
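
To confirm that the `pip install` step succeeded, the installed distribution can be looked up by name. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and is not part of bisheng-unstructured.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "bisheng-unstructured") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is an illustrative helper, not an API of the package.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("bisheng-unstructured is not installed; run: pip install bisheng-unstructured")
    else:
        print(f"bisheng-unstructured {found} is installed")
```

The lookup reads package metadata only, so it does not import the library or any of its pinned dependencies.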

### Use a pre-built image

## Documentation

For guidance on installation, development, deployment, and administration, 
check out [bisheng-unstructured Docs](https://m7a7tqsztt.feishu.cn/wiki/CTXNwpqGKiMs5FkKlPJcylfonuD). 

## Issues

### Reporting problems and asking questions

We appreciate any feedback, questions, or bug reports regarding this project.

To report a problem, open an [issue](https://github.com/dataelement/bisheng/issues) and
follow the process outlined in the [Stack Overflow document](https://stackoverflow.com/help/mcve).

For questions, we recommend posting in our community GitHub [Discussions](https://github.com/dataelement/bisheng/discussions).


## Acknowledgments

bisheng-unstructured builds on the following projects:

- Thanks to [unstructured](https://github.com/Unstructured-IO/unstructured) for the main framework.


