Metadata-Version: 2.1
Name: SDSParser
Version: 0.2.0
Summary: Extract chemical data from Safety Data Sheet documents
Home-page: https://github.com/astepe/sds_parser
Author: A. Stepe
Author-email: aris@stepe.email
License: LICENSE.txt
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
Requires-Dist: chardet (<3.1,>=3.0.4)
Requires-Dist: pdf2image (<1.5,>=1.4.0)
Requires-Dist: pdfminer.six (==20181108)
Requires-Dist: Pillow (<5.5.0,>=5.4.1)
Requires-Dist: pycryptodome (<3.8.0,>=3.7.3)
Requires-Dist: pytesseract (<0.3.0,>=0.2.6)
Requires-Dist: six (<2.0.0,>=1.12.0)
Requires-Dist: sortedcontainers (<2.2.0,>=2.1.0)
Requires-Dist: tabulate (<0.9.0,>=0.8.3)
Requires-Dist: tqdm (<4.32.0,>=4.31.1)

![alt text](https://github.com/astepe/sds_parser/blob/master/LogoSample_ByTailorBrands.jpg)

## SDSParser
SDSParser is a browser-based app for extracting chemical data from Safety Data Sheet documents. SDSParser will speed up your
data-entry process by eliminating the need to read through Safety Data Sheets to get the data you care about.

For a live demo, click here: [SDSParser](http://www.arisstepe.com/projects/submit_sds)

For testing purposes, here are some SDS files to download and use:
* [Fisher Scientific](https://www.fishersci.com/store/msds?partNumber=M2131&productDescription=MET+ISOBUTYL+KETONE+CR+ACS+1L&vendorId=VN00033897&countryCode=US&language=en)
* [Sigma Aldrich](https://www.sigmaaldrich.com/MSDS/MSDS/DisplayMSDSPage.do?country=US&language=en&productNumber=P5958&brand=SIGALD&PageToGoToURL=https%3A%2F%2Fwww.sigmaaldrich.com%2Fcatalog%2Fsearch%3Fterm%3Dpotassium%26interface%3DAll%26N%3D0%26mode%3Dmatch%2520partialmax%26lang%3Den%26region%3DUS%26focus%3Dproduct)

## Motivation
Built out of the need to quickly access chemical data from Safety Data Sheets for data-entry purposes. Each chemical manufacturer will stylize and structure their SDSs a little bit differently. SDSParser can easily be updated to read a new manufacturer format by adding a new set of regular expressions to match the format that that specific manufacturer uses.

## Tech/framework used
* [pdfminer](https://github.com/euske/pdfminer), a tool for extracting information from PDF documents
* [pytesseract](https://pypi.org/project/pytesseract/), a python wrapper for Google's Tesseract-OCR

## Features
Have some physical SDSs you need to scan and get data from? Have no fear, sds_parser will recognize your scanned file as an image and perform optical character recognition (ocr) to extract the text for you.

## How to install

`pip install SDSParser`

## How to use
Simply initialize `SDSParser` with an optional list of data fields you wish to extract (e.g. ['manufacturer', 'flash_point']) to the `request_keys` key-word argument. See `configs.SDSRegexes.REQUEST_KEYS` for the proper keys to use. If no keys are requested, all available data fields will be searched.

```
>>> from sdsparser import SDSParser
>>> request_keys = ['manufacturer', 'flash_point', 'specific_gravity', 'product_name', 'sara_311', 'nfpa_fire']
>>> parser = SDSParser(request_keys=request_keys)
```

Here is a list of the keys to use.
```
>>> from sdsparser.configs.SDSRegexes import REQUEST_KEYS
>>> REQUEST_KEYS
[
    'manufacturer',
    'product_name',
    'flash_point',
    'specific_gravity',
    'nfpa_fire',
    'nfpa_health',
    'nfpa_reactivity',
    'sara_311',
    'revision_date',
    'physical_state',
    'cas_number',
]
```

Call `parser.get_sds_data('path/to/ExampleSDS.pdf')` and pass in the path to your SDS document to get the sds data.

```
>>> sds_data = parser.get_sds_data('path/to/SafetyDataSheet.pdf')
```

`.get_sds_data` returns a dictionary object mapping request key names to their corresponding matches

```
>>> sds_data
{
 'manufacturer': 'Sigma-Aldrich',
 'product_name': 'Sodium dodecyl sulfate',
 'flash_point': '338 F',
 'specific_gravity': '3.2',
 'sara_311': 'Data not listed'
 'nfpa_fire': 'No data available'
}
```

If the heading for the requested data type is not found in the SDS, `.get_sds_data` will return the string 'Data not listed'.
If the heading is found, but no data is found under it, `.get_sds_data` will return the string 'No data available'.

## SDSParser-cli

In your terminal

```
path/to/sds/directory $ sdsparser parse --flash_point --specific_gravity
{'fisher_1.pdf': {'flash_point': 'No data available',
                  'specific_gravity': 'No data available'},
 'fisher_2.pdf': {'flash_point': 'No data available',
                  'specific_gravity': 'No data available'},
 'fisher_3.pdf': {'flash_point': 'No data available',
                  'specific_gravity': '1.84'},
 'fisher_5.pdf': {'flash_point': 'No data available',
                  'specific_gravity': 'No data available'}}
```
or
```
path/to/sds/directory $ sdsparser parse --csv
path/to/sds/directory $ cat sds_data.csv
Fisher,Data not listed,No data available,No data available,1,0,0,/312 Hazard CategoriesSee section 2 for more informationCWA (Clean Water Act)Not,26-Jan-2018,Powder,Data not listed
Fisher,"Salicylic acid, sodium salt",No data available,(etc...)
```
for more information
```
$ sdsparser --help
```
or
```
$ sdsparser parse --help
```

## License

MIT © [Aris Stepe](http://www.arisstepe.com/)


