Metadata-Version: 2.1
Name: bio-datasets
Version: 0.0.2
Summary: Open-source collection of biology datasets and pre-trained embeddings.
Home-page: https://github.com/DeepChainBio/datasets
Author: InstaDeep
Author-email: a.delfosse@instadeep.com
License: Apache-2.0
Platform: UNKNOWN
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Software Development
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: google-cloud-storage (==1.37.1)
Requires-Dist: pandas (==1.2.4)

# bio-datasets
Open-source collection of biology datasets and pre-trained embeddings.

## Description
bio-datasets is a collaborative framework for fetching publicly available, sequence-based protein datasets.
Pre-trained contextual embeddings are also available for these datasets.


## Installation
Install the package and its dependencies with `pip install bio-datasets`.
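
To confirm that the installation succeeded, one option is to query the installed distribution's version. The snippet below is a minimal sketch using the standard-library `importlib.metadata` module (Python 3.8+; on Python 3.7 the `importlib-metadata` backport exposes the same functions). The helper name `installed_version` is hypothetical, and the distribution name `bio-datasets` is taken from the package metadata.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("bio-datasets"))
```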

## How it works

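```python
from biodatasets import list_datasets, load_dataset

# Show the datasets that can be loaded.
print(list_datasets())

# Load a dataset by name.
my_dataset = load_dataset("test")

# Export the selected input and target variables as NumPy arrays.
X, y = my_dataset.to_npy_arrays(input_names=["peptide"], target_names=["target"])

# Fetch the pre-trained ProtBert "cls" embeddings for the peptide variable.
embeddings = my_dataset.get_embeddings(variable_name="peptide", model_name="protbert", embeddings_type="cls")
```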
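
When the dataset name comes from user input or configuration, it may help to check it against the available names before loading. The following is a minimal sketch, not part of the library: `load_checked` is a hypothetical helper, and it assumes that `list_datasets()` returns an iterable of dataset names. Stand-in functions are used so the example runs without network access.

```python
def load_checked(name, list_fn, load_fn):
    """Call load_fn(name) only if name is among the names reported by list_fn()."""
    available = list(list_fn())
    if name not in available:
        raise ValueError(f"Unknown dataset {name!r}; available: {available}")
    return load_fn(name)


# Stand-ins for list_datasets / load_dataset (hypothetical, for illustration only).
def fake_list_datasets():
    return ["test"]


def fake_load_dataset(name):
    return {"name": name}


dataset = load_checked("test", fake_list_datasets, fake_load_dataset)
print(dataset)
```

With the library installed, the same helper could be called as `load_checked("test", list_datasets, load_dataset)`, provided the assumption about the return value of `list_datasets()` holds.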


