Metadata-Version: 2.1
Name: bcpandas
Version: 0.1.7
Summary: High-level wrapper around BCP for high performance data transfers between pandas and SQL Server. No knowledge of BCP required!!
Home-page: https://github.com/yehoshuadimarsky/bcpandas
Author: Josh Dimarsky
License: UNKNOWN
Keywords: bcp mssql pandas
Platform: UNKNOWN
Classifier: Topic :: Database
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: SQL
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: pandas (>=0.22)

# bcpandas

[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![PyPI version](https://img.shields.io/pypi/v/bcpandas.svg)](https://pypi.org/project/bcpandas/)
[![Conda-Forge version](https://img.shields.io/conda/vn/conda-forge/bcpandas.svg)](https://anaconda.org/conda-forge/bcpandas)
[![GitHub license](https://img.shields.io/github/license/yehoshuadimarsky/bcpandas.svg)](https://github.com/yehoshuadimarsky/bcpandas/blob/master/LICENSE)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/bcpandas.svg)](https://pypi.python.org/pypi/bcpandas/)
[![PyPI status](https://img.shields.io/pypi/status/bcpandas.svg)](https://pypi.python.org/pypi/bcpandas/)
[![Awesome Badges](https://img.shields.io/badge/badges-awesome-green.svg)](https://github.com/Naereen/badges)


High-level wrapper around BCP for high performance data transfers between pandas and SQL Server. No knowledge of BCP required!!

## Quickstart

```python
In [1]: import pandas as pd
   ...: import numpy as np
   ...: 
   ...: from bcpandas import SqlCreds, to_sql, read_sql

In [2]: creds = SqlCreds(
   ...:     'my_server',
   ...:     'my_db',
   ...:     'my_username',
   ...:     'my_password'
   ...: )

In [3]: df = pd.DataFrame(
   ...:         data=np.ndarray(shape=(10, 6), dtype=int), 
   ...:         columns=[f"col_{x}" for x in range(6)]
   ...:     )

In [4]: df
Out[4]: 
     col_0    col_1    col_2    col_3    col_4    col_5
0  4128860  6029375  3801155  5570652  6619251  7536754
1  4849756  7536751  4456552  7143529  7471201  7012467
2  6029433  6881357  6881390  7274595  6553710  3342433
3  6619228  7733358  6029427  6488162  6357104  6553710
4  7536737  7077980  6422633  7536732  7602281  2949221
5  6357104  7012451  6750305  7536741  7340124  7274610
6  7340141  6226036  7274612  7077999  6881387  6029428
7  6619243  6226041  6881378  6553710  7209065  6029415
8  6881378  6553710  7209065  7536743  7274588  6619248
9  6226030  7209065  6619231  6881380  7274612  3014770

In [5]: to_sql(df, 'my_test_table', creds, index=False, if_exists='replace')

In [6]: df2 = read_sql('my_test_table', creds)

In [7]: df2
Out[7]: 
     col_0    col_1    col_2    col_3    col_4    col_5
0  4128860  6029375  3801155  5570652  6619251  7536754
1  4849756  7536751  4456552  7143529  7471201  7012467
2  6029433  6881357  6881390  7274595  6553710  3342433
3  6619228  7733358  6029427  6488162  6357104  6553710
4  7536737  7077980  6422633  7536732  7602281  2949221
5  6357104  7012451  6750305  7536741  7340124  7274610
6  7340141  6226036  7274612  7077999  6881387  6029428
7  6619243  6226041  6881378  6553710  7209065  6029415
8  6881378  6553710  7209065  7536743  7274588  6619248
9  6226030  7209065  6619231  6881380  7274612  3014770
```

## Requirements
- BCP Utility
    - [Windows](https://docs.microsoft.com/en-us/sql/tools/bcp-utility)
- SqlCmd Utility
    - [Windows](https://docs.microsoft.com/en-us/sql/tools/sqlcmd-utility)
- python >= 3.6
- pandas

## Benchmarks
_# TODO_

## Installation
You can download and install this package from PyPI

```
pip install bcpandas
```

or from conda
```
conda install -c conda-forge bcpandas
```

## Caveats and Limitations

Here are some caveats and limitations of bcpandas. Hopefully they will be addressed in future releases
* In the `to_sql` function:
  * If `replace` is passed to the `if_exists` parameter, the new SQL table created will make the columns all of `NVARCHAR(MAX)` type.
  * If `append` is passed to the `if_exists` parameter, if the dataframe columns don't match the SQL table columns exactly by both name and order, it will fail.
  * If there is a NaN/Null in the last column of the dataframe it will throw an error. This is due to a BCP issue. See my issue with Microsoft about this [here](https://github.com/MicrosoftDocs/sql-docs/issues/2689) .
  * Bcpandas attempts to use a delimiter that is not present in the dataframe. This is because BCP does __not__ ignore delimiter characters when surrounded by quotes, unlike CSVs - see [here](https://docs.microsoft.com/en-us/sql/relational-databases/import-export/specify-field-and-row-terminators-sql-server#characters-supported-as-terminators) in the Microsoft docs. Therefore, if all possible delimiter characters are present in the dataframe and bcpandas cannot find a delimiter to use, it will throw an error.
    * Possible delimiters are specified in `constants.py` .
* Currently the STDOUT stream from BCP and SqlCmd is not asynchronous.
* Currently this is being built with Windows in mind. Linux support is definitely easily added, it's just not in the immediate scope of the project yet. PRs are welcome.

## Motivations and Design
### Overview
Reading and writing data from pandas DataFrames to/from a SQL database is very slow using the built-in `read_sql` and `to_sql` methods, even with the newly introduced `execute_many` option. For Microsoft SQL Server, a far far faster method is to use the BCP utility provided by Microsoft. This utility is a command line tool that transfers data to/from the database and flat text files.

This package is a wrapper for seamlessly using the bcp utility from Python using a pandas DataFrame. Despite the IO hits, the fastest option by far is saving the data to a CSV file in the file system and using the bcp utility to transfer the CSV file to SQL Server. **Best of all, you don't need to know anything about using BCP at all!**

### Existing Solutions

<table>
<tr>
  <td><b>Name</b></td>
  <td><b>GitHub</b></td>
  <td><b>PyPI</b></td>
</tr>
<tr>
  <td>bcpy</td>
  <td>https://github.com/titan550/bcpy</td>
  <td>https://pypi.org/project/bcpy</td>
</tr>
<tr>
  <td>magical-sqlserver</td>
  <td>https://github.com/brennoflavio/magical-sqlserver</td>
  <td>https://pypi.org/project/magical-sqlserver/</td>
</tr>
</table>

#### bcpy
`bcpy` has several flaws:
* No support for reading from SQL, only writing to SQL
* A convoluted, overly class-based internal design
* Scope a bit too broad - deals with pandas as well as flat files

This repository aims to fix and improve on `bcpy` and the above issues by making the design choices described below.

> Much credit is due to `bcpy` for the original idea and for some of the code that was adopted and changed.

#### magical-sqlserver
`magical-sqlserver` is a library to make working with Python and SQL Server very easy. But it doesn't fit what I'm trying to do:
* No built in support for pandas DataFrame
* Larger codebase, I'm not fully comfortable with the dependency on the very heavy pymssql

### Design and Scope
The _**only**_ scope of `bcpandas` is to read and write between a pandas DataFrame and a Microsoft SQL Server database. That's it. We do _**not**_ concern ourselves with reading existing flat files to/from SQL - that introduces _way_ to much complexity in trying to parse and decode the various parts of the file, like delimiters, quote characters, and line endings. Instead, to read/write an exiting flat file, just import it via pandas into a DataFrame, and then use `bcpandas`.

The big benefit of this is that we get to precicely control all the finicky parts of the text file when we write/read it to a local file and then in the BCP utility. This lets us set library-wide defaults (maybe configurable in the future) and work with those.

For now, we are using the non-XML BCP format file type. In the future, XML format files may be added.

Both on-prem and cloud (Azure, AWS, etc.) versions of SQL Server are supported.

## Testing
Testing uses `pytest`. A local SQL Server is spun up using Docker.

## Contributing
Please, all contributions are very welcome! 

I will attempt to use the `pandas` docstring style as detailed [here](https://pandas.pydata.org/pandas-docs/stable/development/contributing_docstring.html).


