Metadata-Version: 2.1
Name: Effulge
Version: 0.0.1
Summary: A small package used to find data variances
Home-page: https://github.com/Regish/Effulge
Author: Regish
Author-email: regishdhanush@gmail.com
Project-URL: Bug Tracker, https://github.com/Regish/Effulge/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyspark (>=3.0.0)
Requires-Dist: pandas (>=1.3.0)
Requires-Dist: XlsxWriter (>=3.0.3)

# Effulge

## Use Case
- When we have two pyspark dataframes with valid Primary Key
- and we need to find the attributes that are mismatching between the two dataframes.
-----
## Example
Lets consider two dataframes "**expectation**" and "**reality**".
### *Inputs*
If Primary Key is (*ProductID, Colour*)
and if the contents of "**expectation**" are -
<table border=solid>
  <tr><th style="background-color:lightgreen">ProductID</th><th>ProductName</th><th style="background-color:lightgreen">Colour</th><th>UnitPrice</th><th>Quantity</th><th>Fragile</th><th>Gift</th></tr>
  <tr><td>1001</td><td>GelPen</td><td>Blue</td><td>10</td><td>2</td><td>0</td><td>0</td></tr>
  <tr><td>1001</td><td>GelPen</td><td>Black</td><td>10</td><td>1</td><td>0</td><td>0</td></tr>
  <tr><td>1001</td><td>GelPen</td><td>Red</td><td>10</td><td>1</td><td>0</td><td>0</td></tr>
  <tr><td>1002</td><td>InkPen</td><td>Blue</td><td>50</td><td>1</td><td>0</td><td>1</td></tr>
  <tr><td>1003</td><td>InkBottle</td><td>Blue</td><td>35</td><td>1</td><td>1</td><td>1</td></tr>
  <tr><td>1004</td><td>Pencil</td><td>Grey</td><td>3</td><td>5</td><td>0</td><td>0</td></tr>
  <tr><td>1005</td><td>Eraser</td><td>White</td><td>2</td><td>2</td><td>0</td><td>0</td></tr>
  <tr><td>1006</td><td>Sharpner</td><td>Orange</td><td>3</td><td>1</td><td>0</td><td>0</td></tr>
  <tr><td>1006</td><td>Sharpner</td><td>Steel</td><td>5</td><td>1</td><td>0</td><td>0</td></tr>
  <tr><td>1007</td><td>GeometryBox</td><td>Green</td><td>40</td><td>0</td><td>0</td><td></td></tr>
</table>

And if contents of "**reality**" are -
<table border=solid>
  <tr><th style="background-color:lightgreen">ProductID</th><th>ProductName</th><th style="background-color:lightgreen">Colour</th><th>UnitPrice</th><th>Quantity</th><th>Fragile</th><th>Gift</th></tr>
  <tr><td>1001</td><td>GelPen</td><td>Blue</td><td>10</td><td>2</td><td>0</td><td>0</td></tr>
  <tr><td>1001</td><td>GelPen</td><td>Black</td><td>10</td><td style="background-color:yellow">7</td><td>0</td><td>0</td></tr>
  <tr><td>1001</td><td>GelPen</td><td>Red</td><td>10</td><td>1</td><td>0</td><td>0</td></tr>
  <tr><td>1002</td><td>InkPen</td><td>Blue</td><td>50</td><td>1</td><td>0</td><td>1</td></tr>
  <tr><td>1003</td><td>InkBottle</td><td>Blue</td><td style="background-color:yellow">3</td><td>1</td><td>1</td><td style="background-color:yellow">0</td></tr>
  <tr style="background-color:orange"><td>1003</td><td>WaterBottle</td><td>Blue</td><td>20</td><td>2</td><td>0</td><td>0</td></tr>
  <tr><td>1004</td><td>Pencil</td><td>Grey</td><td>3</td><td>5</td><td>0</td><td>0</td></tr>
  <tr><td>1005</td><td>Eraser</td><td style="background-color:yellow">Whiteee</td><td>2</td><td>2</td><td>0</td><td>0</td></tr>
  <tr><td>1006</td><td>Sharpner</td><td>Orange</td><td>3</td><td>1</td><td>0</td><td>0</td></tr>
  <tr><td>1006</td><td>Sharpner</td><td>Steel</td><td>5</td><td>1</td><td>0</td><td>0</td></tr>
  <tr><td>1007</td><td>GeometryBox</td><td>Green</td><td>40</td><td>0</td><td style="background-color:yellow">1</td><td></td></tr>
</table>

### *Output*
Then, Effulge will produce an output dataframe with following contents -
<table border=solid>
  <tr><th style="background-color:lightgreen">productid</th><th style="background-color:lightgreen">colour</th><th>EFFULGE_VARIANCE_PROVOKER</th></tr>
  <tr><td>1007</td><td>Green</td><td>[fragile]</td></tr>
  <tr><td>1001</td><td>Black</td><td>[quantity]</td></tr>
  <tr><td>1003</td><td>Blue</td><td>[fragile, gift, productname, quantity, unitprice]</td></tr>
  <tr><td>1003</td><td>Blue</td><td>[gift, unitprice]</td></tr>
  <tr><td>1005</td><td>White</td><td>[MISSING_PRIMARY_KEY]</td></tr>
  <tr><td>1003</td><td>Blue</td><td>[DUPLICATE_PRIMARY_KEY]</td></tr>
</table>

-----

## Usage:

```python
from effulge import spot_variance

# Initialize SparkSession
# Load data into dataframes, let's say they are called "df_expectation" and "df_reality"
# Declare a tuple with valid primary key, let's say it is called "primary_key"

output = spot_variance(df_expectation, df_reality, primary_key)
output.show()


# to generate variance report
from effulge import save_variance_report
save_variance_report(
    variance_df=output, source_df=df_expectation, target_df=df_reality,
    super_key=primary_key, file_name="effulge_variance_report",
    src_prefix='SRC', tgt_prefix='TGT'
)

```
-----
