Metadata-Version: 2.1
Name: babypandas
Version: 0.1.7
Summary: A restricted Pandas API
Home-page: https://github.com/afraenkel/babypandas
Author: Aaron Fraenkel, Darren Liu
Author-email: afraenkel@ucsd.edu
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas (>=0.24)
Requires-Dist: numpy

# babypandas

A `pandas` data-analysis library with a restricted API

[![Build Status](https://travis-ci.com/afraenkel/babypandas.svg?branch=master)](https://travis-ci.com/afraenkel/babypandas)
[![Documentation Status](https://readthedocs.org/projects/babypandas/badge/?version=latest)](https://babypandas.readthedocs.io/en/latest/?badge=latest)

---

The `pandas` library is a confusing mess of methods, and for every
task, no matter how simple, there are multiple ways of approaching
it. `babypandas` is a simplified, introductory `pandas` library that
 allows for basic tabular data-analysis with only a small subset of
 methods and arguments. This allows for a smooth transition into `pandas`.
 
 The chosen methods are meant to align with the methods in Berkeley's
 `datascience` module, developed for the [data8](https://data8.org)
 course. However, unlike the `datascience` module, all code written in
 `babypandas` is also valid `pandas` code.
  
---

## Install

To install `babypandas`, use `pip`:
```
pip install babypandas
```

---

## Documentation

See the [documentation](https://babypandas.readthedocs.io) page.

---

## FAQ

*Who is this library for?*

This library is intended for those wanting an introduction to data
 science in python, but want a focused, introduction much like what's
 covered in Berkeley's data8 course. The pandas methods available in
 this library encourage better Pandas usage through functional
 programming  patterns and method chaining.
  
*Why not just use the datascience module?*

This library may be prefered over `datascience` when students
 will be moving to `pandas`. While this library serves as a restricted
 introduction to `pandas`, it doesn't shy away from some `pandas`
 usage patterns that may require care for new programmers:
 * The frequent use of named function arguments,
 * The use of boolean arrays (masks) to select rows,
 * The use of table indices.

*How does this library compare to the datascience module?*

Berkeley `datascience` module equivalents with `babypandas`:

| `datascience` method               		  | `babypandas` equivalent or close                           | method description                        |
|---------------------------------------------|------------------------------------------------------------|-------------------------------------------|
| `Table()`                            		  | `bpd.DataFrame()`                                          | empty table formation                     |
| `Table().with_columns(*labels_and_values)`  | `bpd.DataFrame().assign(**kwargs)`                         | table from lists                          |
| `table.with_columns(*labels_and_values)`    | `df.assign(**kwargs)`                                      | adding columns                            |
| `table.with_rows(rows)`                     | `df.append(other_df, ignore_index=True)`                   |                                           |
| `Table.read_table(filepath)`                | `bpd.read_csv(filepath)`                                   | read in data                              |
| `table.num_columns`                         | `df.shape[1]`                                              | number of columns                         |
| `table.num_rows`                    		  | `df.shape[0]`                                              | number of rows                            |
| `table.labels`                      		  | `df.columns`                                               | list of columns                           |
| `table.relabeled(label, new_label)` 		  | `df.assign(new_label=df.get(label)).drop(columns=[label])` | rename columns                            |
| `table.column(col)`                 		  | `df.get(col)`                                     		   | get a specific column (by name)           |
| `table.column(col).item(0)`         		  | `df.get(col).iloc[0]`                                      | get a specific value in the table         |
| `table.select(col1, col2)`          	      | `df.get([col1, col2])`                                     | get columns as a df                       |
| `table.drop(col1, col2)`            		  | `df.drop(columns=[col1, col2])`                            | drop columns                              |
| `table.sort(col)`                  		  | `df.sort_values(by=col)`                                   | sorts values in a dataframe by col        |
| `table.take(row_indices_or_slice)`          | `df.take(row_indices_or_slice)`                            | selects a single row                      |
| `table.where(col, are.above(num))`   		  | `df.loc[df.get(col) > num]`                                | selects rows based on condition           |
| `table.scatter(xcol, ycol)`                 | `df.plot(kind='scatter', x=xcol, y=ycol)`                  | plots a scatter plot                      |
| `table.plot(xcol, ycol)`                    | `df.plot(x=xcol, y=ycol)`                                  | plots a line plot                         |
| `table.barh(col)`                           | `df.plot(kind='barh', x=col)`                              | plots a horizontal bar plot               |
| `table.hist(col, bins)`                     | `df.get(col).plot(kind='hist', bins=bins)`                 | plots a histogram                         |
| `table.apply(fn, col)`                      | `df.get(col).apply(fn)`                                    | apply function to a column                |
| `table.group(col)`                   		  | `df.groupby(col).count()`                                  | give counts of values in a col            |
| `table.group(col, agg_fn)`           		  | `df.groupby(col).agg_fn.reset_index()`                     | groups by column, aggregates with fn      |
| `table.group([col1, col2])`          		  | `df.groupby([col1, col2]).count().reset_index()`           | groups by two cols, agg with counts       |
| `table.group([col1, col2], sum)`     		  | `df.groupby[col1, col2]).sum().reset_index()`              | groups by two cols, agg with sum          |
| `table.join(leftcol, df2, rightcol)` 		  | `df.merge(df2, left_on=leftcol, right_on=rightcol)`        | merges two dataframes (diff col names)    |
| `table.join(col, df2, col)`          		  | `df.merge(df2, on=col)`                                    | merges two dataframes (same col names)    |
| `table.sample(n)`                    		  | `df.sample(n, replace=True)`                               | sample with replacement                   |
| `sample_proportions(size, distr)`    		  | `np.random.multinomial(size, distr) / size`                | gets sample proportions of a distribution |
