Metadata-Version: 2.1
Name: babe
Version: 0.0.7
Summary: Data access and analysis of baby names statistics
Home-page: https://github.com/thorwhalen/babe
Author: Thor Whalen
License: mit
Description: # babe
        
        Note that the first time you import name, you need to have access to the Internet, and it will take a few seconds (depending on bandwidth) to download the required data.
        
        But this data is automatically saved in a local file so things are faster the next time around.
        
        To install:
        
        ```pip install babe```
        
        Then in a python console or notebook...
        
        
        ```python
        from babe import UsNames
        
        d = UsNames()
        ```
        
        # Intro to the data
        
        The fundamental data provides a popularity score (number of babies recorded) associated to a `(state, gender, name, year)` tuple (that has data -- for names of babies born in the US between 1910 and 2019).
        
        
        ```python
        d.data
        ```
        
        
        <table border="1" class="dataframe">
          <thead>
            <tr style="text-align: right;">
              <th></th>
              <th>state</th>
              <th>gender</th>
              <th>year</th>
              <th>name</th>
              <th>popularity</th>
              <th>name_g</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <th>0</th>
              <td>AK</td>
              <td>F</td>
              <td>1910</td>
              <td>Mary</td>
              <td>14</td>
              <td>Mary_F</td>
            </tr>
            <tr>
              <th>1</th>
              <td>AK</td>
              <td>F</td>
              <td>1910</td>
              <td>Annie</td>
              <td>12</td>
              <td>Annie_F</td>
            </tr>
            <tr>
              <th>2</th>
              <td>AK</td>
              <td>F</td>
              <td>1910</td>
              <td>Anna</td>
              <td>10</td>
              <td>Anna_F</td>
            </tr>
            <tr>
              <th>3</th>
              <td>AK</td>
              <td>F</td>
              <td>1910</td>
              <td>Margaret</td>
              <td>8</td>
              <td>Margaret_F</td>
            </tr>
            <tr>
              <th>4</th>
              <td>AK</td>
              <td>F</td>
              <td>1910</td>
              <td>Helen</td>
              <td>7</td>
              <td>Helen_F</td>
            </tr>
            <tr>
              <th>...</th>
              <td>...</td>
              <td>...</td>
              <td>...</td>
              <td>...</td>
              <td>...</td>
              <td>...</td>
            </tr>
            <tr>
              <th>28277</th>
              <td>WY</td>
              <td>M</td>
              <td>2019</td>
              <td>Theo</td>
              <td>5</td>
              <td>Theo_M</td>
            </tr>
            <tr>
              <th>28278</th>
              <td>WY</td>
              <td>M</td>
              <td>2019</td>
              <td>Tristan</td>
              <td>5</td>
              <td>Tristan_M</td>
            </tr>
            <tr>
              <th>28279</th>
              <td>WY</td>
              <td>M</td>
              <td>2019</td>
              <td>Vincent</td>
              <td>5</td>
              <td>Vincent_M</td>
            </tr>
            <tr>
              <th>28280</th>
              <td>WY</td>
              <td>M</td>
              <td>2019</td>
              <td>Warren</td>
              <td>5</td>
              <td>Warren_M</td>
            </tr>
            <tr>
              <th>28281</th>
              <td>WY</td>
              <td>M</td>
              <td>2019</td>
              <td>Waylon</td>
              <td>5</td>
              <td>Waylon_M</td>
            </tr>
          </tbody>
        </table>
        <p>6122890 rows × 6 columns</p>
        </div>
        
        
        
        
        ```python
        print(f"{len(d.names)} unique names")
        ```
        
            31862 unique names
        
        
        But some names can be used for both genders, so most of the internals will use `name_g`, the name concatenated with the gender (`_F` or `_M`):
        
        
        ```python
        print(f"{len(d.name_gs)} unique names_g (gendered names)")
        ```
        
            34952 unique names_g (gendered names)
        
        
        You can use `resolve_name_g` to get the `name_g` corresponding to a name as long as the name isn't used for more than one gender.
        
        
        ```python
        d.resolve_name_g('Cora')
        ```
        
        
        
        
            'Cora_F'
        
        
        
        
        ```python
        try:
            d.resolve_name_g('Vanessa')
        except AssertionError as e:
            print(e)
        ```
        
            The Vanessa can be used for both genders. Specify Vanessa_F or Vanessa_M
        
        
        ## by_state data
        
        In some cases, it's more convenient to have a view indexed by `(state, name_g, year)`. 
        The `by_state` attribute provides that.
        
        
        ```python
        d.by_state
        ```
        
        
        
        
            state  name_g      year
            AK     Mary_F      1910    14
                   Annie_F     1910    12
                   Anna_F      1910    10
                   Margaret_F  1910     8
                   Helen_F     1910     7
                                       ..
            WY     Theo_M      2019     5
                   Tristan_M   2019     5
                   Vincent_M   2019     5
                   Warren_M    2019     5
                   Waylon_M    2019     5
            Name: popularity, Length: 6122890, dtype: int64
        
        
        
        This allows one to do things like getting the data for a given state only:
        
        
        ```python
        d.by_state['CA']
        ```
        
        
        
        
            name_g      year
            Mary_F      1910    295
            Helen_F     1910    239
            Dorothy_F   1910    220
            Margaret_F  1910    163
            Frances_F   1910    134
                               ... 
            Zayvion_M   2019      5
            Zeek_M      2019      5
            Zhaire_M    2019      5
            Zian_M      2019      5
            Ziyad_M     2019      5
            Name: popularity, Length: 387781, dtype: int64
        
        
        
        ... within a state, getting the 'by year popularity' for a given name:
        
        
        ```python
        d.by_state['CA']['Cora_F']  # or d.by_state['CA', 'Cora_F']
        ```
        
        
        
        
            year
            1911      8
            1912      9
            1913     15
            1914     15
            1915     17
                   ... 
            2015    269
            2016    244
            2017    284
            2018    282
            2019    256
            Name: popularity, Length: 109, dtype: int64
        
        
        
        ... if you wanted to get the data for a given name (really `name_g`), for all states, you can do it using "slicing". 
        
        For example, if you're wondering how many little boys were called "Vanessa", and more specifically, when and where?...
        
        
        ```python
        d.by_state[:, 'Vanessa_M'] 
        ```
        
        
        
        
            state  year
            AZ     1988     8
            CA     1980     7
                   1981     5
                   1982    20
                   1983    19
                   1984    14
                   1985    12
                   1986    13
                   1987    13
                   1988    26
                   1989    17
                   1990    16
                   1991    18
                   1992    17
                   1993    17
                   1994    10
                   1995     9
                   1996    10
                   1997    11
                   1998     7
            DC     1989    11
            NY     1982     5
                   1983     9
                   1986     6
                   1988     6
                   1989     6
            TX     1981     5
                   1982     7
                   1983    12
                   1984     9
                   1985    10
                   1986     8
                   1987     9
                   1988     8
                   1989     5
                   1990     6
                   1991     5
                   1992     5
                   1994     5
            Name: popularity, dtype: int64
        
        
        
        ## national data
        
        A national aggregation is available through the `national` attribute
        
        
        ```python
        d.national
        ```
        
        
        
        
            name_g      year
            Aaban_M     2013     6
                        2014     6
            Aadam_M     2019     6
            Aadan_M     2008    12
                        2009     6
                                ..
            Zyriah_F    2013     7
                        2014     6
                        2016     5
            Zyron_M     2015     5
            Zyshonne_M  1998     5
            Name: popularity, Length: 633239, dtype: int64
        
        
        
        The interface is as with the `by_state` attribute, but without the state specification.
        
        
        ```python
        d.national.loc['Vanessa_F']
        ```
        
        
        
        
            year
            1935       5
            1947      24
            1948      32
            1949      16
            1950      41
                    ... 
            2015    1687
            2016    1633
            2017    1486
            2018    1345
            2019    1188
            Name: popularity, Length: 74, dtype: int64
        
        
        
        # Plotting stuff
        
        
        ```python
        d.plot_popularity('Cora');
        ```
        
        
            
        ![png](img/output_29_0.png)
            
        
        
        
        ```python
        d.plot_popularity('Cora', 'GA');
        ```
        
        
            
        ![png](img/output_30_0.png)
            
        
        
        
        ```python
        d.plot_popularity(['Cora', 'Vanessa_F']);
        ```
        
        
            
        ![png](img/output_31_0.png)
            
        
        
        
        ```python
        d.plot_popularity('Cora', ['CA', 'GA']);
        ```
        
        
            
        ![png](img/output_32_0.png)
            
        
        
        
        ```python
        d.plot_popularity(['Cora', 'Vanessa_F'], ['CA', 'GA']);
        ```
        
        
            
        ![png](img/output_33_0.png)
            
        
        
        # Misc
        
        ## gender-ambiguous names
        
        We'll call the "femininity" of a name be the proportion of times it was used (all states, all time) to name a girl, 
        and the "masculinity" of a name be defined accordingly. 
        
        
        ```python
        d.femininity_of_name.iloc[12000:12010]
        ```
        
        
        
        
            Lemmie      0.161290
            Kashmere    0.161290
            Clary       0.162162
            Sung        0.162393
            Kyrie       0.163527
            Cedar       0.163686
            Masyn       0.163895
            Naveen      0.165605
            Chai        0.166667
            Atlee       0.167382
            dtype: float64
        
        
        
        
        ```python
        d.femininity_of_name.plot(figsize=(17, 5), ylabel='femininity');
        ```
        
        
            
        ![png](img/output_38_0.png)
            
        
        
        
        ```python
        d.masculinity_of_name.iloc[19000:19010]
        ```
        
        
        
        
            Berkley     0.108889
            Dasani      0.110092
            Sharone     0.111111
            Ifeoluwa    0.111111
            Rama        0.111111
            Scout       0.111486
            Brownie     0.111732
            Lashon      0.113158
            Indigo      0.113364
            Argie       0.113636
            dtype: float64
        
        
        
        
        ```python
        d.masculinity_of_name.plot(figsize=(17, 5), ylabel='masculinity');
        ```
        
        
            
        ![png](img/output_40_0.png)
            
        
        
        The (gender-)"ambiguity" of a name can thus be defined as twice the minimum of it's femininity and masculinity. 
        
        By defining the ambiguity thusly, we have a score that is between 0 and 1. It is maximal (1) when an equal proportion of boys and girls were named with the name. It is minimal (0) when only one gender was named with it.
        
        Note that this score is raw (or "un-smoothed"). It's computed with the raw counts, so the extreme scores will usually be for names with very low counts. 
        
        
        ```python
        d.ambiguity_of_name
        ```
        
        
        
        
            Munachiso    1.0
            Addis        1.0
            Deshone      1.0
            Gal          1.0
            Rajdeep      1.0
                        ... 
            Sharelle     0.0
            Analy        0.0
            Sharayah     0.0
            Sharaya      0.0
            Aaban        0.0
            Length: 31862, dtype: float64
        
        
        
        
        ```python
        t = d.ambiguity_of_name
        print(f"There are {len(t[t > 0])} (gender-)ambiguous names")
        ```
        
            There are 3090 (gender-)ambiguous names
        
        
        
        ```python
        t = d.ambiguity_of_name
        t[t > 0].plot(figsize=(17, 5), ylabel='gender-ambiguity');
        ```
        
        
            
        ![png](img/output_44_0.png)
            
        
        
        
        ```python
        t = list(d.ambiguous_names)
        print(f"{len(t)} (gender-)ambiguous names:")
        print(*t[:9], '...', sep=', ')
        ```
        
            3090 (gender-)ambiguous names:
            Nolie, Tyrese, Linn, Savannah, Bryn, Rei, Abby, Shilo, Tracy, ...
        
        
        
        ```python
        
        ```
        
Platform: any
Description-Content-Type: text/markdown
