Metadata-Version: 2.1
Name: HCSC
Version: 0.0.16
Summary: HCSC is a Python package developed as part of an interview process.
Home-page: https://github.com/sandeepkirangudla/HCSC
Author: Sandeep Kiran Gudla
Author-email: gsandeepkiran@gmail.com
License: NA
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Intended Audience :: Developers
Requires-Python: >=3.5
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: numpy


# Covid - 19 Daily Cumulative Statistics  

##### This project was developed for the HCSC Machine Learning Engineer position.   
As part of HCSC's COVID-19 response, the Data Science team needs to prepare daily/weekly updates of nationwide infection counts, organized by county. We use the numeric FIPS code (https://en.wikipedia.org/wiki/FIPS_county_code) rather than  
state and county names to serve our results.  

For every FIPS code and date, the program generates: population, daily cases, daily deaths, cumulative cases to date, and cumulative deaths to date.  
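The per-county cumulative fields can be derived from the daily counts with a grouped running sum. The sketch below is illustrative, not the package's actual code; the FIPS codes and counts are made-up sample values:

```python
import pandas as pd

# Hypothetical daily counts for two counties (values are illustrative).
daily = pd.DataFrame({
    "fips": ["17031", "17031", "06037"],
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-01"]),
    "cases": [2, 3, 5],
    "deaths": [0, 1, 2],
})

# Cumulative-to-date totals per county, computed from the daily values.
daily = daily.sort_values(["fips", "date"])
daily["cum_cases"] = daily.groupby("fips")["cases"].cumsum()
daily["cum_deaths"] = daily.groupby("fips")["deaths"].cumsum()
```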

## Citations  
The data is supplied by [New York Times](https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html).  

For details on the data extraction, please refer to https://github.com/nytimes/covid-19-data  

# Program Execution  
The goal of the project is to generate daily/weekly updates of nationwide infection counts, organized by county. Below is the step-by-step process for executing this program.  
The user installs the *HCSC* library from pip by running  
(<b>pip install HCSC</b>). Running the program then opens a GUI in which the user has to provide the  

 **Output Folder Path**  

## Data Files  

As a part of this project, there are two CSV files, provided by the [New York Times](https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv) and the [US Census](https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2019-alldata.csv). The path of the output file directory is given by the user.  

## Libraries  
Below are the libraries used in this project.  

 - pandas  
 - numpy  
 - os  
 - datetime  

## Project Files & Folders  

 <ul>  
   <li><b>HCSC</b></li>  
   <p>This folder has the <i>__init__.py</i> file required to initialize the package and program</p>  
   <li><b>config.py</b></li>  
   <p>This file contains the initial configuration settings, such as paths</p>  
   <li><b>LICENSE</b></li>  
   <p>This is an MIT license</p>  
   <li><b>setup.py</b></li>  
   <p>This is the setup file required by Python to package and distribute the code. It contains the detailed description and specifications.</p>  
   <li><b>data_process.py</b></li>  
   <p>This file has all the classes and functions required to pre-process the data.</p>  
   <li><b>data_clean.py</b></li>  
   <p>This file has all the classes and functions required to clean the data.</p>  
   <li><b>IO_path.py</b></li>  
   <p>This file has all the functions required to set the output and input paths.</p>  
   <li><b>merge.py</b></li>  
   <p>This file has all the functions required to merge the data into a final output on which we can summarize.</p>  
   <li><b>summary_stats.py</b></li>  
   <p>This file has all the classes and functions required to generate the summary output to the desired location.</p>  
   <li><b>HCSC.py</b></li>  
   <p>This is the main file of the project. The user runs this file, which takes the input path and files and generates the summary table in the given output path.</p>  
</ul>  

## Data Dictionary  
### `covid`  
| Variable |Class  | Description|  
|--|--|--|  
|date  |date  |Reporting date (ymd)|  
| County| factor | US County Names |   
| State| factor | US State Names |   
| FIPS| factor | US FIPS code|   
|Cases|    integer|Covid Cases reported per day|  
|Deaths|   integer|Covid Deaths reported per day|  


### `population`  
We are extracting only the required columns from the US Census data.  
| Variable |Class  | Description|  
|--|--|--|  
| STATE| factor | US State FIPS ID|   
|County|   factor|    US County FIPS ID|  
|POPESTIMATE2019|  integer|   US population estimate|  
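Column selection of this kind is commonly done at read time with pandas' `usecols`. This is a minimal sketch, not the package's actual code; the in-memory CSV stands in for the Census file:

```python
import io
import pandas as pd

# A tiny stand-in for the Census file, including an extra column to drop.
raw = io.StringIO("STATE,COUNTY,POPESTIMATE2019,CTYNAME\n17,31,5150233,Cook County\n")

# Keep only the three columns the project needs.
pop = pd.read_csv(raw, usecols=["STATE", "COUNTY", "POPESTIMATE2019"])
```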

## Data Cleaning and Preprocessing  
Below are the steps used to clean and preprocess the data.  

### 1. Reading the Data  
The paths to the input files are given in *config.py*. These files are read using pandas for analysis purposes.  

### 2. Cleaning the Data Files  
The *Data_Process* class has all the functions necessary to clean the data.  

Below are the steps used to clean the data file.  
  1. #### Cleaning and Mapping Columns  
     <p>A column dictionary is used to map the column names correctly, which helps standardize them.</p>  

  2. #### Standardizing the Dates  
     <p>As a best practice, it is always recommended to standardize <i>Date</i> columns.</p>  

  3. #### Sorting by Dates  
     <p>As a best practice, it is always recommended to sort data by <i>Date</i> columns.</p>  

  4. #### Standardizing the FIPS Columns  
     1. <p>Population: State_ID and County_ID are concatenated to generate the FIPS code in the population data, so that it can be joined with the daily covid data.</p>  
     2. <p>Covid: Empty and unknown FIPS IDs are filled with a default value to standardize the column.</p>  
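The cleaning steps above can be sketched in pandas as follows. The column map, sample rows, and the `"00000"` default FIPS are illustrative assumptions, not the package's actual values:

```python
import pandas as pd

# Hypothetical map standardizing raw covid column names (names illustrative).
COLUMN_MAP = {"date": "Date", "county": "County", "state": "State",
              "fips": "FIPS", "cases": "Cases", "deaths": "Deaths"}

covid = pd.DataFrame({
    "date": ["2020-03-02", "2020-03-01"],
    "county": ["Cook", "Unknown"],
    "state": ["Illinois", "Illinois"],
    "fips": ["17031", None],
    "cases": [3, 1],
    "deaths": [1, 0],
})

covid = covid.rename(columns=COLUMN_MAP)       # step 1: map column names
covid["Date"] = pd.to_datetime(covid["Date"])  # step 2: standardize dates
covid = covid.sort_values("Date")              # step 3: sort by date
covid["FIPS"] = covid["FIPS"].fillna("00000")  # step 4.2: default for unknown FIPS

# Step 4.1: in the population data, a 5-digit FIPS is built from the
# 2-digit state ID and the 3-digit county ID.
pop = pd.DataFrame({"STATE": [17], "COUNTY": [31]})
pop["FIPS"] = (pop["STATE"].astype(str).str.zfill(2)
               + pop["COUNTY"].astype(str).str.zfill(3))
```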

## Merging the Data Frames  
After the data preprocessing and cleaning, we obtain clean files that we can merge. <i>merge.final_merge</i> takes in two data frames and outputs one final data frame on which we can do our analysis.  
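Joining the two frames on the standardized FIPS code might look like the sketch below; this is an assumption about the merge, not <i>merge.final_merge</i> itself, and the rows are sample values:

```python
import pandas as pd

covid = pd.DataFrame({"FIPS": ["17031"], "Date": ["2020-03-01"],
                      "Cases": [2], "Deaths": [0]})
pop = pd.DataFrame({"FIPS": ["17031"], "POPESTIMATE2019": [5150233]})

# Left-join so every daily covid record keeps its county population.
final = covid.merge(pop, on="FIPS", how="left")
```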

## Generating Summary File and Plots  
The final step is to generate the result. <i>summary_stats.SummaryStats.summarize</i> generates the summary file as a CSV because CSV is easy to interpret and run custom analyses on.  
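A minimal sketch of producing that CSV, assuming the merged frame from the previous step (column names and values are illustrative, not the package's actual schema):

```python
import pandas as pd

final = pd.DataFrame({
    "FIPS": ["17031", "17031"],
    "Date": ["2020-03-01", "2020-03-02"],
    "POPESTIMATE2019": [5150233, 5150233],
    "Cases": [2, 3],
    "Deaths": [0, 1],
})

# Add the cumulative-to-date columns per county, then serialize as CSV.
final["Cumulative_Cases"] = final.groupby("FIPS")["Cases"].cumsum()
final["Cumulative_Deaths"] = final.groupby("FIPS")["Deaths"].cumsum()
summary_csv = final.to_csv(index=False)  # HCSC writes this to the user-chosen output folder
```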

## Future Edition  
### Interactive Plots  
We can include interactive plots using pyplot, which would help the end user analyze the data much more efficiently.

