Metadata-Version: 2.3
Name: LSMS_Library
Version: 0.2.4
Summary: Abstraction layer for Living Standards Measurement Survey data
License: CC-BY-4.0
Author: Ethan Ligon
Author-email: ligon@berkeley.edu
Requires-Python: >=3.11
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Description-Content-Type: text/plain

* Streaming dvc files
   A =dvc pull= will download dvc files to your local repository.
   But this may not be the best way to proceed!  In particular, =dvc=
   offers an api which permits one to "stream" or cache files, leaving
   your storage local to the working repository free of big data
   files.

   To illustrate,
   #+begin_src python
     import dvc.api
     import pandas as pd

     with dvc.api.open('BigRemoteFile.dta',mode='rb') as dta:
         df = pd.read_stata(dta)
   #+end_src
   This will result in a =pandas.DataFrame= in RAM, but will use no
   additional disk (except that, depending on what's being used as the
   dvc store, the file may actually be stored in =.dvc/cache=; this
   cache can be cleared with =dvc gc=).

** Pulling dvc files
   If you need the actual file instead of a "stream" you can instead
   "pull" the dvc files, using
   #+begin_src sh
   dvc pull
   #+end_src
   and files should be added from the remote dvc data store to your
   working repository. 

* Adding New Data
** Additional S3 Credentials
Write access to the remote s3 repository requires additional credentials; contact =ligon@berkeley.edu= to obtain these.

** Procedure to Add Data
   To add a new LSMS-style survey to the repo, you'll follow the
   following steps.  Here we give the example of adding a 2015--16
   survey from Uganda, obtained from
   https://microdata.worldbank.org/index.php/catalog/3460.  The same
   steps should work for you /mutatis mutandis/:

  1. Create a directory corresponding to the country or area; e.g., 
     #+begin_src sh
     mkdir Uganda
     #+end_src
  2. Create a /sub/-directory indicating the time period for the
     survey; e.g., 
     #+begin_src sh
     mkdir Uganda/2015-16
     #+end_src
  3. Create a =Documentation= sub-directory for each survey; e.g.,
     #+begin_src sh
     mkdir Uganda/2015-16/Documentation
     #+end_src
     In this directory include the following files:
     - SOURCE :: A text file giving both a url (if available) and
       citation information for the dataset.
     - LICENSE :: A text file containing a description of the license
       or other terms under which you've obtained the data.
  4. Add other documentation useful for understanding the data to the
     =Documentation= sub-directory.

  5. Add all the contents of the =Documentation= folder to the =git= repo;
     e.g., 
     #+begin_src sh
     cd ./Uganda/2015-16/Documentation
     git add .
     git commit -m"Add Uganda 2015-16 documentation to repo."
     git push
     #+end_src

  6. Create a =Data= sub-directory for each survey; e.g.,
     #+begin_src sh
     mkdir Uganda/2015-16/Data
     #+end_src

  7. Obtain a copy of the data you're interested in, perhaps as a zip
     file or other archive.  Store this in some temporary place, and
     unzip (or whatever) the files into the relevant Country/Year/Data
     directory, taking care to preserve any useful directory structure
     in the archive.  E.g.,
     #+begin_src sh
     cd Uganda/2015-16 && unzip -j /tmp/UGA_2015_UNPS_v01_M_STATA8.zip
     #+end_src
  8. Add the data you've unarchived to =dvc=, then add the /pointers/
     (i.e., files with a .dvc extension to git).  For the Uganda case we assume that
     all the relevant data comes in the form of =stata= *.dta files,
     since this is what we downloaded from the World Bank.  For example,
     #+begin_src sh
     cd ../Data
     dvc add *.dta
     git commit -m"Add Uganda/2015-16/Data/*.dta files to dvc store."
     git pull && git push
     #+end_src
  9. Push the data files to the dvc store. Make sure you have good
     internet connection!  Then a simple
     #+begin_src sh
     dvc push
     #+end_src
     will copy the data to the remote data store.  NB: If this is the
     first time you've done this for this repository, then you'll
     first need to jump through some simple hoops to authenticate with
     gdrive.
  10. With the files pushed to the dvc store, you won't need them
      locally anymore, so you can do something like
      #+begin_src sh
      cd ../Data && rm *.dta
      #+end_src
      or (if you have a more complex directory structure) perhaps
      #+begin_src sh
      find ../Data -name \*.dta -exec rm \{\} \;
      #+end_src

