Author Name(s)
Scarlett Zuo (yz262), Paul Davis (prd9), Ian Dillon (isd23)
Paper Type
Data Solution
Summary of Paper
We present the standardized tool chain and practices we use to bring together data from many disparate sources. This dockerized tool chain, and the practices around it, make it easy for us to work together and enable substantial code reuse. It also lets us run the same infrastructure in the cloud for parallelization or scheduled ETLs.
Paper

A Campus Data Toolset that Maximizes Code Reuse and Minimizes Frustrations

We often bring together data from Workday, KFS, the campus directory, CU Person, CU Learn, and one or more of the seven research administration systems we run for standard reports, data feeds, or ad hoc analysis. One example of the data products we deliver is the list of people who need to report in our conflict of interest (COI) system. That list depends on a person's job title, whether they are an active employee, their COI reporting status last year, how they are funded, who sponsors their research, and several other factors; furthermore, the requirements change each year. We have many such data feeds and have standardized on a set of tools and practices so that anyone on the team can pick up the code and run it in the future with an absolute minimum of startup time. These tools and practices also support very effective code reuse among team members and make it easy to migrate the code to the cloud. The tool chain includes the following:

  • AWS Vault lets us manage AWS credentials securely.

  • Python and Pandas give us rich tools for data consumption, transformation, merging, and analysis.

  • We develop standardized data extract libraries for all the campus data repositories we use so we don't have to reinvent the wheel when we want to use the data for something else (see the sketch after this list).

  • Jupyter Notebooks allow us to do incremental development, data analysis, and visualization.

  • GitHub allows us to manage code and dataset versioning and to do code review.

  • Docker containers allow what we develop on the desktop to run unchanged in the cloud, to scale out on a cluster, or to run for another team member who needs to use the code.

  • Makefiles check out, build, and launch our development environment and create a standard project directory structure.
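
As a rough illustration of what one of these extract libraries looks like, the sketch below shows the shape of a hypothetical module. The module layout, function name, environment variables, JSON shape, and column names are illustrative assumptions, not our actual interfaces; the point is that every campus repository gets a small module whose functions return tidy Pandas DataFrames.

    # Hypothetical extract module; every campus repository gets one of these,
    # so downstream code never needs to know how the source system is queried.
    import os

    import pandas as pd
    import requests


    def active_employees(as_of=None):
        """Return one row per active employee (netid, job_title, department).

        The report URL and credentials come from the environment; the JSON
        shape and column names here are assumptions for the example.
        """
        params = {"format": "json"}
        if as_of is not None:
            params["as_of"] = as_of
        resp = requests.get(
            os.environ["EMPLOYEE_REPORT_URL"],
            params=params,
            auth=(os.environ["REPORT_USER"], os.environ["REPORT_PASS"]),
            timeout=60,
        )
        resp.raise_for_status()
        df = pd.DataFrame(resp.json())
        return df[["netid", "job_title", "department"]]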

We will demonstrate this environment by walking through a sample Jupyter Notebook that uses Python and Pandas to pull and merge data from several different sources and produce a polished visualization.
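
The flavor of that walkthrough is roughly the following, condensed into one block; the library, function, and column names (and the simple reporting rule) are hypothetical stand-ins for the COI example described above, and the real notebook breaks this into cells with commentary and intermediate output.

    # Condensed sketch of the demo notebook: pull two extracts, merge them,
    # apply a simple rule, and plot the result.
    import matplotlib.pyplot as plt
    import pandas as pd

    from datalib import workday, coi            # hypothetical extract libraries

    employees = workday.active_employees()      # netid, job_title, department
    last_year = coi.reporting_status()          # prior year: netid, reported

    merged = employees.merge(last_year, on="netid", how="left")
    merged["reported"] = merged["reported"].fillna(False)

    # Example rule: faculty titles, plus anyone who reported last year.
    must_report = merged[
        merged["job_title"].str.contains("Professor", na=False) | merged["reported"]
    ]

    # Quick visualization of reporters by department.
    must_report.groupby("department").size().sort_values().plot(kind="barh")
    plt.title("COI reporters by department")
    plt.tight_layout()
    plt.show()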

A separate talk will go under the hood to show how we set up the environment; it is suitable for those who want to know more about this setup or how to apply a similar approach to other tool chains. That talk will also go into more depth on running the container in the AWS cloud.

We will provide the GitHub repository needed to install all of this on your desktop in minutes, complete with the Cornell data libraries we have developed (you supply the credentials) and samples you can run against public datasets.