Data scientists grapple every day with novel, complex, often vaguely-defined problems with potential value to their organization Before the solution can be automated, someone needs to figure out how to solve it. Complex, novel problems are most easily approached with code. For a number of reasons
Tailored customized visualizations to communicate to stakeholders - Whether it’s R, Python or another broadly used code it’s easy to see every step made and leverage code.
Why should I care about code first data science?
When stored in files folders and spread sheets can be difficult to keep track of how that work was done and why decisions and or mistakes were made
Version control open source systems like git allow tracking what changed, when, by whom, and why.
The Single Source of Truth
Is this the most recent [data, report, dashboard]?
The right tools allow us to create a single source of truth for our data, dashboards, and models. Version control allows us to track multiple versions of our code separately without creating conflicts.
Difficulty monitoring and auditing work
Code can be logged when run for auditing and monitoring. Code is explicit executed in a central server makes it easier to monitor, intervene and investigate than point clicks and drags are.
Difficulty reproducing work
Code enables reproducibility by explicitly recording every step taken. Open-source code can be deployed on many platforms. Easy to show and reproduce work in the future
No black box constraints.Access and combine all your data Analyze and present exactly as you need to.
Not limited to code less tool constraints. Limited models, basic visualizations. Code based approach provides leverage to make use of huge amounts of open source packages in R and Python enabling us to do much more with data for easier and faster data analysis from cutting edge machine modeling to novel and insightful data visualizations.
Code allows you to take the pieces developed over the years and put them together in new combinations to solve new problems.
Code can be copied, pasted, and modified to address novel problems as circumstances change.
Code is a valuable source of IP for your organization, increasing enterprise value. The code behind a solution can be used for scaling by orders of magnitude without having to put in the same initial investment and continue to reap rewards from initial work.
Siloed data science means ...
Centralized data science means ...
Centralized data science is more secure, scalable, efficient, and reproducible. Code is easier to centralize than points and clicks
A data scientist works for a company that produces electronic skateboards reports into head of sales for North America region.
The company makes a few different models including budget friendly Model A, more advanced Model S and long-range premium Model X
The data scientist is asked to prepare a report comparing sales and profitability of each model and also sales among new and returning customers. Data might look something like this, just in a CSV file.
Ideally it might live in a database in a data warehouse.
Just some transaction data; ID, each sale, the date, model, list price, cost to manufacture and shipping, region sold in and new or returning customer.
There are many ways you can build that report. In this instance it was decided to build in R using R Studio Pro and R markdown package reporting