
Pitfalls of Code-Less Data Science

Why you should care about a code-first approach to data science

Data scientists grapple every day with novel, complex, often vaguely defined problems that hold potential value for their organization. Before a solution can be automated, someone needs to figure out how to solve it. Complex, novel problems are most easily approached with code, for a number of reasons.

With code:

  • Flexible: No black-box constraints. Access and combine all your data, and analyze and present it exactly as you need to.
  • Iterative: Quickly make changes and updates in response to feedback, then share those updates with your stakeholders.
  • Reusable and extensible: Tackle similar problems in the future, and extend solutions to novel problems as circumstances change.
  • Valuable IP: Build a growing, valuable source of intellectual property for your organization.
  • Inspectable: Combined with version control, track changes over time, discover errors, and audit the approach.
  • Reproducible: Combined with environment and package management, ensure that you can rerun and verify your analyses.

Code also makes it possible to build tailored, customized visualizations to communicate with stakeholders. Whether it's R, Python, or another broadly used language, it's easy to see every step that was taken and to leverage the code again.

Why should I care about code-first data science?

Pitfalls of code-less data science

Difficulty tracking changes

When work is stored in files, folders, and spreadsheets, it can be difficult to keep track of how that work was done and why decisions were made or mistakes crept in.

  • Why did we make this decision in our analysis?
  • How long has this error gone unnoticed?
  • Who made this change?

Code-first fix:

Open-source version control systems like Git allow you to track what changed, when, by whom, and why.

No single source of truth

Is this the most recent [data, report, dashboard]?

  • sales-data 2020-12 final FINAL Apr 21 NR(4).xlsx
  • Where do I find __?

Code-first fix:

The right tools allow us to create a single source of truth for our data, dashboards, and models. Version control allows us to track multiple versions of our code separately without creating conflicts.

Difficulty monitoring and auditing work

  • A data breach (internal or external) occurs, and it's difficult to uncover the source


Code-first fix:

Code can be logged as it runs, supporting auditing and monitoring. Because code is explicit, and because it can be executed on a central server, it is easier to monitor, intervene in, and investigate than points, clicks, and drags.
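To make the auditing point concrete, here is a minimal Python sketch of wrapping an analysis step with logging. The log file name, the `run_analysis` function, and the dataset name are all illustrative assumptions, not details from the original.

```python
import getpass
import logging

# Minimal sketch: write an audit trail for every analysis run.
# The log file name and run_analysis() are illustrative assumptions.
logging.basicConfig(
    filename="audit.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def run_analysis(dataset_name):
    """Run a placeholder analysis step, logging who ran it, when, and on what."""
    logging.info("analysis started: dataset=%s user=%s",
                 dataset_name, getpass.getuser())
    result = {"rows_processed": 0}  # stand-in for the real computation
    logging.info("analysis finished: dataset=%s", dataset_name)
    return result

run_analysis("sales_2020_12")
```

Every run leaves a timestamped record of who executed what, which is exactly the trail that point-and-click work does not produce by default.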


Difficulty reproducing work

  • What did our model say 6 months ago?
  • Meeting regulatory requirements
  • Is our work truly portable?


Code-first fix:

Code enables reproducibility by explicitly recording every step taken. Open-source code can be deployed on many platforms, making it easy to show and reproduce work in the future.

Benefits of Code-First Data Science


No black-box constraints: access and combine all your data, and analyze and present it exactly as you need to.

You are not limited by the constraints of code-less tools, such as a fixed set of models and basic visualizations. A code-based approach provides leverage: the huge ecosystem of open-source packages in R and Python enables us to do much more with data, making analysis easier and faster, from cutting-edge machine learning models to novel, insightful data visualizations.

  • Bespoke dashboards and reports developed quickly
  • An app that runs a machine learning model written in Python for one use case can be applied or adapted to solve other challenges

Code allows you to take the pieces developed over the years and put them together in new combinations to solve new problems.

Reusability and Extensibility

Code can be copied, pasted, and modified to address novel problems as circumstances change.

Code is a valuable source of IP for your organization, increasing enterprise value. The code behind a solution can scale it by orders of magnitude without repeating the initial investment, so you continue to reap rewards from the original work.

  • A data wrangling pipeline that's reused for multiple reports
  • A custom application built for one client, expanded into a data product sold to many clients
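As a sketch of the first bullet above, a shared wrangling pipeline that feeds multiple reports might look like this in Python. It uses only the standard library, and the column names, cleaning steps, and sample rows are invented for illustration.

```python
from datetime import date

# Illustrative sketch: one shared cleaning pipeline, reused by several reports.

def parse_dates(rows):
    """Convert the 'date' column from ISO strings to date objects."""
    return [{**r, "date": date.fromisoformat(r["date"])} for r in rows]

def add_profit(rows):
    """Derive a 'profit' column from price and cost."""
    return [{**r, "profit": r["price"] - r["cost"]} for r in rows]

def pipeline(rows):
    """The single pipeline every downstream report starts from."""
    return add_profit(parse_dates(rows))

raw = [
    {"date": "2020-12-01", "model": "A", "price": 300, "cost": 180},
    {"date": "2020-12-02", "model": "S", "price": 500, "cost": 290},
]

clean = pipeline(raw)

# Two different reports built on the same pipeline output:
monthly_profit = sum(r["profit"] for r in clean)     # total profit for the month
by_model = {r["model"]: r["profit"] for r in clean}  # profit broken out by model
```

Because the cleaning logic lives in functions rather than in a spreadsheet's cells, each new report reuses `pipeline` instead of re-deriving the same columns by hand.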

Siloed vs. Centralized Data Science


Siloed data science means ...

  • Work happens on laptops and desktops
  • Data lives in spreadsheets, CSVs, and other files
  • Code is stored on hard drives, perhaps in cloud drives or file shares, or maybe not at all
  • Package use is uncontrolled, or heavily restricted (package installation is broken, need to request access)
  • Work can be difficult to find, suffering from the single-source-of-truth problem

Centralized data science means ...

  • Work happens in servers or on the cloud
  • Data lives in databases, pins, and other single sources of truth
  • Code is stored and collaborated on in version control
  • Code can be audited and monitored as it's written and run
  • IT can approve packages, users can install them
  • Stakeholders know where to find what they're looking for

Centralized data science is more secure, scalable, efficient, and reproducible. Code is easier to centralize than points and clicks.


Consider a data scientist who works for a company that produces electronic skateboards and reports to the head of sales for the North America region.

The company makes a few different models, including the budget-friendly Model A, the more advanced Model S, and the long-range premium Model X.

The data scientist is asked to prepare a report comparing the sales and profitability of each model, as well as sales among new and returning customers. The data might look something like this, just in a CSV file; ideally it would live in a database or data warehouse.

It is simply transaction data: an ID for each sale, the date, the model, the list price, the cost to manufacture and ship, the region sold in, and whether the customer is new or returning.

There are many ways to build that report. In this instance it was built in R, using RStudio Pro and R Markdown for reporting.
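The report itself was built in R with R Markdown; purely as an illustrative sketch of the core aggregation, the same computation could look like this in Python. The sample rows below are invented, not real company data.

```python
import csv
import io
from collections import defaultdict

# Invented sample of the transaction data described above.
sample_csv = """id,date,model,list_price,cost,region,customer
1,2020-12-01,Model A,299,180,NA,new
2,2020-12-02,Model S,599,350,NA,returning
3,2020-12-03,Model X,999,520,NA,new
4,2020-12-04,Model A,299,180,NA,returning
"""

sales = defaultdict(float)      # revenue per model
profit = defaultdict(float)     # profit per model
by_customer = defaultdict(int)  # sale counts for new vs. returning customers

for row in csv.DictReader(io.StringIO(sample_csv)):
    price, cost = float(row["list_price"]), float(row["cost"])
    sales[row["model"]] += price
    profit[row["model"]] += price - cost
    by_customer[row["customer"]] += 1

print(dict(profit))       # profit by model
print(dict(by_customer))  # {'new': 2, 'returning': 2}
```

In a real project the `io.StringIO` stand-in would be replaced by a file path or a database query, and the aggregates would feed the rendered report.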
