Introduction

Digital construction, the construction of software, is the mirror counterpart of physical construction.

Software drives the modern world. Developing, testing and delivering software on time, to the desired quality, is a major challenge for organisations worldwide, and for projects in particular.

When developing the business case for, tendering for, and mobilising projects, we must seek answers to many questions, such as:

  • What proportion of the developed system will be software?
  • What language might we choose to write it in?
  • How might the code be structured?
  • What size will the software code base be?
  • How long will the software take to write?
  • How many defects will there be, and of these, how many need to be fixed before we can release?
  • How many people need to work on it?
  • What range of skills will be needed?
  • What dependencies will it have on existing systems?
  • How often do we need to update the software?

All these questions can be answered using data.

Opportunity

Historically, the organisational data that could answer these questions sat in silos, and those silos normally evaporated when the project closed. Occasionally the data was shared across the organisation, with varying degrees of success.

Now, the software world has been transformed beyond all recognition. We have enormous repositories of open source software, and GitHub is one of them. It is bigger than any other data set I have encountered.

At the time of writing, GitHub claims:

  • 56+ million Developers
  • 3+ million Organizations
  • 100+ million Repositories

A data set of this size gives us a great opportunity to use data analytics and data science to price software projects and steer them to success.

Database

GitHub can be regarded as a huge database. Not only does it hold a huge number of repositories, but each repository is itself a Git distributed version control database. Each Git repository tracks changes to files, who made them, when, on which branch, and the flow of those changes through the repository.
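As an illustration of treating a single repository as a database, the sketch below reads the commit history of a local clone and counts commits per author and the span of days between the first and last commit. It is a minimal sketch, not a definitive implementation: the function names are made up for this example, it assumes git is installed and on the PATH, and the sample history at the bottom is invented so the parsing can be tried without a clone.

```python
import subprocess
from collections import Counter
from datetime import date

# One commit per line: hash, author name, author date.
# "%x1f" asks git to emit the ASCII unit separator between fields.
LOG_FORMAT = "%H%x1f%an%x1f%ad"


def read_git_log(repo_path):
    """Run `git log` in a local clone and return its raw text output."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log",
         f"--pretty=format:{LOG_FORMAT}", "--date=short"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_git_log(text):
    """Turn the raw log text into a list of dicts, one per commit."""
    commits = []
    for line in text.split("\n"):
        if not line.strip():
            continue
        sha, author, day = line.split("\x1f")
        commits.append({"sha": sha, "author": author,
                        "date": date.fromisoformat(day)})
    return commits


def commits_per_author(commits):
    """Count how many commits each author made."""
    return Counter(commit["author"] for commit in commits)


def history_span_days(commits):
    """Days between the earliest and latest commit (0 if empty)."""
    if not commits:
        return 0
    days = [commit["date"] for commit in commits]
    return (max(days) - min(days)).days


# Invented sample history, standing in for real `git log` output.
SAMPLE_LOG = (
    "a1\x1fAlice\x1f2021-02-01\n"
    "b2\x1fBob\x1f2021-01-05\n"
    "c3\x1fAlice\x1f2021-01-04"
)
sample_commits = parse_git_log(SAMPLE_LOG)
print(commits_per_author(sample_commits))
print(history_span_days(sample_commits))
```

Against a real clone, the same functions could be chained as parse_git_log(read_git_log("linux-git")). Note that a shallow clone made with --depth 1 carries only the latest commit, so history-based measures need a full clone.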

Worked example

This Stack Overflow page describes how to get the number of lines of code from a locally cloned GitHub repository.

Challenge 1 - Estimating Tool

Develop a tool (to be made available on a public GitHub repository, of course!) that can interrogate the GitHub data set to answer project estimating and forecasting questions.

For example, imagine a scenario where a business owner has identified a market opportunity to develop a novel fitness and activity tracking application.

They may ask their sales and engineering teams this question:

“We’re planning to develop a fitness and activity tracking application, but we’ve not developed one before. Go find out what’s already out there, and get some data that helps us gauge the market potential.

I want to know what their target platforms are (Android, Linux, iOS, Windows), and what languages those apps have been written in (C, Python, etc.).

For each application, give me a breakdown of how many lines of code they have, how long they took to write, how many bug fixes there have been, and how actively they are maintained. Estimate some upper and lower bounds on the cost of development, based on the number of people that have worked on them.”
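The last request in that brief, cost bounds based on headcount, could be approached in many ways. The sketch below is one deliberately simple option and not an established estimating model: it multiplies contributors by project duration to get person-months, then scales by an assumed lowest and highest share of time that each contributor spent on the project. Every parameter name, the default shares, and the example figures are assumptions made up for illustration.

```python
def cost_bounds(contributors, duration_months, monthly_rate,
                min_share=0.25, max_share=1.0):
    """Return (lower, upper) cost bounds in the currency of monthly_rate.

    contributors     number of people who committed to the project
    duration_months  months between first and last commit
    monthly_rate     assumed cost of one person for one month
    min_share        assumed smallest fraction of time spent per person
    max_share        assumed largest fraction of time spent per person
    """
    if min(contributors, duration_months, monthly_rate) < 0:
        raise ValueError("inputs must not be negative")
    if not 0 <= min_share <= max_share <= 1:
        raise ValueError("shares must satisfy 0 <= min <= max <= 1")

    person_months = contributors * duration_months
    lower = person_months * min_share * monthly_rate
    upper = person_months * max_share * monthly_rate
    return lower, upper


# Illustrative figures only: 5 contributors over 24 months.
low, high = cost_bounds(contributors=5, duration_months=24,
                        monthly_rate=8000)
print(f"between {low:,.0f} and {high:,.0f}")
```

The bounds are only as good as the assumed shares. Open source contributors often work part time, so the gap between the lower and upper figure can be wide, and that width is itself useful information when pricing.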

Worked Example - Linux kernel

$ git clone --depth 1 git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux-git

$ cloc linux-git

72137 text files.
71680 unique files.
10738 files ignored.

github.com/AlDanial/cloc v 1.82 T=135.11 s (454.7 files/s, 215984.8 lines/s)

-----------------------------------------------------------------
Language                     files     blank   comment       code
-----------------------------------------------------------------
C                            29973   3000212   2418070   15289301
C/C++ Header                 21651    611387   1092333    5154346
reStructuredText              2813    139617     57436     386201
Assembly                      1273     45792     97779     222246
JSON                           329         2         0     171946
YAML                          1682     29146      7665     130314
Bourne Shell                   727     20691     13933      80611
make                          2624     10093     11133      45436
SVG                             59        78      1159      37555
Perl                            63      7073      4928      35728
Python                         125      5545      5013      28449
yacc                             9       693       355       4761
PO File                          5       791       918       3077
lex                              9       345       303       2103
C++                             10       349       138       1935
Bourne Again Shell              51       297       247       1304
awk                             10       149       119       1084
Glade                            1        58         0        603
NAnt script                      2       143         0        564
Cucumber                         1        30        50        183
Windows Module Definition        2        15         0        109
m4                               1        15         1         95
CSS                              1        28        29         80
XSLT                             5        13        26         61
vim script                       1         3        12         27
Ruby                             1         4         0         25
INI                              1         1         0          6
sed                              1         2         5          5
-----------------------------------------------------------------
SUM:                         61430   3872572   3711652   21598155
-----------------------------------------------------------------
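For a tool, the same counts are easier to consume in machine-readable form. cloc has a --json option, and the sketch below turns such a report into each language's share of the code lines. It is a sketch under stated assumptions: the function name is made up, the JSON layout (one object per language with nFiles, blank, comment and code keys, plus header and SUM entries) is my understanding of what cloc emits, and the sample is a hand-written excerpt using three rows of the figures above rather than real cloc output.

```python
import json


def code_share_by_language(cloc_json_text):
    """Map each language to its fraction of the total code lines."""
    report = json.loads(cloc_json_text)
    total = report["SUM"]["code"]
    if total == 0:
        return {}
    shares = {}
    for language, counts in report.items():
        # "header" holds run metadata and "SUM" the totals; skip both.
        if language in ("header", "SUM"):
            continue
        shares[language] = counts["code"] / total
    return shares


# Hand-written excerpt; figures copied from the kernel run above.
SAMPLE_REPORT = json.dumps({
    "header": {"cloc_version": "1.82"},
    "C": {"nFiles": 29973, "blank": 3000212,
          "comment": 2418070, "code": 15289301},
    "C/C++ Header": {"nFiles": 21651, "blank": 611387,
                     "comment": 1092333, "code": 5154346},
    "SUM": {"nFiles": 61430, "blank": 3872572,
            "comment": 3711652, "code": 21598155},
})

kernel_shares = code_share_by_language(SAMPLE_REPORT)
for language, share in kernel_shares.items():
    print(f"{language}: {share:.1%}")
```

Shares are taken against the SUM entry rather than the rows present, so an excerpt like this one will not add up to 100%.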

Visualisations

As well as statistics, develop visualisations that help non-technical staff involved in pricing up work understand what the data is telling them, so that we can get better at figuring out the metrics that drive successful projects.
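As a starting point before reaching for a plotting library, a bar chart can be drawn in plain text. The sketch below is only that, a sketch: the function name and bar width are arbitrary choices, and the three values are code-line counts taken from the Linux kernel example.

```python
def text_bar_chart(values, width=40):
    """Return a text chart, largest value first, bars scaled to width."""
    if not values:
        return ""
    largest = max(values.values())
    label_width = max(len(label) for label in values)
    lines = []
    ordered = sorted(values.items(), key=lambda item: item[1],
                     reverse=True)
    for label, value in ordered:
        # Scale each bar relative to the largest value in the set.
        bar_length = round(width * value / largest) if largest else 0
        bar = "#" * bar_length
        lines.append(f"{label:<{label_width}} | {bar} {value:,}")
    return "\n".join(lines)


code_lines = {
    "reStructuredText": 386201,
    "C": 15289301,
    "C/C++ Header": 5154346,
}
chart = text_bar_chart(code_lines)
print(chart)
```

A chart like this makes the dominance of one language obvious at a glance, which is the kind of message a non-technical reader needs from the raw counts.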

Getting Started

To get started on this challenge, explore the data accessible through GitHub.

There are comprehensive search facilities. The public repositories can be searched directly:

https://github.com/search/advanced

or through an API:

https://docs.github.com/en/rest/reference/repos
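As a rough idea of what an API query could look like, the sketch below builds a request for the repository search endpoint of the GitHub REST API and reduces the response to the few fields relevant here. It is a minimal sketch: the function names are made up, the query string is only an example, and the fake response at the bottom uses invented repository names so the summarising step can be tried offline. Unauthenticated requests are rate limited, so a real tool would probably need a token.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(query, per_page=10):
    """Build a repository search URL, most-starred results first."""
    params = {"q": query, "sort": "stars", "order": "desc",
              "per_page": per_page}
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def summarise(payload):
    """Keep only name, main language and star count per repository."""
    return [
        {
            "name": item["full_name"],
            "language": item.get("language"),
            "stars": item.get("stargazers_count", 0),
        }
        for item in payload.get("items", [])
    ]


def search_repositories(query, per_page=10):
    """Call the API over the network and summarise the results."""
    request = urllib.request.Request(
        build_search_url(query, per_page),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return summarise(json.load(response))


# Invented response, shaped like a search result, for offline use.
FAKE_PAYLOAD = {
    "total_count": 2,
    "items": [
        {"full_name": "example/fit-app", "language": "Kotlin",
         "stargazers_count": 120},
        {"full_name": "example/step-counter", "language": None,
         "stargazers_count": 7},
    ],
}
example_url = build_search_url("fitness tracker language:python")
example_summary = summarise(FAKE_PAYLOAD)
print(example_url)
print(example_summary)
```

A live call would be search_repositories("fitness tracker"), after which each repository could be cloned and measured as in the Linux kernel example.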

For a useful guide to programmatic interfaces using Python, gh, and other tools, see this Stack Overflow article.

Load the data in, run some analytics, and address the challenge in an easy-to-use way!

Get in touch

If you’d like to discuss this challenge in more detail (including some detailed criteria, and how the analytics could be developed), reach out.