In this short article I set out a challenge to the project data analytics community!


Digital construction, the construction of software, is the mirror counterpart to physical construction.

Software drives the modern world. Developing, testing and delivering software on time and to the desired quality is a major challenge for organisations worldwide, and for projects in particular.

When developing the business case for, tendering for, and mobilising projects, we must seek answers to many questions, such as:

  • What proportion of the developed system will be software?
  • What language might we choose to write it in?
  • How might the code be structured?
  • What size will the software code base be?
  • How long will the software take to write?
  • How many defects will there be, and of these, how many need to be fixed before we can release?
  • How many people need to work on it?
  • What range of skills will be needed?
  • What dependencies will it have on existing systems?
  • How often do we need to update the software?

All these questions can be answered using data.


Historically, the organisational data that could answer these questions sat in silos, and those silos normally evaporated when the project closed. Occasionally the data would be shared across the organisation, with varying degrees of success.

Now, the software world has been transformed beyond all recognition. We have enormous repositories of open source software, and GitHub is one of them. It is bigger than any other data set I have encountered.

At the time of writing, GitHub claims:

  • 56+ million Developers
  • 3+ million Organizations
  • 100+ million Repositories

This is an enormous data set, and provides a great opportunity for us to exploit data analytics and data science to price up software projects, and steer them to success.


GitHub can be regarded as a huge database. Not only does it hold a huge number of repositories, but each repository is itself a Git distributed version control database. Each Git repository tracks changes to files, who made them, when, on which branch, and the flow of those changes through the repository.
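
To make that concrete, here is a minimal sketch of how the "who" and "when" could be pulled out of a single repository. It assumes Git is installed and that the history is read with `git log --pretty=format:"%aI|%an"` (author date in ISO 8601, then author name). The function names `read_log` and `summarise_log` are my own illustrative choices, not part of any existing tool.

```python
import subprocess
from collections import Counter
from datetime import datetime


def read_log(repo_path):
    """Return one 'ISO date|author name' line per commit in a local clone."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%aI|%an"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def summarise_log(log_text):
    """Count commits per author and find the first and last commit dates."""
    commits_per_author = Counter()
    dates = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        # The ISO 8601 date never contains "|", so split on the first one only.
        iso_date, author = line.split("|", 1)
        dates.append(datetime.fromisoformat(iso_date))
        commits_per_author[author] += 1
    if not dates:
        return {"commits": 0, "authors": {}, "first": None, "last": None}
    return {
        "commits": len(dates),
        "authors": dict(commits_per_author),
        "first": min(dates),
        "last": max(dates),
    }
```

Calling `summarise_log(read_log("path/to/clone"))` on a local clone gives a first, rough view of team size and project duration. Commit counts are only a proxy for effort, so treat the numbers as a starting point rather than a measurement.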


Develop a tool (to be made available in a public GitHub repository, of course!) that can interrogate the GitHub data set to answer project estimating and forecasting questions.

For example, imagine this scenario:

“I need to develop a fitness and activity tracking app. Tell me what’s already out there, what their target platforms are (Android, Linux, iOS, Windows), what languages those apps have been written in (C, Python, etc.).

For each app, give me a breakdown of how many lines of code they contain, how long they took to write, how many bug fixes there have been, and how actively they are maintained. Estimate some upper and lower bounds on the cost of development, based on the number of people that have worked on them.”
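
One possible way to bound the cost in that scenario is sketched below. It is an assumption of mine, not an established estimating method: the lower bound supposes each contributor worked only on the days they committed, and the upper bound supposes every contributor worked every calendar day between the first and last commit. The function names, the input shape and the day rate are all hypothetical.

```python
from datetime import date


def effort_bounds(active_days_by_person, first_commit, last_commit):
    """Return (lower, upper) effort in person-days.

    Lower bound: each person worked only on the days they committed.
    Upper bound: each person worked every calendar day of the project span.
    """
    span_days = (last_commit - first_commit).days + 1
    lower = sum(active_days_by_person.values())
    upper = span_days * len(active_days_by_person)
    return lower, upper


def cost_bounds(effort, day_rate):
    """Convert (lower, upper) person-days into (lower, upper) cost."""
    lower, upper = effort
    return lower * day_rate, upper * day_rate


# Illustrative figures only: two contributors over a 100-day project.
effort = effort_bounds(
    {"alice": 40, "bob": 10},
    first_commit=date(2021, 1, 1),
    last_commit=date(2021, 4, 10),
)
low_cost, high_cost = cost_bounds(effort, day_rate=500)  # hypothetical rate
```

The gap between the two bounds will usually be wide, and the upper bound counts weekends and holidays. Narrowing that gap with better evidence is exactly the kind of analysis the challenge is asking for.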


As well as statistics, develop visualisations that help non-technical staff involved in pricing up work understand what the data is telling them, so that we can get better at figuring out the metrics that drive successful projects.

Getting Started

To get started on this challenge, explore the data accessible through GitHub.

There are comprehensive search facilities. The public repositories can be searched directly on the GitHub website, or programmatically through the GitHub API.

For a useful guide to programmatic interfaces using Python, gh, and other tools, check out this Stack Overflow article.
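
As a starting point, the sketch below builds a query for GitHub's REST repository search endpoint (`/search/repositories`) and reduces the JSON response to a few fields useful for estimating. The helper names and the choice of fields are my own assumptions. Unauthenticated requests are rate limited, so a real tool would want to authenticate and handle paging.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(keywords, language=None, per_page=10):
    """Build a repository search URL, optionally filtered by language."""
    query = keywords
    if language:
        query += f" language:{language}"
    params = {"q": query, "sort": "stars", "order": "desc", "per_page": per_page}
    return f"{SEARCH_URL}?{urlencode(params)}"


def summarise_results(payload):
    """Keep only the fields of each search hit that matter for a first look."""
    return [
        {
            "name": item["full_name"],
            "language": item.get("language"),
            "stars": item.get("stargazers_count", 0),
            "created": item.get("created_at"),
            "last_push": item.get("pushed_at"),
        }
        for item in payload.get("items", [])
    ]


def search(keywords, language=None):
    """Run the search over the network and return the summarised hits."""
    request = Request(
        build_search_url(keywords, language),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urlopen(request, timeout=30) as response:
        return summarise_results(json.load(response))
```

For the scenario above, `search("fitness tracker", language="kotlin")` would return candidate repositories to clone and analyse further.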

Load the data in, run some analytics, and address the challenge in an easy-to-use way!

Get in touch

If you’d like to discuss this challenge in more detail (including some detailed criteria, and how the analytics could be developed), reach out.