Predicting software projects
By James Lea
Introduction
Digital construction – the construction of software – is the mirror counterpart to physical construction.
Software drives the modern world. Developing, testing and delivering software on time, to the desired quality, is a major challenge to organisations worldwide, and to projects in particular.
When developing the business case, tendering for, and mobilising projects we must seek answers to many questions, such as:
- What proportion of the developed system will be software?
- What language might we choose to write it in?
- How might the code be structured?
- What size will the software code base be?
- How long will the software take to write?
- How many defects will there be, and of these, how many need to be fixed before we can release?
- How many people need to work on it?
- What range of skills will be needed?
- What dependencies will it have on existing systems?
- How often do we need to update the software?
All these questions can be answered using data.
Opportunity
Historically, organisation data that could answer these questions sat in silos – silos that normally evaporated when the project closed. Occasionally the data was shared across the organisation, with varying degrees of success.
Now, the software world has been transformed beyond all recognition. We have enormous repositories of open source software, and GitHub is one of the largest. This data set is bigger than any other I have encountered.
As of writing, GitHub claims:
- 56+ million developers
- 3+ million organisations
- 100+ million repositories
This is an enormous data set, and provides a great opportunity for us to exploit data analytics and data science to price up software projects, and steer them to success.
Database
GitHub can be regarded as a huge database. Not only does it host a huge number of repositories, but each repository is itself a git distributed version control database. Each git repository tracks changes to files: who made them, when, on which branch, and how those changes flowed through the repository.
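That commit metadata can be pulled out of any local clone with `git log` and analysed programmatically. A minimal Python sketch – the `author|date|hash` record format here is my own choice of `--pretty` string, not anything git or GitHub prescribes:

```python
import subprocess
from collections import Counter

def repo_log(path: str) -> str:
    """Dump a local clone's history, one 'author|date|hash' record per line."""
    return subprocess.run(
        ["git", "-C", path, "log", "--pretty=format:%an|%ad|%H"],
        capture_output=True, text=True, check=True,
    ).stdout

def commits_per_author(log_text: str) -> Counter:
    """Count commits per author from the formatted log text."""
    counts = Counter()
    for line in log_text.splitlines():
        if line.strip():
            author, _date, _sha = line.split("|", 2)
            counts[author] += 1
    return counts

# Demo on a canned log, so the sketch runs without a repository to hand.
sample = "Alice|Mon Jan 4|abc123\nBob|Tue Jan 5|def456\nAlice|Wed Jan 6|789abc"
print(commits_per_author(sample))  # Counter({'Alice': 2, 'Bob': 1})
```

The same parsed records also answer "when" and "how actively maintained" questions: group by date instead of author.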
Worked example
This Stack Overflow page describes how to count the lines of code in a locally cloned GitHub repository.
Challenge 1 - Estimating Tool
Develop a tool (to be made available on a github public repository of course!) that can interrogate the github data set to answer project estimating and forecasting questions.
For example, imagine a scenario where a business owner has identified a market opportunity, to develop a novel fitness and activity tracking application.
They may ask their sales and engineering teams this question:
“We’re planning to develop a fitness and activity tracking application, but we’ve not developed one before. Go find out what’s already out there, and get some data that helps us gauge the market potential.
I want to know what their target platforms are (Android, Linux, iOS, Windows), what languages those apps have been written in (C, Python, etc.).
For each application, give me a breakdown of how many lines of code they are, how long they took to write, how many bug fixes there have been, and how actively they are maintained. Estimate some upper and lower bounds on the cost of development, based on the number of people that have worked on them.”
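The cost-bound part of that brief can be roughed out with simple contributor-month arithmetic. A hedged sketch – the monthly rates and the "quarter of contributors were full-time" assumption below are placeholder illustrations, not industry figures:

```python
def cost_bounds(contributors: int, months: float,
                low_rate: float = 4_000.0, high_rate: float = 12_000.0):
    """Crude lower/upper cost bounds in currency units.

    Lower bound: assume only a quarter of listed contributors were
    effectively full-time. Upper bound: assume all of them were.
    The per-person monthly rates are illustrative placeholders.
    """
    person_months_low = max(1, contributors // 4) * months
    person_months_high = contributors * months
    return person_months_low * low_rate, person_months_high * high_rate

low, high = cost_bounds(contributors=20, months=6)
print(f"between {low:,.0f} and {high:,.0f}")  # between 120,000 and 1,440,000
```

The wide spread is deliberate: contributor counts on public repositories include drive-by committers, so the bounds should be calibrated against projects whose true costs you know.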
Worked Example - Linux kernel
$ git clone --depth 1 git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux-git
$ cloc linux-git
   72137 text files.
   71680 unique files.
   10738 files ignored.

github.com/AlDanial/cloc v 1.82  T=135.11 s (454.7 files/s, 215984.8 lines/s)
| Language | files | blank | comment | code |
|---|---:|---:|---:|---:|
| C | 29973 | 3000212 | 2418070 | 15289301 |
| C/C++ Header | 21651 | 611387 | 1092333 | 5154346 |
| reStructuredText | 2813 | 139617 | 57436 | 386201 |
| Assembly | 1273 | 45792 | 97779 | 222246 |
| JSON | 329 | 2 | 0 | 171946 |
| YAML | 1682 | 29146 | 7665 | 130314 |
| Bourne Shell | 727 | 20691 | 13933 | 80611 |
| make | 2624 | 10093 | 11133 | 45436 |
| SVG | 59 | 78 | 1159 | 37555 |
| Perl | 63 | 7073 | 4928 | 35728 |
| Python | 125 | 5545 | 5013 | 28449 |
| yacc | 9 | 693 | 355 | 4761 |
| PO File | 5 | 791 | 918 | 3077 |
| lex | 9 | 345 | 303 | 2103 |
| C++ | 10 | 349 | 138 | 1935 |
| Bourne Again Shell | 51 | 297 | 247 | 1304 |
| awk | 10 | 149 | 119 | 1084 |
| Glade | 1 | 58 | 0 | 603 |
| NAnt script | 2 | 143 | 0 | 564 |
| Cucumber | 1 | 30 | 50 | 183 |
| Windows Module Definition | 2 | 15 | 0 | 109 |
| m4 | 1 | 15 | 1 | 95 |
| CSS | 1 | 28 | 29 | 80 |
| XSLT | 5 | 13 | 26 | 61 |
| vim script | 1 | 3 | 12 | 27 |
| Ruby | 1 | 4 | 0 | 25 |
| INI | 1 | 1 | 0 | 6 |
| sed | 1 | 2 | 5 | 5 |
| SUM: | 61430 | 3872572 | 3711652 | 21598155 |
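Rather than scraping the table, cloc can emit machine-readable output (`cloc --json`), which makes breakdowns like the one above easy to post-process. A sketch using a trimmed sample in that shape, with the kernel's C and header figures from the table:

```python
import json

# Trimmed sample in the shape `cloc --json` emits: one entry per
# language plus a SUM entry (cloc also adds a "header" entry, omitted here).
cloc_output = """{
  "C":            {"nFiles": 29973, "blank": 3000212, "comment": 2418070, "code": 15289301},
  "C/C++ Header": {"nFiles": 21651, "blank": 611387,  "comment": 1092333, "code": 5154346},
  "SUM":          {"nFiles": 61430, "blank": 3872572, "comment": 3711652, "code": 21598155}
}"""

def language_share(report: dict) -> dict:
    """Fraction of total code lines per language, ignoring the SUM row."""
    total = report["SUM"]["code"]
    return {lang: row["code"] / total
            for lang, row in report.items() if lang != "SUM"}

shares = language_share(json.loads(cloc_output))
print(f"C accounts for {shares['C']:.0%} of the kernel's code lines")  # 71%
```

Running the same post-processing across many repositories is what turns one worked example into the comparative data set the challenge asks for.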
Visualisations
Alongside the statistics, develop visualisations that help non-technical staff involved in pricing work understand what the data is telling them, so that we can get better at identifying the metrics that drive successful projects.
Getting Started
To get started on this challenge, explore the data accessible through github.
There are comprehensive search facilities. The public repositories can be searched directly
https://github.com/search/advanced
or through an API:
https://docs.github.com/en/rest/reference/repos
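The search endpoint can be exercised with nothing beyond the Python standard library. A hedged sketch – the query terms are illustrative, and unauthenticated requests are heavily rate-limited, so in practice you would add an `Authorization` token header:

```python
import json
import urllib.parse
import urllib.request

def search_repos_url(query: str, sort: str = "stars", per_page: int = 5) -> str:
    """Build a GitHub REST search URL; no network access happens here."""
    params = urllib.parse.urlencode(
        {"q": query, "sort": sort, "per_page": per_page})
    return f"https://api.github.com/search/repositories?{params}"

url = search_repos_url("fitness tracker in:name,description language:python")
print(url)

# To actually run the search (rate-limited without a token):
# with urllib.request.urlopen(url) as resp:
#     for repo in json.load(resp)["items"]:
#         print(repo["full_name"], repo["stargazers_count"])
```

Separating URL construction from the request keeps the sketch testable offline, and the same pattern extends to the other REST endpoints (contributors, commits, languages) the challenge needs.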
For a useful guide to the programmatic interfaces, using Python, gh, and other tools, check out this Stack Overflow article.
Load the data in, run some analytics, and address the challenge in an easy-to-use way!
Get in touch
If you’d like to discuss this challenge in more detail (including some detailed criteria, and how the analytics could be developed), please reach out to us.