Predicting software projects
By James Lea
Introduction
Digital construction – the construction of software – is the mirror counterpart to physical construction.
Software drives the modern world. Developing, testing and delivering software on time, to the desired quality, is a major challenge to organisations worldwide, and to projects in particular.
When developing the business case, tendering for, and mobilising projects we must seek answers to many questions, such as:
- What proportion of the developed system will be software?
- What language might we choose to write it in?
- How might the code be structured?
- What size will the software code base be?
- How long will the software take to write?
- How many defects will there be, and of these, how many need to be fixed before we can release?
- How many people need to work on it?
- What range of skills will be needed?
- What dependencies will it have on existing systems?
- How often do we need to update the software?
All these questions can be answered using data.
Opportunity
Historically, organisation data that could answer these questions sat in silos – silos that normally evaporated when the project closed. Occasionally the data was shared across the organisation, with varying degrees of success.
Now, the software world has been transformed beyond all recognition. We have enormous repositories of open source software, and GitHub is one of the largest. This data set is bigger than any other I have encountered.
As of writing, GitHub claims:
- 56+ million developers
- 3+ million organisations
- 100+ million repositories
This is an enormous data set, and provides a great opportunity for us to exploit data analytics and data science to price up software projects, and steer them to success.
Database
GitHub can be regarded as a huge database. Not only does it host a huge number of repositories, but each repository is itself a git distributed version control database. Each git repository tracks changes to files: who made them, when, on which branch, and how those changes flowed through the repository.
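That commit metadata can be pulled out of any local clone with `git log` and analysed programmatically. A minimal Python sketch – the `author|date|hash` record format here is my own choice of `--pretty` string, not anything git or GitHub prescribes:

```python
import subprocess
from collections import Counter

def repo_log(path: str) -> str:
    """Dump a local clone's history, one 'author|date|hash' record per line."""
    return subprocess.run(
        ["git", "-C", path, "log", "--pretty=format:%an|%ad|%H"],
        capture_output=True, text=True, check=True,
    ).stdout

def commits_per_author(log_text: str) -> Counter:
    """Count commits per author from the formatted log text."""
    counts = Counter()
    for line in log_text.splitlines():
        if line.strip():
            author, _date, _sha = line.split("|", 2)
            counts[author] += 1
    return counts

# Demo on a canned log, so the sketch runs without a repository to hand.
sample = "Alice|Mon Jan 4|abc123\nBob|Tue Jan 5|def456\nAlice|Wed Jan 6|789abc"
print(commits_per_author(sample))  # Counter({'Alice': 2, 'Bob': 1})
```

The same parsed records also answer "when" and "how actively maintained" questions: group by date instead of author.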
Worked example
This Stack Overflow page describes how to count the lines of code in a locally cloned GitHub repository.
Challenge 1 - Estimating Tool
Develop a tool (to be made available on a github public repository of course!) that can interrogate the github data set to answer project estimating and forecasting questions.
For example, imagine a scenario where a business owner has identified a market opportunity, to develop a novel fitness and activity tracking application.
They may ask their sales and engineering teams this question:
“We’re planning to develop a fitness and activity tracking application, but we’ve not developed one before. Go find out what’s already out there, and get some data that helps us gauge the market potential.
I want to know what their target platforms are (Android, Linux, iOS, Windows), what languages those apps have been written in (C, Python, etc.).
For each application, give me a breakdown of how many lines of code they are, how long they took to write, how many bug fixes there have been, and how actively they are maintained. Estimate some upper and lower bounds on the cost of development, based on the number of people that have worked on them.”
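The cost-bound part of that brief can be roughed out with simple contributor-month arithmetic. A hedged sketch – the monthly rates and the "quarter of contributors were full-time" assumption below are placeholder illustrations, not industry figures:

```python
def cost_bounds(contributors: int, months: float,
                low_rate: float = 4_000.0, high_rate: float = 12_000.0):
    """Crude lower/upper cost bounds in currency units.

    Lower bound: assume only a quarter of listed contributors were
    effectively full-time. Upper bound: assume all of them were.
    The per-person monthly rates are illustrative placeholders.
    """
    person_months_low = max(1, contributors // 4) * months
    person_months_high = contributors * months
    return person_months_low * low_rate, person_months_high * high_rate

low, high = cost_bounds(contributors=20, months=6)
print(f"between {low:,.0f} and {high:,.0f}")  # between 120,000 and 1,440,000
```

The wide spread is deliberate: contributor counts on public repositories include drive-by committers, so the bounds should be calibrated against projects whose true costs you know.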
Worked Example - Linux kernel
$ git clone --depth 1 git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux-git
$ cloc linux-git
   72137 text files.
   71680 unique files.
   10738 files ignored.

github.com/AlDanial/cloc v 1.82  T=135.11 s (454.7 files/s, 215984.8 lines/s)
| Language | files | blank | comment | code |
|---|---:|---:|---:|---:|
| C | 29973 | 3000212 | 2418070 | 15289301 |
| C/C++ Header | 21651 | 611387 | 1092333 | 5154346 |
| reStructuredText | 2813 | 139617 | 57436 | 386201 |
| Assembly | 1273 | 45792 | 97779 | 222246 |
| JSON | 329 | 2 | 0 | 171946 |
| YAML | 1682 | 29146 | 7665 | 130314 |
| Bourne Shell | 727 | 20691 | 13933 | 80611 |
| make | 2624 | 10093 | 11133 | 45436 |
| SVG | 59 | 78 | 1159 | 37555 |
| Perl | 63 | 7073 | 4928 | 35728 |
| Python | 125 | 5545 | 5013 | 28449 |
| yacc | 9 | 693 | 355 | 4761 |
| PO File | 5 | 791 | 918 | 3077 |
| lex | 9 | 345 | 303 | 2103 |
| C++ | 10 | 349 | 138 | 1935 |
| Bourne Again Shell | 51 | 297 | 247 | 1304 |
| awk | 10 | 149 | 119 | 1084 |
| Glade | 1 | 58 | 0 | 603 |
| NAnt script | 2 | 143 | 0 | 564 |
| Cucumber | 1 | 30 | 50 | 183 |
| Windows Module Definition | 2 | 15 | 0 | 109 |
| m4 | 1 | 15 | 1 | 95 |
| CSS | 1 | 28 | 29 | 80 |
| XSLT | 5 | 13 | 26 | 61 |
| vim script | 1 | 3 | 12 | 27 |
| Ruby | 1 | 4 | 0 | 25 |
| INI | 1 | 1 | 0 | 6 |
| sed | 1 | 2 | 5 | 5 |
| SUM: | 61430 | 3872572 | 3711652 | 21598155 |
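Rather than scraping the table, cloc can emit machine-readable output (`cloc --json`), which makes breakdowns like the one above easy to post-process. A sketch using a trimmed sample in that shape, with the kernel's C and header figures from the table:

```python
import json

# Trimmed sample in the shape `cloc --json` emits: one entry per
# language plus a SUM entry (cloc also adds a "header" entry, omitted here).
cloc_output = """{
  "C":            {"nFiles": 29973, "blank": 3000212, "comment": 2418070, "code": 15289301},
  "C/C++ Header": {"nFiles": 21651, "blank": 611387,  "comment": 1092333, "code": 5154346},
  "SUM":          {"nFiles": 61430, "blank": 3872572, "comment": 3711652, "code": 21598155}
}"""

def language_share(report: dict) -> dict:
    """Fraction of total code lines per language, ignoring the SUM row."""
    total = report["SUM"]["code"]
    return {lang: row["code"] / total
            for lang, row in report.items() if lang != "SUM"}

shares = language_share(json.loads(cloc_output))
print(f"C accounts for {shares['C']:.0%} of the kernel's code lines")  # 71%
```

Running the same post-processing across many repositories is what turns one worked example into the comparative data set the challenge asks for.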
Visualisations
Alongside the statistics, develop visualisations that help non-technical staff involved in pricing work understand what the data is telling them, so that we can get better at identifying the metrics that drive successful projects.
Getting Started
To get started on this challenge, explore the data accessible through github.
There are comprehensive search facilities. The public repositories can be searched directly
https://github.com/search/advanced
or through an API:
https://docs.github.com/en/rest/reference/repos
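The search endpoint can be exercised with nothing beyond the Python standard library. A hedged sketch – the query terms are illustrative, and unauthenticated requests are heavily rate-limited, so in practice you would add an `Authorization` token header:

```python
import json
import urllib.parse
import urllib.request

def search_repos_url(query: str, sort: str = "stars", per_page: int = 5) -> str:
    """Build a GitHub REST search URL; no network access happens here."""
    params = urllib.parse.urlencode(
        {"q": query, "sort": sort, "per_page": per_page})
    return f"https://api.github.com/search/repositories?{params}"

url = search_repos_url("fitness tracker in:name,description language:python")
print(url)

# To actually run the search (rate-limited without a token):
# with urllib.request.urlopen(url) as resp:
#     for repo in json.load(resp)["items"]:
#         print(repo["full_name"], repo["stargazers_count"])
```

Separating URL construction from the request keeps the sketch testable offline, and the same pattern extends to the other REST endpoints (contributors, commits, languages) the challenge needs.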
For a useful guide to the programmatic interfaces, using Python, gh, and other tools, check out this Stack Overflow article.
Load the data in, run some analytics, and address the challenge in an easy-to-use way!
Get in touch
If you’d like to discuss this challenge in more detail (including some detailed criteria, and how the analytics could be developed), please reach out to us.