In Project Science we use a wide range of tools to deliver compelling insights into data. We often get asked what tools we use, so here’s a list:
We use a wide range of programming languages. After you’ve been programming for over 20 years, you see more commonalities than differences (and develop a keen sense for what’s practical). Here are some of the languages we use:
C is a venerable language, developed in the early 1970s. It is a low-level language that can be seen as a portable assembler, and with its pointer-based programming model it can behave like one too. However, we compile with gcc -Wall -Wextra to turn on warning flags, and find that modern compilers generate plenty of useful messages that help us maximise type safety and constrain the language to improve robustness.
Haskell is described as “an advanced, purely functional programming language”. It’s a revelation in many regards, coming closest to pure mathematics (try learning about fmap, functors and monads - like learning general relativity in physics!). It’s a tough language to work in, particularly compared to the ease of Python, but - and this is a big but - once it compiles, a Haskell program “just works”. Haskell has remarkable modularity, which tames complexity. Learning to think in a functional style shapes your thinking in all kinds of other ways, and for this alone we recommend Haskell.
Python is a wonderful general-purpose language that serves as the “glue” for many projects. Most machine learning tutorials and starter projects assume Python.
R, from the R Project for Statistical Computing, provides a great range of analysis tools, integrates with various databases, and generates useful publication-quality charts.
Computing is now pervasive. The amount of low-cost processing power at our fingertips is unprecedented and continues to grow exponentially.
Our desktop PCs with NVidia graphics cards provide a solid combination of serial speed through the CPU, with parallel processing through the GPU cores. This provides the power to train machine learning models. Less power is required (but still helpful) to then run the models.
We really like the NVidia Jetson Nano. One of the latest in Single Board Computers (SBCs), this has a quad-core 64-bit Arm CPU paired with a 128-core NVidia GPU, and a beefy heatsink. Best of all, it’s tiny, yet hooks up via USB ports and HDMI to a regular mouse, keyboard and screen. It has most of the power of a desktop PC, but only uses 20 watts.
The Raspberry Pi provides us with additional machine learning power, for example for running speech recognition systems. Like the Jetson Nano, it’s quick and well supported.
Of course, there’s cloud-computing power for when we need to scale.
Alongside Unix text editors, we like Jupyter notebooks because they enable us to write Python code interleaved with natural language explanatory text. This combination yields a “compilable” document, a type of literate programming.
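To make the interleaving concrete, here is a minimal sketch that builds a two-cell notebook by hand. A notebook file is stored as JSON, so the standard library is enough. The helper names (markdown_cell, code_cell) and the cell contents are ours, purely for illustration; in practice you would simply use the Jupyter editor.

```python
import json

def markdown_cell(text):
    # Explanatory prose lives in a markdown cell.
    return {"cell_type": "markdown", "metadata": {}, "source": text}

def code_cell(source):
    # Executable Python lives in a code cell; outputs start empty.
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }

# Notebook format version 4.4: prose and code sit side by side in one document.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        markdown_cell("## Mean of a sample\nWe average three readings."),
        code_cell("readings = [1.0, 2.0, 3.0]\nsum(readings) / len(readings)"),
    ],
}

notebook_json = json.dumps(notebook, indent=1)
print(notebook_json)
```

Saving notebook_json to a file with an .ipynb extension should give something Jupyter can open, with the explanation rendered above the code it describes.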
LaTeX features in our toolkit too. The production quality output is inarguable, and the separation of content from presentation frees you up to focus on the content.
We like Markdown too - this website is written in Markdown.
Statistics and Analytics
As an alternative to the ubiquitous Microsoft Office, LibreOffice provides a compelling offer.
Jamovi is a real-time statistical spreadsheet we’re getting into.
With its small memory footprint, SQLite is a great choice for embedded systems. For larger systems we use server-based SQL databases. There are many references for SQL, but the definitive one is ISO/IEC 9075-1:2016 SQL — Part 1: Framework.
Graph databases are the ‘new thing’, although arguably they’ve been around since before SQL. We like:
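As a small illustration of how little setup SQLite needs, here is a sketch using Python’s built-in sqlite3 module. The readings table and its sensor values are invented for the example, and the database lives in memory so nothing is written to disk.

```python
import sqlite3

# An in-memory database: no server and no file, which suits a quick sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT NOT NULL, value REAL NOT NULL)")

# Hypothetical sample data; values are bound through ? placeholders.
conn.executemany(
    "INSERT INTO readings (sensor, value) VALUES (?, ?)",
    [("temp", 20.5), ("temp", 21.5), ("humidity", 40.0)],
)
conn.commit()

# Ordinary SQL: average value per sensor, in alphabetical order.
rows = conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
conn.close()

print(rows)  # [('humidity', 40.0), ('temp', 21.0)]
```

The same query text would carry over largely unchanged to a server-based database; mainly the connection line differs.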
- Apache TinkerPop, a graph computing framework for graph databases and analytics systems.
- Neo4j Graph Platform, “a native graph database, built from the ground up to leverage not only data but also data relationships.”
Neo4j is fully pre-packaged, with a built-in browser-based query and visualisation tool. Neo4j is very popular, with a wide marketing campaign that’s winning converts.
Apache TinkerPop is harder to use, but, in our experience, has a more robust interface with a stronger type model: essential for ‘right first time’ systems.
We suggest you start with Neo4j, then try Apache TinkerPop Gremlin with Python.
Microsoft Power BI is the de facto choice, and Excel goes a long way, but Graphia is cool.
- vis.js, a browser-based visualisation library.
- d3.js “Data-driven documents” - very cool, with a great library of examples.
The explosion in computing power available through CPUs and GPUs enables artificial neural nets to work on desktop computers in sensible timeframes. Coupled with the wide availability of training libraries required to “program” the artificial neural net, we have mainstream machine learning.
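To show what “programming” a neural net by training means, here is a deliberately tiny sketch: a single artificial neuron (a perceptron) that learns logical OR from examples, written in plain Python with no library. The data, learning rate and epoch count are our own choices for illustration; real projects would use one of the libraries listed in this section.

```python
# Training examples for logical OR: (inputs, expected output).
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

def predict(weights, bias, inputs):
    # Weighted sum followed by a step activation.
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

def train(data, epochs=20, learning_rate=0.1):
    # Perceptron learning rule: nudge the weights whenever a prediction is wrong.
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in data:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

weights, bias = train(DATA)
predictions = [predict(weights, bias, inputs) for inputs, _ in DATA]
print(predictions)  # [0, 1, 1, 1]
```

Nobody writes the OR rule into the code; the behaviour comes from the examples. Larger networks follow the same idea with far more weights, which is where the GPU cores earn their keep.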
It’s a very fast-moving area. We recommend these:
TensorFlow is an “end-to-end open source machine learning platform”.
TensorBoard is handy for visualising machine learning models during training.
NVidia Jetson Nano, mentioned earlier - the Ubuntu operating system it runs comes packaged with all the necessary libraries, and some great tutorials.
Python is a very popular programming language used to interface with machine learning libraries.
PyTorch is “an open source machine learning framework that accelerates the path from research prototyping to production deployment”.
Formal Specification Languages
The Z notation is the gold standard for specifying systems. Hard to use, but very precise and inarguable: exactly what is needed for high integrity and safety-related systems. Z and Haskell make a challenging but convincing combination for when nothing but the very best will do.
We’re growing fans of Behaviour Driven Development, in which structured natural language is converted into executable tests. See Behaviour Driven Development on Wikipedia.
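Here is a rough sketch of the idea in plain Python, not any particular BDD tool: each Given/When/Then line is matched against a pattern and dispatched to the function that implements it. The step decorator, the run_scenario helper and the shopping basket scenario are all hypothetical names invented for this example.

```python
import re

# Hypothetical step registry: (compiled pattern, implementing function) pairs.
STEPS = []

def step(pattern):
    # Decorator that registers a function against a sentence pattern.
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register

@step(r"Given a basket containing (\d+) items?")
def given_basket(ctx, count):
    ctx["basket"] = int(count)

@step(r"When I add (\d+) items?")
def when_add(ctx, count):
    ctx["basket"] += int(count)

@step(r"Then the basket contains (\d+) items?")
def then_contains(ctx, count):
    assert ctx["basket"] == int(count), f"expected {count}, got {ctx['basket']}"

def run_scenario(text):
    # Execute each non-empty line by finding the first step whose pattern matches.
    ctx = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        for pattern, func in STEPS:
            match = pattern.fullmatch(line)
            if match:
                func(ctx, *match.groups())
                break
        else:
            raise LookupError(f"no step matches: {line!r}")
    return ctx

scenario = """
Given a basket containing 2 items
When I add 3 items
Then the basket contains 5 items
"""
result = run_scenario(scenario)
print(result)  # {'basket': 5}
```

The scenario text stays readable by non-programmers, while a failed Then step raises an AssertionError like any other test failure.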
Drop us a line if we’ve missed anything out - we’d love to hear about what you use!