Software’s Hidden Clockwork

The central result of this work is the following clockwork theorem. It demonstrates that, independently of programming language, how it was built, what it does or who built it, the component size distribution of any software system will always asymptote to obey:-

p_i ~ (a_i)^(-beta)

where p_i is the probability of a component having a unique alphabet of a_i tokens and beta is a constant, around 3.2 for the tail of these data. Click on the picture to watch the distribution asymptote as the total code population in 7 different languages increases from 1 million lines to over 40 million lines in increments of 1 million lines. The flat part at the left arises because programming languages have fixed tokens, for example keywords.
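As a rough illustration of the quantities involved, the sketch below counts the unique alphabet of a component and estimates a tail exponent from a list of such counts. It is not the procedure used in the paper: the regex tokenizer, the function names `alphabet_size` and `tail_exponent`, the cut-off `a_min` and the sample components are all assumptions made for illustration, and the estimator is the standard continuous maximum-likelihood formula for a power-law tail.

```python
import math
import re


def alphabet_size(source: str) -> int:
    """Count the distinct tokens in one component.

    Assumption: a crude regex tokenizer (identifiers, integers, single
    punctuation characters) stands in for a real language lexer.
    """
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return len(set(tokens))


def tail_exponent(sizes, a_min):
    """Continuous maximum-likelihood estimate of beta for sizes >= a_min.

    Uses beta = 1 + n / sum(ln(a / a_min)) over the n tail values.
    """
    tail = [a for a in sizes if a >= a_min]
    if not tail:
        raise ValueError("no components at or above a_min")
    log_sum = sum(math.log(a / a_min) for a in tail)
    if log_sum == 0:
        raise ValueError("all tail values equal a_min; exponent undefined")
    return 1.0 + len(tail) / log_sum


# Hypothetical components, far too few for a meaningful estimate.
components = [
    "int x = x + 1;",
    "return a * b + c;",
    "if (n > 0) { n = n - 1; }",
]
sizes = [alphabet_size(c) for c in components]
print(sizes, tail_exponent(sizes, min(sizes)))
```

With only a handful of components the estimate is meaningless; the data described above run to tens of millions of lines, and the choice of `a_min` matters because the power law is only claimed for the tail.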

This theorem follows directly from the Conservation of Information, which appears to exert the same influence on discrete systems as the Conservation of Energy does on physical systems.

Hatton_IFIP2011_19Oct2011.pdf (1.9 MB)

(Note that this has now been extended to include MATLAB code, bringing the total to almost 60 million lines of source code as of March 2013. The fit is good to a p-value of less than 10^(-16).)

Software defects in such systems are then forced to obey

where d_i is the number of defects and t_i is the total number of tokens in the i-th component.

Power-laws and the Conservation of Information in discrete token systems: Part 1 General Theory

Power-laws and the Conservation of Information in discrete token systems: Part 2 The role of defect

 

Chance Discovery Bibliographic Search Engine

An implementation of chance discovery as used in bibliographic search but with some of my own wrinkles associated with the entropy of documents. The zip archive contains a self-installing Windows executable. It consistently finds really interesting relationships in documents. It is bundled with the complete works of Shakespeare and the King James Bible courtesy of the splendid Project Gutenberg.

You can download it here:-

Chance Discovery Bibliographic Search Engine
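
The description above mentions the entropy of documents without saying how the tool computes it. As a minimal sketch of one common definition, the following computes the Shannon entropy, in bits, of a document's word-frequency distribution. The function name `word_entropy` and the simple lowercase word splitter are assumptions for illustration and are not taken from the search engine itself.

```python
import math
import re
from collections import Counter


def word_entropy(text: str) -> float:
    """Shannon entropy (bits per word) of the word distribution in text.

    Assumption: words are runs of letters or apostrophes, case-folded.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A text that repeats few words has lower entropy than one of equal
# length in which every word is different.
print(word_entropy("to be or not to be"))
print(word_entropy("now is the winter of our"))
```

Comparing such values across documents, or across sections of one document, is one simple way to quantify how varied their vocabulary is.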