You surely would have loved playing around with integer sequences in your high school days. But have you ever given some thought to that all those random sequences could actually make some statistical sense?
The Online Encyclopedia of Integer Sequences (OEIS) is a catalog of integer sequences. It was created and maintained by Neil Sloane. The database basically contains sequences of integers carried out methodically, over 40 years. As of May 2011, the OEIS contained over 200,000 integer sequences. Its compilation has involved hundreds of mathematicians from around the world which contributes to its homogeneity. We are particularly interested in the frequency of occurrence of any integer n; N(n); in the database. This number N(n) marks the importance of n and it varies noticeably from one number to another. The concept of “importance” can be mathematically explained by concepts of algorithmic complexity. However, the observed curve does not match the curve predicted by Kolmogorov-Chaitin Complexity or simply algorithmic complexity, as there is a clear gap separating the distribution (N(n) vs n) into two clouds of points. This gap is generally referred to as Sloane’s Gap.
Image Source: arxiv.org
When plotting N(n) vs n, following features are evident:-
- Statistical regression shows that points N(n) cluster around k/n33 where k≈ 2.53*108. This is easily explained by algorithmic information theory. In short, there is not any problem with this occurrence in the plot.
- What is indeed somewhat cumbersome about this plot is, when it comes to explaining the gap in the distribution. Visual inspection clearly shows that there are actually two sub-clusters and there is a gap separating them. This is not predicted by algorithmic complexity and it is even not produced when a database is populated with automatically generated sequences. The gap is unexpected and requires an explanation. If there is a bias of mathematicians towards certain sequences of integers then it needs to be quantified. This is what we will try to do in this article.
A common use of Sloane’s encyclopedia is in determining the logic of a sequence of integer. If you give any sequence to the database then the program will certainly provide you the pattern of the sequence. Also if you enter an isolated number in the database, it can tell you exactly how many sequences does that number belong to. For example if you enter the Hardy-Ramanujan number, 1729; the program instantaneously tells following properties of this number:-
- It is the thirteenth number of the kind n3+1
- It is the third Carmichael number
- It is the fourth “factorial sextuple” that is to say, product of successive terms of the form (6n+1)
- It is the sum of the factors of a perfect square (33)2
- It is the third number of which if its digits are added together 1+7+2+9 = 19; yields its largest factor, 1729= 7*13*19
- It is a product of 19, a prime number, and its inverse 91
- It is the total number of ways to express 33 as the sum of six integers
and many more…..
Approximately 40 mathematicians constitute the “editorial committee” of the database, but any user may propose sequences that he/she finds interesting.
One can ask an interesting question at this stage. “Which numbers do not occur in the Sloane’s Encylopedia?” It was tracked down that the smallest number missing was 8795 followed in order by 9935, 11147, 11446, 11612, 11630….
Recently by augmentation of several thousands of new sequences, the missing numbers were found to have comprised of 11630, 12067, 12407, 12887……
WHY TO STUDY THESE ?
The instability over time of the sequences of missing numbers in OEIS suggests the need for a study of distribution of numbers rather than their presence or absence. Let us give a few examples. The value of N(1729) is 380. For its predecessor N(1728)= 622, and for its successor N(1730)= 106. The sequence (N(n)) is generally characterized by a decreasing curve, that is generally N(n-1)>N(n). However for certain numbers this property does not hold, that is they have more properties than their predecessors which is reflected in their frequency N(n). We can assign such numbers as “interesting”.
The first interesting number is 15, because N(15)= 34183 > N(14)= 32487. Some more interesting numbers are 16, 23, 24, 27, 28 ,29 etc. We insist on the fact that, although unquestionably dependent on certain individual decisions made by those who participate in building the sequence database, the database is not in itself arbitrary. The number of contributors is very large, and the idea that the database represents an objective view could be defended on the grounds that it comprises the independent view of each person who contributes to it.
Now talking about the shape of the distribution N(n) vs n curve, the gap is not clearly visible for first 300 numbers, so we exclude from our study numbers less than 300. Now using some empirical method of determining the boundary which I am not discussing here, indeed a boundary could be formulated precisely such that nearly 83% of the numbers close to the boundary fall inside the boundary. Clearly speaking what we can visually perceive is formidable mathematically in the curve.
We will call the upper cloud as A and lower cloud as B. Now some remarkable properties of upper cloud are that prime numbers and powers of two seem to situate more frequently in A. For example, 83 square numbers are found between 301 and 10000. Among these, 79 are located above the gap, and those who have missed namely 361, 484, 529, 676 have missed marginally from the limit value to be considered a part of A. The limit value is decided as I mentioned above by the mathematically defined boundary.
Similarly 99.7% of all the prime numbers situated within 301 and 10000 lie in A. Again, those who missed, missed very marginally.
One more class of integers that seem to be overrepresented in A are those that have a multitude of factors, that is number expressed as its prime factors counted with its multiples. All the numbers with at least ten multitude of factor occur entirely in A.
For a number to find its home in A, it must possess properties that are simple. Simple means that “it can be expressed in a few words”. Conversely if a number possesses simple properties then it also possesses many properties. Algorithmic complexity theory assigns a specific mathematical sense to the notion of simplicity. This theory proposes to measure the complexity of a finite object in binary code by the length of the shortest program that generates a representation of it. It also leads to a theorem of invariance that warrants a certain independence of programming language. The mathematical details are too complicated to be discussed here but we can infer that the precise calculation of the expected value of N(n) is impossible. When we plot this complexity curve of n vs n (where N(n) was generated using randomly generated functions. How these functions are generated is again left for the reader to read on his own), there was no such gap present in the graph. There is a decreasing oblique curve with clustering of points near the base and a mean near 0, that confirms with the algorithmic complexity and Sloane’s curve but no gap.
Image Source: arxiv.org
So is the gap due to a social effect? We suppose that the distribution anticipated by considerations of complexity is deformed by the social effect concomitant to it: mathematicians are indeed interested in certain sequences more than the other! This interest can have cultural reasons or mathematical reasons, but in either case it brings with it an over-investment on the part of mathematical community. The numbers that receive this over-investment are not in general, but rather because they have certain regularities associated with them. Rather these numbers are situated near the pinnacle of the theoretical asymmetrical distribution. Owing to the community’s over-investment they have shifted to the right hand side of the graph, thus explaining the Sloane’s gap.
Thus the cloud of points representing the function N shows features that could be understood as being the result of at the same time human and purely mathematical factors.