# ILLC Course: Measuring information in very large data sets

6 ECTS, January 2017

Coordinator: Pieter Adriaans

Requirements: the course is intended to be self-contained; some background in information theory and/or semantic web theory would help (but if you don't like mathematics you will probably have a hard time).

Work forms: lectures, guest lectures, hands-on experiments on real data sets, workgroups, presentations, essay

Contributors: Frank van Harmelen, Peter Bloem, Leen Torenvliet, Pieter Adriaans

Background: the results of the COMMIT P23 subproject have shown that bringing together researchers from disciplines such as information theory, semantic web, and logic can lead to promising synergetic results. This course is aimed at students and researchers from the UvA, VU and ILLC, to explore these possibilities further.

In this course we explore recent developments in the field of information theory in the light of research issues generated by the study of extremely large data sets. Main themes:

1) Expanding the power of information-theoretical measures, which currently deal mainly with strings of messages or bits, to cover higher-dimensional data, sparse data, classes of languages, sets, and graphs.
2) Analyzing the behavior of various measures of complexity and meaningful information.
3) Applying these insights in practice to efficient model selection in extremely large data sets.
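As a minimal illustration of the string-based starting point of theme 1 (not part of the course material; the function name and example strings are ours), the empirical Shannon entropy of a string measures its information content in bits per symbol:

```python
from collections import Counter
from math import log2

def shannon_entropy(s: str) -> float:
    """Empirical Shannon entropy of a string, in bits per symbol,
    based on the observed symbol frequencies."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A repetitive string carries less information per symbol
# than a string drawn from a larger alphabet.
print(shannon_entropy("aaaaaaaa"))  # one symbol: 0 bits/symbol
print(shannon_entropy("abababab"))  # two equiprobable symbols: 1 bit/symbol
print(shannon_entropy("abcdefgh"))  # eight equiprobable symbols: 3 bits/symbol
```

The course's theme 1 asks how such measures, defined here for a flat sequence of symbols, can be extended to richer objects such as sets and graphs.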

**Fundamental Research Questions:**

| Theoretical | Empirical |
| --- | --- |
| What is the interaction between computation and information? | What are the principles that govern the emergent structure of knowledge graphs? |
| Is there a notion of information that unifies current entropy-based approaches (Shannon, Kolmogorov)? | How is the meaning of a knowledge graph affected by its structure? |
| What is meaningful information? | How do we measure information in knowledge graphs? |

**Time & Place**

We have two locations:

https://rooster.uva.nl/locationinfo/630GII213

https://rooster.uva.nl/locationinfo/630GII305

The schedule for the classes is:

Week 1 (9-13 January):

Mon: 11:00-13:00 SP G3.05

Tue-Fri 15:00-17:00 SP G3.05

Week 2 (16-20 January):

Mon: 11:00-13:00 SP G3.05

Tue & Wed 15:00-17:00 SP G2.13

Thu 15:00-17:00 SP G3.05

Fri 15:00-17:00 SP G2.13

**Classes**

- Short overview (both historical and systematic) of the concept of information (see the Stanford Encyclopedia of Philosophy entry on Information: http://plato.stanford.edu/entries/information/). This overview culminates in a list of open problems in the philosophy of information: http://plato.stanford.edu/entries/information/supplement.html
- Current research issues in the study of the semantic web, whose main theme is: our theory is falling behind our technology (see: https://sssw.org/2016/?page_id=386). The aim is to show the connection between some of the open problems in the philosophy of information and these current research issues.
- Aspects of information theory: optimal codes, Bayes' law, the relation with entropy, randomness deficiency, incompressibility of the set of natural numbers, two-part code optimization, structure functions, Normalized Compression Distance. The aim is to show that the theory really works in practice, but that there are flaws.
- Presentation of some new ideas on information measurement and meaningful information, insofar as they clarify some of the problems: https://arxiv.org/abs/1611.07829.
- Applications of the theory. See e.g. P. Bloem's PhD thesis, Chapter V.
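As a small taste of how the Normalized Compression Distance mentioned above works in practice, the following sketch approximates the (uncomputable) Kolmogorov-complexity-based information distance with a real compressor. The choice of zlib and the example data are ours, not part of the course material:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: approximates the normalized
    information distance by substituting compressed lengths under a
    real compressor (here zlib) for Kolmogorov complexities."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog" * 10
b_ = b"the quick brown fox jumps over the lazy dog" * 10
c = bytes(range(256)) * 2  # unrelated, hard-to-compress data

print(ncd(a, b_))  # near 0: the concatenation compresses almost as well as one copy
print(ncd(a, c))   # larger: the two inputs share no structure
```

Because zlib is an imperfect stand-in for Kolmogorov complexity, NCD(x, x) is small but not exactly zero; this gap between theory and practice is one of the "flaws" the lectures examine.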

Weeks 3 and 4:

Projects are either empirical, working with real-life data sets, or more theoretical, writing a paper on a selected subject.

**Presentations:**

Frank van Harmelen

Pieter Adriaans

**Standard reference texts:**

C. de la Higuera, *Grammatical Inference: Learning Automata and Grammars*, Cambridge UP, 2010.

J.E. Hopcroft, R. Motwani, J.D. Ullman, *Introduction to Automata Theory, Languages, and Computation*, Addison-Wesley, 2nd ed., 2001.

T.M. Cover, J.A. Thomas, *Elements of Information Theory*, Wiley, 2nd ed., 2006.

M. Li, P. Vitányi, *An Introduction to Kolmogorov Complexity and Its Applications*, Springer, 3rd ed., 2008.

P. Grünwald, *The Minimum Description Length Principle*, MIT Press, 2007.

P. Adriaans, J. van Benthem (eds.), *Philosophy of Information*, North Holland, 2008.

**References and Background materials**

- P.W. Adriaans, "Information", *The Stanford Encyclopedia of Philosophy* (Winter 2012 Edition), Edward N. Zalta (ed.), 2016.
- P.W. Adriaans, "Facticity as the amount of self-descriptive information in a data set", arXiv:1203.2245 [cs.IT], 2012.
- Meri Lisi, "Some remarks on the Cantor pairing function": https://www.dmi.unict.it/ojs/index.php/lematematiche/article/view/14/13
- Melvyn B. Nathanson, "Cantor polynomials and the Fueter-Polya theorem": https://arxiv.org/abs/1512.08261
- Peter Bloem, Steven de Rooij, Pieter Adriaans, "Two Problems for Sophistication", Algorithmic Learning Theory, Lecture Notes in Computer Science vol. 9355, pp. 379-394. (Find related material here.)
- Peter Bloem, Francisco Mota, Steven de Rooij, Luís Antunes, Pieter Adriaans, "A Safe Approximation for Kolmogorov Complexity", Algorithmic Learning Theory, Lecture Notes in Computer Science vol. 8776, pp. 336-350, 2014.
- P.W. Adriaans, P.E. van Emde Boas, "Information, Computation and the Arrow of Time", in *Computability in Context: Computation and Logic in the Real World*, edited by S. Barry Cooper (University of Leeds, UK) & Andrea Sorbi (Università degli Studi di Siena, Italy), 2011.
- P.W. Adriaans, "Between Order and Chaos: The Quest for Meaningful Information", Theory of Computing Systems, Volume 45, Issue 4 (July 2009), Special Issue: Computation and Logic in the Real World (guest editors: S. Barry Cooper, Elvira Mayordomo, Andrea Sorbi), pp. 650-674, 2009.
- https://plato.stanford.edu/entries/recursive-functions/
- https://plato.stanford.edu/entries/axiom-choice/
- P. Grünwald, P. Vitányi, "Algorithmic complexity"
- Normalized Google Distance