# From Machine Learning to Machine Reasoning

## Introduction

Learning and reasoning are both essential abilities associated with intelligence. Consequently, machine learning and machine reasoning have both received considerable attention over the short history of computer science. The statistical nature of machine learning is now well understood, but the ideas behind machine reasoning remain much more elusive. Converting ordinary data into a set of logical rules proves very challenging: searching the discrete space of symbolic formulas leads to combinatorial explosion <ref>Lighthill, J. "Artificial intelligence: a general survey." In Artificial intelligence: a paper symposium. Science Research Council.</ref>. Algorithms for probabilistic inference <ref>Pearl, J. "Causality: models, reasoning, and inference." Cambridge: Cambridge University Press.</ref> still suffer from unfavourable computational properties <ref>Roth, D. "On the hardness of approximate reasoning" Artificial Intelligence, 82, 273–302.</ref>. Tractable inference algorithms do exist, but they come at the price of reduced expressive capabilities, in both logical and probabilistic inference.

Humans display neither of these limitations.

The ability to reason is not the same as the ability to make logical inferences. The way that humans reason provides evidence to suggest the existence of a middle layer, already a form of reasoning, but not yet formal or logical. Informal logic is attractive because we hope to avoid the computational complexity that is associated with combinatorial searches in the vast space of discrete logic propositions.

This paper shows how deep learning and multi-task learning can be leveraged as a rudimentary form of reasoning to help solve a task of interest.

This approach is explored through a number of auxiliary tasks.

## Auxiliary Tasks

The usefulness of auxiliary tasks was examined in the context of two problems: face-based identification and natural language processing. Both examples show how an easier task (such as determining whether two faces belong to different people) can be used to boost performance on a harder task (such as identifying faces) by reusing the components trained on the easier task.

**Face-based Identification**

Identifying a person from face images is challenging. It remains expensive to collect and label millions of images representing the face of each subject with a good variety of positions and contexts. However, it is easier to collect training data for the slightly different task of telling whether two faces in images represent the same person or not: two faces in the same picture are likely to belong to two different people; two faces in successive video frames are likely to belong to the same person. These two tasks have much in common: image analysis primitives, feature extractors, and part recognizers trained on the auxiliary task can help solve the original task.

The figure below illustrates a transfer learning strategy involving three trainable modules. The preprocessor P computes a compact face representation from the image, the comparator D compares the representations of two images, and the classifier C labels the face. We first assemble two instances of the preprocessor P and one comparator D and train this model with the abundant labels for the auxiliary task. Then we assemble another instance of P with the classifier C and train the resulting model using a limited number of labelled examples from the original task.
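
The composition described above can be sketched in a few lines of Python. This is only an illustration of how the same preprocessor P is reused in both assemblies; the layer sizes, the random (untrained) weights, and the distance-based comparator are assumptions made for the example, and no training step is shown.

```python
import math
import random

random.seed(0)

def make_linear(n_in, n_out):
    """Random weight matrix (list of rows); a stand-in for trained parameters."""
    return [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def apply_linear(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

class Preprocessor:
    """P: maps a raw image vector to a compact face representation."""
    def __init__(self, n_pixels, n_features):
        self.weights = make_linear(n_pixels, n_features)

    def __call__(self, image):
        return [math.tanh(v) for v in apply_linear(self.weights, image)]

class Comparator:
    """D: scores how likely two representations show the same person (assumed form)."""
    def __call__(self, rep_a, rep_b):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(rep_a, rep_b)))
        return math.exp(-dist)  # 1.0 for identical representations, smaller otherwise

class Classifier:
    """C: maps a representation to a person label (index of the best score)."""
    def __init__(self, n_features, n_people):
        self.weights = make_linear(n_features, n_people)

    def __call__(self, rep):
        scores = apply_linear(self.weights, rep)
        return max(range(len(scores)), key=scores.__getitem__)

P = Preprocessor(n_pixels=16, n_features=4)
D = Comparator()
C = Classifier(n_features=4, n_people=3)

def auxiliary_model(image_a, image_b):
    # Both branches call the same P, so its parameters are shared.
    return D(P(image_a), P(image_b))

def original_model(image):
    # The P trained on the auxiliary task is reused under the classifier C.
    return C(P(image))

image = [random.random() for _ in range(16)]
same_score = auxiliary_model(image, image)
label = original_model(image)
```

Because `auxiliary_model` and `original_model` both call the same object `P`, anything learned by P on the auxiliary task is automatically available to the original task.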

**Natural Language Processing**

The auxiliary task in this case (left diagram of the figure below) is identifying whether a sentence is correct or not. Training on this task creates an embedding for words in a 50-dimensional space. This embedding can then be used on the primary problem (right diagram of the figure below) of producing tags for the words. Note that the word embedding module W is shared between the two tasks.

## Reasoning Revisited

Little attention has been paid to the rules that describe how to assemble trainable models that perform specific tasks. However, these composition rules play an extremely important role, as they describe algebraic manipulations that let us combine previously acquired knowledge in order to create a model that addresses a new task.

We now draw a bold parallel: "algebraic manipulation of previously acquired knowledge in order to answer a new question" is a plausible definition of the word "reasoning".

Composition rules can be described with very different levels of sophistication. For instance, graph transformer networks (depicted in the figure below) <ref>Bottou, L., LeCun, Y., & Bengio, Y. "Global training of document processing systems using graph transformer networks." In Proc. of computer vision and pattern recognition (pp. 489–493). New York: IEEE Press.</ref> construct specific recognition and training models for each input image using graph transduction algorithms. The specification of the graph transducers then should be viewed as a description of the composition rules.

## Probabilistic Models

Graphical models describe the factorization of joint probability distributions into lower-dimensional conditional distributions with specific independence assumptions. The probabilistic rules then induce an algebraic structure on the space of conditional probability distributions, describing relations in an arbitrary set of random variables. Many refinements have been devised to make the parametrization more explicit. The plate notation<ref name=BuW> Buntine, Wray L "Operations for learning with graphical models" in The Journal of Artificial Intelligence Research, (1994). </ref> compactly represents large graphical models with repeated structures that usually share parameters. More recent works propose considerably richer languages to describe large graphical probabilistic models. Such high order languages for describing probabilistic models are expressions of the composition rules described in the previous section.

## Reasoning Systems

We are no longer fitting a simple statistical model to data. Instead, we are dealing with a more complex object consisting of (a) an algebraic space of models, and (b) composition rules that establish a correspondence between the space of models and the space of questions of interest. We call such an object a "reasoning system".

Reasoning systems are far less predictable than statistical models: they vary in expressive power, predictive abilities, and computational requirements. A few examples include:

*First order logic reasoning*- Consider a space of models composed of functions that predict the truth value of a first order logic formula as a function of its free variables. This space is highly constrained by algebraic structure; hence, if we know some of these functions, we can apply logical inference to deduce or constrain other functions. First order logic is highly expressive because the bulk of mathematics can be formalized as first order logic statements <ref>Hilbert, D., & Ackermann, W."Grundzüge der theoretischen Logik." Berlin: Springer.</ref>. However, it is not sufficient for expressing natural language: every first order logic formula can be expressed in natural language but the converse is not true. Finally, first order logic usually leads to computationally expensive algorithms.

*Probabilistic reasoning*- Consider a space of models formed by all the conditional probability distributions associated with a set of predefined random variables. These conditional distributions are highly constrained by algebraic structure; hence, we can apply Bayesian inference to form deductions. Probability models are computationally less expensive, but this comes at the price of lower expressive power: probability theory can be described by first order logic but the converse is not true.

*Causal reasoning*- The events "it is raining" and "people carry open umbrellas" are highly correlated and predictive: if people carry open umbrellas, then it is likely that it is raining. This does not, however, tell you the consequences of an intervention: banning umbrellas will not stop the rain.

*Newtonian Mechanics*- Classical mechanics is an example of the great predictive powers of causal reasoning. Newton's three laws of motion make very accurate predictions about the motion of bodies in our universe.

*Spatial reasoning*- The way a visual scene changes when the viewpoint changes is also subject to algebraic constraints.

*Social reasoning*- Changes of viewpoints also play a very important role in social interactions.

*Non-falsifiable reasoning*- Examples of non-falsifiable reasoning include mythology and astrology. Just like non-falsifiable statistical models, non-falsifiable reasoning systems are unlikely to have useful predictive capabilities, as their reliability cannot be ascertained.
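
The difference between the probabilistic and causal examples above can be made concrete with a small numerical sketch. The probabilities below are made-up values chosen only for illustration, and the model assumes that rain causes umbrellas and not the other way round.

```python
# Hypothetical numbers: P(rain) and P(umbrella | rain).
P_RAIN = 0.3
P_UMBRELLA_GIVEN_RAIN = {True: 0.9, False: 0.05}

def joint(rain, umbrella):
    """Joint probability P(rain, umbrella) under the factorization P(rain) P(umbrella | rain)."""
    p_rain = P_RAIN if rain else 1 - P_RAIN
    p_umb = P_UMBRELLA_GIVEN_RAIN[rain]
    return p_rain * (p_umb if umbrella else 1 - p_umb)

def rain_given_observed(umbrella):
    """Bayesian inference: P(rain | umbrella observed)."""
    return joint(True, umbrella) / (joint(True, umbrella) + joint(False, umbrella))

def rain_given_do(umbrella):
    """Intervention: forcing the umbrella variable removes its dependence on rain,
    so the probability of rain stays at its prior whatever value is forced."""
    return P_RAIN

observed = rain_given_observed(True)   # seeing umbrellas makes rain much more likely
banned = rain_given_do(False)          # banning umbrellas leaves rain at its prior
```

Observing umbrellas raises the probability of rain well above its prior, whereas forcing umbrellas away leaves it unchanged, which is the sense in which a purely correlational model cannot predict the effect of an intervention.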

It is desirable to map the universe of reasoning systems, but unfortunately we cannot expect such theoretical advances on schedule. We can, however, nourish our intuitions by empirically exploring the capabilities of algebraic structures designed for specific applicative domains.

The replication of essential human cognitive processes such as scene analysis, language understanding, and social interaction forms an important class of applications. These processes probably include a form of logical reasoning, because we are able to explain our conclusions with logical arguments. However, the actual processes happen without conscious involvement, suggesting that the full complexity of logical reasoning is not required.

The following sections describe more specific ideas investigating reasoning systems suitable for natural language processing and vision tasks.

## Association and Dissociation

We consider again a collection of trainable modules. The word embedding module W computes a continuous representation for each word of the dictionary. The association module A is a trainable function that takes two vectors in the representation space and produces a single vector in the same space, which is supposed to represent the association of the two inputs. Given a sentence segment composed of *n* words, the figure below shows how *n-1* applications of the association module reduce the sentence segment to a single vector. We would like this vector to be a representation of the meaning of the sentence and each intermediate result to represent the meaning of the corresponding sentence fragment.

There are many ways of bracketing the same sentence, and each bracketing splits it into different fragments. The figure below, for example, corresponds to the bracketing "*((the cat) (sat (on (the mat))))*". In order to determine which bracketing splits the sentence into the most meaningful fragments, we introduce a new scoring module R, which takes the representation of a sentence fragment and measures how meaningful that fragment is.

The idea is to apply the R module to every intermediate result and sum all of the scores to obtain a global score. The task is then to find the bracketing that maximizes this score. There is also the challenge of training these modules to achieve the desired function. The figure below illustrates a model inspired by Collobert et al.<ref>Collobert, R., & Weston, J. "Fast semantic extraction using a novel neural network architecture." In Proc. 45th annual meeting of the association of computational linguistics (ACL) (pp. 560–567).</ref><ref>Collobert, R. "Deep learning for efficient discriminative parsing." In Proc. artificial intelligence and statistics (AISTAT).</ref> Training uses stochastic gradient descent: during each iteration, a short sentence segment is randomly selected from a large corpus and bracketed as shown in the figure. An arbitrary word is then replaced by a random word from the vocabulary, which is likely to produce a meaningless fragment. The parameters of all the modules are then adjusted with a simple gradient descent step so that the corrupted fragment scores lower than the genuine one.
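
The scoring of a bracketing can be sketched as follows. The vector size, the random untrained weights, the `tanh` form of the association module, and the linear scoring module are all assumptions made for illustration. The greedy search at the end is a simple stand-in chosen for brevity, not the procedure used in the paper, and it is not guaranteed to find the highest-scoring bracketing.

```python
import math
import random

random.seed(0)
DIM = 4  # size of the representation space (arbitrary for this sketch)

# Untrained stand-ins for the parameters of the modules A and R.
A_WEIGHTS = [[random.uniform(-0.5, 0.5) for _ in range(2 * DIM)] for _ in range(DIM)]
R_WEIGHTS = [random.uniform(-0.5, 0.5) for _ in range(DIM)]
EMBEDDING = {}  # W: word -> vector, filled lazily with random vectors

def embed(word):
    if word not in EMBEDDING:
        EMBEDDING[word] = [random.uniform(-1, 1) for _ in range(DIM)]
    return EMBEDDING[word]

def associate(left, right):
    """A: two vectors in the representation space -> one vector in the same space."""
    joined = left + right
    return [math.tanh(sum(w * x for w, x in zip(row, joined))) for row in A_WEIGHTS]

def saliency(vec):
    """R: scalar score for one intermediate result."""
    return sum(w * x for w, x in zip(R_WEIGHTS, vec))

def evaluate(tree):
    """tree is a word or a pair (left, right). Returns (vector, summed score)."""
    if isinstance(tree, str):
        return embed(tree), 0.0
    left_vec, left_score = evaluate(tree[0])
    right_vec, right_score = evaluate(tree[1])
    vec = associate(left_vec, right_vec)
    return vec, left_score + right_score + saliency(vec)

def leaves(tree):
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])

def greedy_bracketing(words):
    """Repeatedly merge the adjacent pair whose association scores highest."""
    items = [(w, embed(w)) for w in words]
    while len(items) > 1:
        best = max(range(len(items) - 1),
                   key=lambda i: saliency(associate(items[i][1], items[i + 1][1])))
        (left_tree, left_vec), (right_tree, right_vec) = items[best], items[best + 1]
        items[best:best + 2] = [((left_tree, right_tree), associate(left_vec, right_vec))]
    return items[0][0]

paper_tree = (("the", "cat"), ("sat", ("on", ("the", "mat"))))
vector, score = evaluate(paper_tree)
greedy_tree = greedy_bracketing("the cat sat on the mat".split())
```

With six words, `evaluate` applies the association module five times, matching the *n-1* applications mentioned earlier in this section.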

In order to investigate how well the system maps words to the representation space, all two-word sequences of the 500 most common words were constructed and mapped into the representation space. The figure below shows the closest neighbors in the representation space of some of these sequences.

The dissociation module D is the opposite of the association module, that is, a trainable function that computes two representation space vectors from a single vector. When its input is a meaningful output of the association module, its output should be the two inputs of the association module. Stacking one instance of the association module and one instance of the dissociation module is equivalent to an auto-encoder.

The association and dissociation modules can be seen as similar to the `cons`, `car`, and `cdr` primitives of the Lisp programming language. These primitives are used to construct a new object from two individual objects (`cons`, "association") or to extract the individual objects (`car` and `cdr`, "dissociation") from a constructed object. However, there is an important difference. The representation in Lisp is discrete, whereas the representation here is in a continuous vector space. This limits the depth of the structures that can be constructed (because of limited numerical precision), while at the same time making other vectors in the numerical proximity of a representation meaningful as well. This latter property makes search algorithms more efficient, as it is possible to follow a gradient (instead of performing discrete jumps). Note that the presented idea of association and dissociation in a vector space is very similar to what is known as Vector Symbolic Architectures.<ref>Gayler, Ross W. "Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience." arXiv preprint cs/0412059 (2004).</ref>

Association and dissociation modules are not limited to natural language processing tasks. A number of state-of-the-art systems for scene categorization and object recognition use a combination of strong local features, such as SIFT or HOG features, consolidated along a pyramidal structure. Similar pyramidal structures have been associated with the visual cortex. However, pyramidal structures work poorly as image segmentation tools. For example, the figure below shows that a large convolutional neural network provides good object recognition accuracy but coarse segmentation. This poor performance is due to the fixed geometry of the spatial pooling layers: the lower layers aggregate local features according to a predefined pattern and pass them to the upper layers, and this aggregation causes poor spatial and orientation accuracy. One approach to resolving this drawback is a parsing mechanism in which intermediate representations can be attached to image patches.

The use of association and dissociation modules of the sort described in this section has been given a more general treatment in recent work on recursive neural networks, which similarly apply a single function to a sequence of inputs in a pairwise fashion to build up distributed representations of data (e.g. natural language sentences or segmented images).<ref> Socher, R. et al. "Semantic compositionality through recursive matrix-vector spaces" EMNLP (2012). </ref><ref> Socher, R. et al. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank" EMNLP (2013). </ref> A standard recurrent network can also be thought of as a special case of this approach in which the recursive application always proceeds left to right through the input sequence (i.e. there is no branching in the tree produced by unfolding the recursion through time).
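
The special case mentioned above can be checked with a toy example. The `associate` function here is a fixed, hypothetical stand-in for a trained module, and the first word vector plays the role of the initial recurrent state; both are simplifications made for illustration.

```python
import math
from functools import reduce

def associate(left, right):
    """Hypothetical fixed association function standing in for a trained module."""
    return [math.tanh(0.5 * l + 0.25 * r) for l, r in zip(left, right)]

def evaluate(tree):
    """tree is either a vector (list of floats) or a pair (left, right) of trees."""
    if isinstance(tree, tuple):
        return associate(evaluate(tree[0]), evaluate(tree[1]))
    return tree

words = [[0.1, 0.2], [0.3, -0.1], [-0.4, 0.5], [0.2, 0.2]]

# Recurrent view: fold the sequence from left to right.
recurrent = reduce(associate, words)

# Recursive view: the same computation as a fully left-branching tree.
left_branching = (((words[0], words[1]), words[2]), words[3])
recursive = evaluate(left_branching)

# A different bracketing generally gives a different representation.
balanced = evaluate(((words[0], words[1]), (words[2], words[3])))
```

The left-to-right fold and the left-branching tree perform the same operations in the same order, so `recurrent` and `recursive` are identical, while the balanced tree yields a different vector.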

Finally, we envision modules that convert image representations into sentence representations and conversely. Given an image, we could parse the image and convert the final image representation into a sentence representation. Conversely, given a sentence, we could produce a sketch of the associated image by similar means.

## Universal Parser

The figure below shows a model of short-term memory (STM) capable of two possible actions: (1) inserting a new representation vector into the short-term memory, and (2) applying the association module A to two representation vectors taken from the short-term memory and replacing them with the combined representation vector. Each application of the association module is scored using the saliency scoring module R. The algorithm terminates when the STM contains a single representation vector and there are no more representation vectors to insert.

The algorithm design choices determine which data structure is most appropriate for implementing the STM. In English, nearly all syntactically meaningful sentence segments are contiguous sequences of words, so it is attractive to implement the STM as a stack and construct a shift/reduce parser.
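
A minimal sketch of such a shift/reduce loop is given below, with the STM implemented as a Python list used as a stack. The weights are random and untrained, and the rule for choosing between the two actions (reduce when the saliency of the merged vector exceeds a threshold, or when nothing is left to insert) is an assumption made for this example rather than the policy of the paper.

```python
import math
import random
from collections import deque

rng = random.Random(1)
DIM = 3  # size of the representation space (arbitrary for this sketch)

# Untrained stand-ins for the parameters of the modules A and R.
A_WEIGHTS = [[rng.uniform(-0.5, 0.5) for _ in range(2 * DIM)] for _ in range(DIM)]
R_WEIGHTS = [rng.uniform(-0.5, 0.5) for _ in range(DIM)]

def associate(left, right):
    joined = left + right
    return [math.tanh(sum(w * x for w, x in zip(row, joined))) for row in A_WEIGHTS]

def saliency(vec):
    return sum(w * x for w, x in zip(R_WEIGHTS, vec))

def shift_reduce(word_vectors, threshold=0.0):
    """Parse a sequence of word vectors; returns (final vector, list of actions)."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    queue = deque(word_vectors)  # representation vectors still to insert
    stack = []                   # the short-term memory
    actions = []
    while queue or len(stack) > 1:
        candidate = associate(stack[-2], stack[-1]) if len(stack) >= 2 else None
        must_reduce = not queue  # nothing left to insert
        if candidate is not None and (must_reduce or saliency(candidate) > threshold):
            stack[-2:] = [candidate]        # action (2): associate the top two vectors
            actions.append("reduce")
        else:
            stack.append(queue.popleft())   # action (1): insert a new vector
            actions.append("shift")
    return stack[0], actions

sentence = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]
vector, actions = shift_reduce(sentence)
```

Whatever the decision rule, a sentence of *n* words is always processed with *n* shifts and *n-1* reductions, and the loop stops exactly when the stack holds a single vector and the queue is empty.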

## More Modules

The previous sections discussed the association and dissociation modules. Here, we discuss a few more modules that perform predefined transformations on natural language sentences; modules that implement specific visual reasoning primitives; and modules that bridge the representations of sentences and the representations of images.

- Operator grammars <ref>Harris, Z. S. "Mathematical structures of language." Volume 21 of Interscience tracts in pure and applied mathematics.</ref> provide a mathematical description of natural languages based on transformation operators.
- There is also a natural framework for such enhancements in the case of vision. Modules working on the representation vectors can model the consequences of various interventions.

## Representation Space

The previous models use functions operating on a low-dimensional vector space, but modules with similar algebraic properties could be defined on different representation spaces. Such choices have a considerable impact on the computational and practical aspects of the training algorithms.

- In order to provide sufficient capabilities, trainable functions operating on low-dimensional vectors must often be designed with nonlinear parameterizations. The training algorithms are simple extensions of the multilayer network training procedures, using back-propagation and stochastic gradient descent.
- Sparse vectors in much higher dimensional spaces are attractive because they provide the opportunity to rely more on trainable modules with linear parameterization.
- The representation space can also be a space of probability distributions defined on a vector of discrete random variables. With this representation, the learning algorithms can be expressed as stochastic sampling, in which sampling at regularly spaced locations is replaced by sampling at non-uniformly spaced locations. Gibbs sampling and, more generally, Markov chain Monte Carlo are two prominent techniques for this purpose.
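
As an illustration of the sampling techniques named in the last item, the following sketch runs a Gibbs sampler on a made-up joint distribution over two binary variables. The distribution, the burn-in length, and the number of samples are arbitrary choices for the example and are not taken from the paper.

```python
import random

# Hypothetical joint distribution over two binary variables (x, y).
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def conditional_x(y):
    """P(x = 1 | y)."""
    return JOINT[(1, y)] / (JOINT[(0, y)] + JOINT[(1, y)])

def conditional_y(x):
    """P(y = 1 | x)."""
    return JOINT[(x, 1)] / (JOINT[(x, 0)] + JOINT[(x, 1)])

def gibbs(n_samples, burn_in=100, seed=0):
    """Alternately resample each variable from its conditional given the other."""
    rng = random.Random(seed)
    x, y = 0, 0
    samples = []
    for step in range(burn_in + n_samples):
        x = 1 if rng.random() < conditional_x(y) else 0
        y = 1 if rng.random() < conditional_y(x) else 0
        if step >= burn_in:
            samples.append((x, y))
    return samples

samples = gibbs(20000)
# Monte Carlo estimate of the marginal P(x = 1), whose exact value is 0.5 here.
estimate = sum(x for x, _ in samples) / len(samples)
```

The sampler only ever evaluates conditional distributions of one variable given the other, which is what makes this family of methods usable when the joint distribution is too large to enumerate.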

## Conclusions

The research directions outlined in this paper are intended to advance the practical and conceptual understanding of the relationship between machine learning and machine reasoning. Instead of trying to bridge the gap between machine learning and "all-purpose" inference mechanisms, we can algebraically enrich the set of manipulations applicable to a training system and build reasoning abilities from the ground up.

## Discussion

- Most of the machine learning discussed above is supervised learning; the paper does not discuss other learning techniques such as unsupervised learning or reinforcement learning. One could argue that unsupervised and reinforcement learning induce some kind of automatic machine reasoning about the search space they explore, aiming either to simplify it or to learn without explicit desired outputs to compare against.

- The paper makes the bold claim that, because humans do not need to think consciously to understand what they see and what they say, these processes do not use any formal analysis. This claim is problematic because our brains may perform these tasks at a subconscious level. As Moravec pointed out, "We are all prodigious Olympians in perceptual and motor areas, so good that we make the difficult look easy. Abstract thought, though, is a new trick, perhaps less than 100 thousand years old. We have not yet mastered it. It is not all that intrinsically difficult; it just seems so when we do it."<ref>Moravec H. (1988). Mind Children: The future of robot and human intelligence. Massachusetts, Harvard University Press.</ref>

## Bibliography

<references />