A hybrid neural network model of binocular rivalry

Mr Sten M Andersen

(Figures near bottom of page)



The phenomenon of binocular rivalry (BR) is described and differing explanations and models that have been proposed are examined. The conflicting psychophysical and neurobiological evidence for where in brains BR is located is used to argue that a more complex model of binocular rivalry is needed, in which BR is not situated at any one place, but can occur throughout the visual processing stream. Such a model is developed and a partial computer implementation of this new model is tested. The hypothesis that simple stimuli (as opposed to complex stimuli) will rival in the lower parts of the system is supported, but serious doubts are cast upon the model’s biological plausibility when the model is discussed in more detail. Modifications to the model are suggested, and directions for future research are put forward.


A hybrid neural network model of binocular rivalry


When two conflicting images are presented to the eyes, the images rival, that is, the two images will alternate in visibility every few seconds. The images upon the retinas do not change, but the percept does, hence, binocular rivalry has been seen as a tool to explore the neuronal correlates of conscious perceptions. Many models of binocular rivalry have been proposed, most focusing on either top-down or bottom-up processes to explain the phenomenon. In the current paper, a neural network architecture is suggested in which such restraints are not imposed; rather, it is proposed that rivalry can occur both early and late in visual processing, depending on the complexity of the stimuli. A computer implementation of the first part of the model allowed exploration of this notion of distributed rivalry. To clarify the motivation for the current model, a brief overview of the literature on binocular rivalry is given.


Binocular rivalry

The phenomenon of binocular rivalry was documented as early as 1760 by Dutour (O'Shea, 1999). In the late 19th and early 20th century, it was argued by Hermann Ludwig Ferdinand von Helmholtz, William James, and Charles Sherrington that what was rivalling in binocular rivalry was the stimuli, that is, it was argued that rivalry was a high-level process where two fully processed interpretations of the retinal images competed for conscious attention. Levelt (1965) introduced the idea that rivalry was a low-level process, that rivalry was between the eyes, not the stimuli, and that what was rivalling was image primitives. The discovery by Fox and Rasche (1969) that an image is suppressed for shorter time periods as its contrast increases, prompted Blake to look for the threshold of contrast for which rivalry will occur (Blake, 1977). Such a threshold was found, and to explain these and other findings, Blake (1989) posited the existence of binocular and monocular neurons. That is, it was proposed that there are neurons that receive excitatory input from the left eye, neurons that receive excitatory input from the right eye, and neurons that receive excitatory input from both eyes. Rivalry was seen as a result of "reciprocal inhibition between feature-detecting neurons in early vision" (Blake, 2001, p. 27).

Both "stimulus rivalry" and "eye" rivalry are now used as explanatory models. Evidence exists to support both theories, but the evidence seems contradictory. Blake, Westendorf and Overton (1980) showed that if the two stimuli are swapped just as one has become dominant, the dominant stimuli will be rendered invisible, and the suppressed one will be seen. This result supports the notion of "eye" rivalry. Counter to this finding, Logothetis, Leopold, and Sheinberg (1996) found that if both rival targets were flickered at 18 Hz and exchanged between the two eyes every 333 ms, observers would still report dominance of one stimuli lasting for seconds, indicating that rivalry happens between stimuli, not between the eyes. To investigate the generality of this finding, Lee and Blake (1999) examined rivalry at slower exchange rates and lower spatial frequency gratings than those used by Logothetis et al. (1996), and found that "stimulus rivalry" was dependent on the 18 Hz flicker. Nevertheless, still more evidence for "stimulus rivalry" was reported by Logothetis and co-workers. Whereas activity of neurons in V1 does not seem to correlate strongly with perceived (i.e., dominant) stimulus (Leopold and Logothetis, 1996), neurons in temporal areas thought to be involved in complex object recognition seem to be in synchrony with the monkey's reports of stimuli dominance (Sheinberg and Logothetis, 1997). These findings have prompted Blake (2001) to argue that the neural substrates of rivalry might need to be rethought. Clearly, the discrepancies between psychophysical studies and neurobiological evidence open the possibility for a third kind of model, where rivalry is not seen as occurring at any one place, but is rather spread out through the visual information processing systems of the brain.

It has been proposed that rivalry occur at multiple stages (Blake, 1995), or even that it is "an oversimplification to speak of rivalry 'occurring' at any one particular neural locus" (Blake, 2001, p. 32). Blake (2001) cites several brain imaging studies that seem to show that rivalry can be detected all through the visual system, with the traces of rivalry being stronger at the higher stages. Thus, in the current paper, a neural network architecture is suggested in which such restraints are not imposed; rather, the model will allow for rivalry both early and late in the visual processing. The aim of the study was to explore whether or not such a model could produce predictable outcomes consistent with well-known characteristics of binocular rivalry. It was hypothesised that rivalry would be found at several stages throughout the information processing, and that more complex stimuli would show more evidence of rivalry higher up in the processing "hierarchy."

Owing to time constraints, only an abridged version of the model, implemented by Self-Organising Maps (SOMs), could be investigated. An overview of SOMs is given next, before the full model, the abridged model, and appropriately revised aims and hypothesis are presented.



Self-Organising Maps (SOMs)

Self-Organising Maps (SOMs – also called Kohonen networks) (Kohonen, 1995) were chosen for the present model of the striate for two main reasons. First, SOMs are a type of unsupervised neural networks (NNs); that is, they learn without explicit feedback as to what constitutes correctness. This seems a primary constraint when choosing a NN to model any part of a biological brain. Second, SOMs categorises the input in a fashion that seems analogous to the ocular dominance patterns seen in primary visual cortex (Wolf, Bauer, Pawelzik, & Geisel, 1996) , and several researchers have used this analogy to investigate and predict structure and function of structures in this area of the brain (e.g., Mitchison & Swindale, 1999; Riesenhuber, Bauer, Brockmann, & Geisel, 1998; Wiemer, Burwick, & von Seelen, 2000) . In the present model, however, each node (i.e., artificial neuron) must be seen as analogous to a whole cluster of neurons, rather than to a single or a few biological neurons.  


The model

The model itself will be described in the next two sections. The first section describes the unabridged model using mostly biological terms. The second section describes the drastically cut-down version that was implemented; these cut-downs were essential to be able to complete a working computer implementation within the given time-constraints. This cut-down version is called the abridged model, and is described from a more implementation-centric point of view.

The unabridged model. The fully fledged model consists of four layers, each feeding into the next layer, but also receiving reciprocal connections from one, or more, higher layers (see Figure 1). Some lower layers also project to several higher layers. Each layer is a separate kind of network, corresponding to the kind of network found in wetware. The “eyes / Lateral Geniculate Nucleus (LGN)” is the input layer, and consist of i = 2 * (x * y) inputs; that is, the two eyes receive images represented as pixels on a Cartesian plane. This structure sends its output to both the striate cortex (V1) and the extra striate layer (V+). Consistent with current neurobiological understanding, V1 is a self-organising network comprised of both monocular and binocular neurons. That is, V1 has an input layer with neurons that receive connections from either the left eye or the right eye, and a hidden layer that receives input from any number of these input neurons. V1 is topographically organised. For simplicity, it is assumed that about 1/3rd of the neurons in V1 are binocular, and that the other 2/3rds are split evenly between the two eyes (Figure 2). Output from the eyes / LGN to V1 will alternate between left and right eye information, mimicking ocular dominance columns (Figure 2). The real-life massive divergence from the eyes / LGN to V1 implies that V1 will consist of more neurons than the earlier structure. In the present model, the number of neurons will be in the vicinity of 100 * i. From V1 onwards, information is projected to the extra striate (V+). That is, the extra striate receives projections from both the eye / LGN-structure, and V1. V+ will have a bigger proportion of binocular neurons than V1; about 70% is assumed. V+ will project to the IT area, which represents the more complex memory of the model. Finally, the IT area projects back to the LGN, modulating its output. Thus it can be seen that the unabridged model, consisting of four different kinds of neural networks where information can flow in several directions, is a fairly complex entity. A much abridged and somewhat revised model was designed as a first approximation of the unabridged model.

The abridged model. The model parts were collapsed into two different neural networks. The eyes / LGN were collapsed into one structure dubbed the striate; the extra-striate cortex and the IT area were collapsed into a second structure dubbed the IT.

The striate was modelled using three self-organising networks (SOMs), one representing neurons that only get information from the left eye, one representing neurons that only get information from the right eye, and one representing binocular neurons that get information from both eyes (Figure 3). Each SOM comprised 900 nodes, where one node represented a cluster of neurons, and each node had 1600 (40x40) connection weights, where the connection weights, when seen as a whole for one node, represented the stimuli for which that node was most likely to “fire”.

The IT area was modelled using a Hopfield network, but due to uncertainty about the test results from the IT, this part of the model will not be discussed further in the present report.

Thus, the revised aim of the present study was to investigate the abridged model of the visual system, exploring what sort of input would rival, if any, in the striate area of the model, as a preliminary for a bigger study in which rivalry would be investigated in the whole model. It was hypothesised that incongruent, simple-shape pictures would rival in the striate whereas incongruent, complex shapes would not. No congruent pictures were expected to rival.




The hybrid neural network was programmed in C++ to allow the authors maximum flexibility in designing the network, and because such an implementation would run faster than any off-the-shelf software (Appendix A). Training and testing of the network was performed on a 2.44 GHz, off-the-shelf IBM compatible single-processor personal computer with 1 GB of memory.

For training 40 black and white 50x40 pixel pictures were used, 20 of which represented simple geometric shapes (e.g., a triangle, a circle, a square), and 20 of which represented more complex shapes (e.g., a triangle and a square, a “house”).

For testing, 160 pairs of 50x40 pixel black and white pictures divided into 8 (23) groups were used (Table 1). Groupings were based on whether or not the pictures were familiar to the networks (i.e., they were part of the training set), whether the pictures were simple or complex, and whether or not the pictures presented to the two “eyes” were the same picture or different pictures (i.e., rivalry was expected for the second but not the first condition).



Before training, all weights were assigned pseudo-random1 numbers between 0 and 1. During training, the 40 training pictures were presented to the three SOMs in random order for 7,000, 10,000, and 30,000 iterations on three consecutive runs. Out of the 50 horizontal pixels for each of the 40 lines, the 40 leftmost pixels were input to the left-SOM, the 40 rightmost pixels were input to the right-SOM, and the 30 pixels in the middle were input to the binocular-SOM (Figure 4). To make the binocular pictures the same size as the left and right pictures, five black (0) pixels were padded on each side of every line (Figure 5). In this way, the network was trained to expect congruent information to the two “eyes.”

During testing, the 160 pairs of testing pictures were shown to the three trained networks (left-SOM, binocular-SOM, and right-SOM). The leftmost part of the first picture was fed to the left-SOM, the rightmost part of the second picture was fed to the right-SOM. For the binocular-SOM, the 30 middle pixels of the two pictures were merged in the following way. If two corresponding pixels were the same (i.e., either both black or both white), the pixel was fed as-is to the binocular-SOM. Otherwise, a random number between the numbers representing black (0) and white (1) was created2.

In response to each pair of pictures, the three networks outputted the Best Matching Unit (BMU): the node that most closely matched the input, where closeness was measured by Euclidian distance. The Euclidian distance between the left and the binocular BMU was subsequently calculated, and the Euclidian distance between the right and the binocular BMU was calculated. Finally, the absolute difference between these two numbers was taken; this was called the Rivalry Score (RS). A high RS would mean that the two “eyes” — the left and the right SOM — were seeing different things, and were thus “rivalling.” 



Results were similar for the 7,000, 10,000 and 30,000 iteration trials, and were therefore collated. A one-way between-subjects ANOVA revealed a main effect of group, F(7,152) = 4.1862, p < .001. Post-hoc comparisons using the Tukey HSD test revealed that incongruent pictures of unfamiliar, simple pictures (group 4) scored higher on the RS than all congruent pictures (groups 1, 2, 5, 6, p < .05 for all groups) (Figure 6). Group 4 pictures also scored higher on the RS than incongruent, familiar, simple pictures (group 3, p < .05), hence the unfamiliarity of the pictures seemed to add to the score.  



The results supported the hypothesis that incongruent, simple shapes would rival in the model striate, whereas complex shapes would not. Also, as expected, no congruent pictures rivalled. These results are consistent with the broader hypothesis that rivalry will be found at several stages (Blake, 1995) throughout the information processing stream, and that more complex stimuli will show more evidence of rivalry higher up in the processing "hierarchy," even though it remains to be seen what will happen further up the processing stream in a fuller version of the model.

The raw Rivalry Score (RS) used in the present study cannot signal whether or not rivalry “actually” occurred in the model. There was no threshold level, over which rivalry could be said to occur, and there was no final judge to decide whether or not rivalry had really occurred; all that could be stated was that the RS was significantly higher for some groups of pictures than for others. This admittedly vague definition as to whether or not the model displays rivalry mirrors Blake’s concern that it is "an oversimplification to speak of rivalry 'occurring' at any one particular neural locus" (2001, p. 32). Thus, rather than picking a relatively arbitrary cut-off point for what constituted rivalry in the model, it was decided to look for traces or evidence of rivalry.

It is worth while tracing what is happening in the model striate at depth, to gain an understanding of what is actually happening in the model, and to point out a plausible different interpretation of the results. As described above, the three SOMs had 900 nodes (or neurons) each and each node could represent a whole 40x40 input picture by its weights. Given that the training set only consisted of 40 different pictures, it is not only conceivable, but quite likely, that the SOMs never abstracted features from the inputs, but instead came to have nodes that represented every training picture perfectly. Thus, at testing time, when the left and right SOMs were asked to return the node that matched best with the test input (the BMU), they returned, in the case of familiar pictures, exactly the same picture as they were given, since this had already been perfectly learned (and a picture will always have the least Euclidian distance from itself, i.e., 0).

In the case of congruent pictures, the left and right pictures would be almost3 the same and the binocular picture would have approximately the same 30 middle pixels, plus five black pixels on both sides for every line. Thus, the Euclidian distance between the left and the binocular picture would mostly come from the padding on both sides of the binocular picture; and likewise for the distance between the binocular and the right picture. Since the left and right pictures in this case are almost identical, the distance left – binocular and the distance right – binocular should be approximately the same, and the RS would thus be near zero, i.e., no rivalry. It is interesting to note that the binocular-SOM could act as a sort of judge: if the left and right pictures are different, then the picture with the least Euclidian distance to the binocular picture could be seen as the dominant at a particular point in time. (Time is not incorporated into the present model.)

In the case of incongruent pictures then, the left-BMU and the right-BMU would be the two different test pictures themselves (minus the missing left or right edge). The binocular-BMU would be the node that matched most closely with a picture comprising of the left test picture on the left side and the right test picture on the right side, and random pixels in the middle (except where the BMUs are overlapping by chance). This would be a node on the binocular self-organising map somewhere between the nodes representing the left input and the right input. More specifically, this node would probably be about halfway between the left-representing and the right-representing node4. It is not obvious then, why RS should be high under this condition. The right side of the binocular BMU should be as different from the left BMU as the left side of the binocular BMU from the right BMU, approximately. In other words, the difference between the left BMU and the binocular BMU should be fairly high, and the distance between the right BMU and the binocular BMU should be fairly high, but these two distances should be approximately the same, cancelling each other out and giving a RS near 0. This line of reasoning suggests that RS might not be as good a measure of rivalry as initially thought. The results of the present study could then be explained as follows. The model behaves as just outlined for both congruent stimuli and complex incongruent stimuli, explaining that complex incongruent stimuli (groups 7, 8) did not rival (when RS is the measurement of rivalry) in the striate in the present study. However, that simple incongruent stimuli did rival could be due to the way the Euclidian distance is calculated. Euclidian distance is calculated by comparing the two pictures pixel-by-pixel. All pixels in the current study would have a value between 0 and 1, and the Euclidian distance is the accumulated differences between pixels. Thus, complex pictures would have more pixels that just happened to match (be near each other value-wise), as drawing-colour in the present study was white (1) and background colour was black (0). Simple pictures, on the other hand, would have large stretched of background only broken by a few pixels of white, and the probability of two white pixels overlapping (when comparing two pictures) would be quite small, thus, most white pixels would be counted as a whole 1 difference, and the accumulated sum of differences would grow quite fast. That incongruent, unfamiliar, simple pictures (group 4) had higher RS scores than incongruent, familiar, simple pictures (group 3) seems more difficult to explain; the present author has yet to figure out what elements of the model causes this behaviour.

It might be concluded that even though the model at first glance produced results quite consistent with the expectations and hypothesis, caution should be used when interpreting these results. A thorough investigation of the model has provided an explanation for the results that should make one highly reluctant to draw any conclusions about the biological basis of binocular rivalry, based on this particular model.

To rectify some of the problems mentioned above with the current model, future researchers might test the current model with fewer neurons (or nodes) in each SOM, giving the SOMs opportunity to abstract over several cases. Pictures could be given different background colours so that simple pictures would not automatically be very different from each other. Also, other measures than Euclidian distance could be used for testing, maybe other measures might be better for detecting rivalry. Further studies might extend the computer implementation to comprise the unabridged model; fewer neurons in each layer (compared to the number of training pictures), but more layers would render the implementation more biologically realistic. The assumption (implicit in this study) that nodes in SOMs can represent whole clusters of neurons without loss of fidelity should be examined, and if it is found not to hold, the “fewer neurons in more layers”-strategy could be the solution for this is as well. An extended model would incorporate time into its structure, allowing researchers to investigate whether or not the model displays alternations in visibility of images, which is characteristic for binocular rivalry in humans. This would also allow the comparison of the model performance with other data on binocular rivalry reported in the literature.


Blake, R. (1977). Threshold conditions for binocular rivalry. Journal of Experimental Psychology: Human Perception and Performance, 3, pp. 251-257.

Blake, R., (1989). A neural theory of binocular rivalry. Psychological Review, 96, pp.145-167.

Blake, R. (1995). Psychoanatomical strategies for studying human vision. In Early vision and beyond, T. Papathomas, C. Chubb, E. Kowler, A. Gorea (Eds.). MIT Press, Cambridge.

Blake, R. (2001). Primer on binocular rivalry, including controversial issues. Brain and Mind, 2, pp. 5-38.

Blake, R., Westendorf, D., & Overton, R. (1980). What is suppressed during binocular rivalry? Perception, 9, pp. 223-231.

Dutour, E. F. (1760). Discussion d'une question d'optique [Discussion on a question of optics]. l'Académie des Sciences. Mémoires de Mathématique et de physique présentés par Divers Savants, 3, pp.514-530.

Fox, R., & Rasche, F. (1969). Binocular rivalry and reciprocal inhibition. Perception and Psychophysics, 5, pp. 215-217.

Kohonen, T. (1995). Self-organizing maps. Berlin ; New York: Springer.

Lee, S. H., & Blake, R. (1999). Rival ideas about binocular rivalry. Vision Research, 39, pp. 1447-1454.

Leopold, D. A., & Logothetis, N. K (1996). Activity changes in early visual cortex reflect monkeys' percepts during binocular rivalry. Nature, 379, p. 549.

Levelt, W. (1965). On binocular rivalry. Soesterberg, The Netherlands, Institute for Perception RVO-TNO.

Liu, L. Tyler, C, & Schor, C. (1992). Failure of rivalry at low contrast: Evidence of a suprathreshold binocular summation. Nature, 379, pp. 549-554.

Logothetis, N. K., Leopold, D. A., & Sheinberg, D. L. (1996). What is rivalling during binocular rivalry? Nature, 380, pp. 621-624.

Mitchison, G. J., & Swindale, N. V. (1999). Can Hebbian volume learning explain discontinuities in cortical maps? Neural Computing, 11(7), 1519-1526.

O'Shea, R. P. (1999). Translation of Dutour (1760) [On-line]. Available: http://psy.otago.ac.nz/r_oshea/dutour60.html

Riesenhuber, M., Bauer, H., Brockmann, D., & Geisel, T. (1998). Breaking rotational symmetry in a self-organizing map model for orientation map development. Neural Computing, 10(3), 717-730.

Sheinberg, D. L., & Logothetis, N. K. (1997). The role of temporal cortical areas in perceptual organization. Proceedings of the National Academy of Science, USA, 94, pp. 3408-3413.

Wiemer, J., Burwick, T., & von Seelen, W. (2000). Self-organizing maps for visual feature representation based on natural binocular stimuli. Biological Cybernetics, 82(2), 97-110.

Wolf, F., Bauer, H. U., Pawelzik, K., & Geisel, T. (1996). Organization of the visual cortex. Nature, 382(6589), 306-307.


Appendix A: Code

Several version of the MSVC++ project including C++ code can be downloaded from http://www.stenmorten.com/CogSci/Code.htm . The latest version is ftp://ftp.stenmorten.f2s.com/www.stenmorten.f2s.com/KmZB/Brain030903_2114.zip



1. For the present purposes, the pseudo-random numbers created by the algorithm in C++ are “random enough”, that is, they have the needed mathematical properties, and hence they will be referred to as random from here on.

2. Taking the average of the two numbers would be presupposing that opposing information is merged in the early parts of the visual processing system, something that is definitively not supported by the empirical evidence; anyway, in the present model, the average would always be (1+0)/2 =0.5. ANDing the numbers together would always render the pixel invisible (0) – effectively saying that if one eye cannot see it, then the other eye cannot see it either (so if the model “closed one eye”, it would be blind). Conversely, ORing the numbers would always render the pixel on (or 1).

3. The pictures are not exactly the same as the test inputs, because the left-SOM only learned the 40 leftmost of the full 50 pixels of every training picture, and the right-SOM only learned the 40 rightmost pixels.

4. This is because of the way the SOMs learn during training. When a training picture is presented to the SOM, the BMU is found and the Euclidian distance between the training picture and the BMU is calculated. Then, in a circular fashion, the nodes around the BMU are all updated to be more like the training picture, with the BMU itself being made most like the training picture and the nodes around it less so, the training factor decreasing as the distance between the BMU and the node to be trained increases.


Table 1

Test picture groups                                                             


Group              Congruent                    Familiar            Simple


1                      v                                  v                      v

2                      v                                                          v

3                                                          v                      v

4                                                                                  v

5                      v                                  v

6                      v                                             

7                                                          v






Figure Captions


Figure 1. High-level view of the model. Thick arrows indicate main flow of information. Arrows from left to right represent information flowing "up" in the processing "hierarchy", and is assumed to get more complex as it goes. Arrows from right to left indicate recurrent connections. i sends information to both V1 and V+, and receives recurrent connections from both V1 and IT. Each layer is a separate kind of network corresponding to the type of network found in wetware.

Figure 2. The eyes/LGN; here shown with i = 2*(3) for simplicity. This structure consists of approximately 50% binocular neurons. Sub-layer 1 is monocular; whereas about half of the neurons of sub-layer 2 to m are binocular. Information flows from left to right. Only excitatory connections are shown.


Figure 3. The abridged network. Three self-organising maps (SOMs) representing neurons of the left eye, neurons of the right eye, and binocular neurons, feed into the Hopfield network, representing the inferotemporal (IT) cortex.


Figure 4. Training input. The left SOM, the right SOM, and the binocular SOM are trained on different parts of the same pictures.


Figure 5. The binocular SOM late in training. Every trained node in the binocular SOM has black (0) padding on both sides.


Figure 6. Mean Rivalry Score (RS) for each group of stimuli.


Figure 1.




Figure 2.




Figure 3.


Figure 4.


Figure 5.


Figure 6.