P-VMDNN @ ICDL-EPIROB 2017
Last September, I attended ICDL-EPIROB 2017, which was held in Lisbon, Portugal. I really enjoyed the conference and the beautiful city, and I also had a great time with old and new friends :)
This time, I was lucky enough to give an oral presentation. I presented my recent study on visuomotor learning, focusing on P-VMDNN (Predictive Visuo-Motor Deep Dynamic Neural Network).
The manuscript, the presentation slides (PDF), and the supplementary videos that I used during my talk can be downloaded below:
During and after my presentation, I received many insightful questions about the model. Sharing those questions (and my answers) might be helpful for those who are interested in the model, so you can find them below along with my short answers. Some of the questions might be examined more deeply in the future.
In this study, I haven’t dealt with the developmental aspect of imitation. Instead, the robot was explicitly tutored by the experimenter (kinesthetic teaching). Developmental aspects will also be considered in a future study.
During training, the model was trained to generate 1-step look-ahead predictions using the BPTT (backpropagation through time) algorithm in TensorFlow. The model’s learnable parameters (called variables in TensorFlow), including the weights, biases, and initial states, were optimized to minimize the prediction error in both pathways. Note that different initial states are obtained for each training dataset.
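As a rough illustration of the idea (not the actual P-VMDNN code), here is a minimal pure-Python sketch of BPTT on a single scalar recurrent unit, where the weight, the bias, and one initial state per training sequence are all optimized together. Everything in it is an assumption made for illustration: the toy unit has no input and simply generates each sequence from its initial state, and the names `forward`, `loss_and_grads`, and `train`, the target values, and the learning rate are hypothetical.

```python
import math

def forward(w, b, u0, steps):
    """Unroll a scalar RNN for `steps` steps from the initial internal state u0."""
    hs = [math.tanh(u0)]
    for _ in range(steps):
        u = w * hs[-1] + b         # internal state (before the activation)
        hs.append(math.tanh(u))    # activation, used here as the prediction
    return hs

def loss_and_grads(w, b, u0, target):
    """Squared prediction error and its BPTT gradients w.r.t. w, b and u0."""
    steps = len(target) - 1
    hs = forward(w, b, u0, steps)
    loss = 0.5 * sum((h - x) ** 2 for h, x in zip(hs, target))
    dw = db = 0.0
    du_next = 0.0                  # dL/du at step t+1 (zero past the last step)
    for t in range(steps, -1, -1):
        dh = (hs[t] - target[t]) + w * du_next
        du = dh * (1.0 - hs[t] ** 2)
        if t > 0:                  # w and b are not used to compute step 0
            dw += du * hs[t - 1]
            db += du
        du_next = du
    return loss, dw, db, du_next   # du_next now holds dL/du0

def train(targets, epochs=2000, lr=0.05):
    w, b = 0.5, 0.0
    u0 = [0.0 for _ in targets]    # one learnable initial state per sequence
    for _ in range(epochs):
        gw = gb = 0.0
        for k, target in enumerate(targets):
            _, dw, db, du0 = loss_and_grads(w, b, u0[k], target)
            gw += dw
            gb += db
            u0[k] -= lr * du0      # each sequence updates only its own u0
        w -= lr * gw               # weights are shared across sequences
        b -= lr * gb
    return w, b, u0

# Two made-up target sequences (illustrative values only).
targets = [[0.8, 0.6, 0.45, 0.35], [-0.8, -0.6, -0.45, -0.35]]
loss_before = sum(loss_and_grads(0.5, 0.0, 0.0, x)[0] for x in targets)
w, b, u0 = train(targets)
loss_after = sum(loss_and_grads(w, b, u0[k], x)[0]
                 for k, x in enumerate(targets))
print(loss_before, loss_after, u0)
```

The point of the sketch is only the bookkeeping: the weights are shared, while each sequence ends up with its own initial state after training.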
I received a lot of questions about this issue. Currently, the P-VMDNN model has lateral connections only at the highest-level layers. Recently, the model was extended to have lateral connections between the different levels as well. Since that work is still under review for publication, I can’t explain it in detail here. According to Fuster’s model, however, such lateral connections also seem biologically plausible.
Size of the Model. Please see the Manuscript.
Training Time. I ran the example source code (which will be available soon!). On my computer (Intel i7, 32GB RAM, GTX 1080Ti), training the model with 2 visuomotor patterns for 100,000 epochs took 11 hours, or roughly 0.4 seconds per epoch.
Computational Cost. Training is not demanding, but online prediction error minimization is a bit difficult.
If I just feed the visual observation into the model’s input, the model’s neuronal activation will be driven by those values (i.e., sensory entrainment). We take a different perspective on visual perception. In our model, perception is considered an active process of updating the internal states so as to minimize prediction error. In other words, the model’s neuronal activation is driven by the error minimization mechanism. It might also be interesting to investigate the balance between these two approaches.
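To make the idea of "perception as error minimization" a bit more concrete, here is a toy sketch under strong assumptions: a single scalar recurrent unit with fixed, made-up weights (`W`, `B`), where the observation is never fed into the model. Instead, the initial internal state is adjusted by gradient descent until the model’s own predictions match the observation. The gradient is approximated by finite differences to keep the example short; the real model uses BPTT, and all names here are hypothetical.

```python
import math

W, B = 0.9, 0.1   # hypothetical fixed ("already trained") weights

def generate(u0, steps):
    """Predictions generated from the initial internal state u0."""
    h = math.tanh(u0)
    out = [h]
    for _ in range(steps):
        h = math.tanh(W * h + B)
        out.append(h)
    return out

def prediction_error(u0, observed):
    predicted = generate(u0, len(observed) - 1)
    return 0.5 * sum((p - o) ** 2 for p, o in zip(predicted, observed))

def infer_state(observed, u0=0.0, lr=0.2, iters=500, eps=1e-5):
    """Update only the internal state; the weights stay fixed."""
    for _ in range(iters):
        # Central finite difference, used here instead of BPTT for brevity.
        grad = (prediction_error(u0 + eps, observed)
                - prediction_error(u0 - eps, observed)) / (2 * eps)
        u0 -= lr * grad
    return u0

observed = generate(0.8, 5)        # observation produced by a hidden state 0.8
inferred = infer_state(observed)   # recovered purely from prediction error
print(inferred)
```

In the entrainment approach, by contrast, `observed` would be fed in as input at every step and no state update of this kind would be needed.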
The visual prediction is generated in the visual pathway, which is based on P-MSTRNN (Predictive Multiple Spatio-Temporal Scales RNN). More specifically, it uses convolution operations with padding. For more information about P-MSTRNN, please refer to this paper.
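For readers unfamiliar with padded convolution, the following is a generic sketch in plain Python, not the P-MSTRNN implementation. It assumes zero padding that keeps the output the same size as the input and an odd-sized kernel; the function name `conv2d_same` is made up for this example. Like most deep learning libraries, it does not flip the kernel.

```python
def conv2d_same(image, kernel):
    """Convolve a 2D list with an odd-sized kernel, zero-padding the border
    so that the output has the same height and width as the input."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2          # padding on each side
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    y, x = i + di - ph, j + dj - pw
                    # Positions outside the image count as zeros (padding).
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out

ones = [[1.0] * 3 for _ in range(3)]
result = conv2d_same(ones, ones)
print(result)
```

With padding, the border pixels still get an output value, so the predicted frame keeps the size of the input frame.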
Both pathways. As the visuomotor pattern is processed along the hierarchy (from the lower to the higher levels), the model extracts the latent features in the pattern. That is, the lower-level layers encode specific information, whereas the higher-level layers encode abstract information about the visuo-proprioceptive patterns.
The internal states refer to the neurons’ values (the sum of inputs) before the activation function. In particular, the internal states at the very first time step are called the initial states. The initial states are obtained during training, and after training each training dataset has its own initial states. So if you set specific initial states, the model will generate the corresponding visuomotor sequence. For this reason, the initial states can sometimes be interpreted as an intention.
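The distinction between internal state and activation, and the way the initial state selects which sequence is generated, can be sketched with a toy example. This is a single scalar unit with a made-up recurrent weight `W`, chosen so that the unit settles into one of two different patterns depending on where it starts; it is an illustration only and far simpler than the actual model.

```python
import math

W = 2.0  # hypothetical recurrent weight, chosen so the toy unit is bistable

def simulate(u0, steps=10):
    """Return the activations generated from the initial internal state u0."""
    u = u0
    activations = []
    for _ in range(steps + 1):
        h = math.tanh(u)      # activation = tanh(internal state)
        activations.append(h)
        u = W * h             # next internal state: weighted sum of inputs
    return activations

pattern_a = simulate(0.5)     # same weights, positive initial state
pattern_b = simulate(-0.5)    # same weights, negative initial state
print([round(h, 2) for h in pattern_a])
print([round(h, 2) for h in pattern_b])
```

The weights are identical in both runs; only the initial state differs, and that alone decides which pattern unfolds.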
Yes. If you set the initial states in the higher-level layers only, it might not work well. In the extension of this study, it was found that the higher-level layer encoded the type of gesture, while the lower-level layer encoded the specific appearance of the human subject. In this case, setting the initial states in all layers is more natural. It is like telling the model which gesture, and whose, to mentally simulate.
What would happen if I just choose random/neutral initial states that are far from the obtained ones?
It is highly likely that the model will not mentally simulate the visuo-proprioceptive patterns properly. However, there is also a chance that some kind of novel visual and proprioceptive predictions will be generated. I’ve been working on a similar topic, and this will be dealt with more thoroughly later.
Well, I think the model can be extended to more complex tasks. Indeed, the extension of this study is under review, so I can’t describe it in detail right now. But in that study, the model was successfully extended to a more complex task involving gestures performed by several different human subjects.
Is the source code available?
Yes. A draft version of the code is now available at https://github.com/mulkkyul/p-vmdnn_tensorflow