The term ELK (eliciting latent knowledge) describes a problem in ML alignment. Aligning an ML system means ensuring that the algorithm receives a positive reward only when the required task is actually achieved. The alignment problem arises because we cannot define an exact reward function that covers all possible scenarios; when an agent collapses onto a specific counterexample, obtaining high reward without regard to the environment, we consider that system misaligned.
To examine the alignment of ML systems, the framework proposed by the authors includes an environment (the state of the world), a predictor (the trained model), a reporter (the model validator), and a human (who asks questions and gives feedback). In this framework, the predictor and reporter are assumed to have complete access to the environment at any given instant; the reporter additionally has access to the predictor’s state of the world (the world as the predictor perceives it) and to human interactions. The predictor’s task is to predict the future state of the environment given its past and current states, while the reporter takes human queries about the model or the environment and answers them, which lets us check whether the predictor is aligned with human values.
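A minimal sketch of how this predictor/reporter setup could be wired up; the class names, architectures, and loss below are our own illustrative choices, not taken from the ELK report.

```python
# Illustrative sketch of the predictor/reporter setup (our own naming and
# architecture choices, not from the ELK report).
import torch
import torch.nn as nn


class Predictor(nn.Module):
    """Maps current observations to a latent state-of-the-world and a
    prediction of future observations."""
    def __init__(self, obs_dim: int, latent_dim: int):
        super().__init__()
        self.encode = nn.Linear(obs_dim, latent_dim)
        self.decode = nn.Linear(latent_dim, obs_dim)

    def forward(self, obs_now: torch.Tensor):
        latent = torch.relu(self.encode(obs_now))
        predicted_future_obs = self.decode(latent)
        return latent, predicted_future_obs


class Reporter(nn.Module):
    """Answers a human question given the predictor's latent state."""
    def __init__(self, latent_dim: int, question_dim: int):
        super().__init__()
        self.head = nn.Linear(latent_dim + question_dim, 1)

    def forward(self, latent: torch.Tensor, question: torch.Tensor):
        return torch.sigmoid(self.head(torch.cat([latent, question], dim=-1)))


def reporter_loss(reporter: Reporter, latent, question, human_answer):
    """Supervised loss on questions humans can label; this human feedback is
    the only training signal for the reporter, which is where ELK bites."""
    return nn.functional.binary_cross_entropy(
        reporter(latent.detach(), question), human_answer
    )
```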
Based on their behaviour, reporters are categorised into direct translators (DT) and human simulators (HS). An HS just tries to bring its output (or utility) close to the human distribution, while a DT estimates it from the state of the environment as the predictor understands it. In an aligned system the DT produces the same output as the HS, but it does so by consulting the state of the environment rather than by imitating the human.
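To make the distinction concrete, here is a toy illustration (our own, borrowing the diamond-in-a-vault example used in the ELK report) of why the training signal alone cannot separate the two:

```python
# Toy illustration (our own, not from any proposal) of why training data
# cannot distinguish a human simulator from a direct translator.

def human_simulator(question: str, camera_shows_diamond: bool) -> bool:
    # Answers the way a human looking only at the camera feed would.
    return camera_shows_diamond

def direct_translator(question: str, diamond_actually_present: bool) -> bool:
    # Answers by reading off what the predictor's latent state encodes.
    return diamond_actually_present

# Training cases: no sensor tampering, so the camera matches reality and both
# reporters give identical answers -- the loss cannot tell them apart.
training_cases = [(True, True), (False, False)]
assert all(
    human_simulator("is the diamond safe?", cam)
    == direct_translator("is the diamond safe?", real)
    for cam, real in training_cases
)

# Off-distribution case: the camera is fooled (it shows the diamond, but the
# diamond is gone). The HS still answers "yes"; the DT answers "no".
assert human_simulator("is the diamond safe?", True) is True
assert direct_translator("is the diamond safe?", False) is False
```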
A more detailed introduction to ELK can be found here. ARC posed this question as an open challenge, expecting to gather solutions and perspectives from people all around the world. Below, we discuss a few of the submitted proposals along with our opinions.
- Train a reporter that is useful to an auxiliary AI
- Here, the authors encourage using auxiliary models as a regularizer for the reporter: agents are rewarded only for learning “useful” information rather than for maximising entropy. The authors propose training these auxiliary networks on partial environmental information along with a few alternatives (for example, predicting the predictor network’s latent activations, ensembling, and consensus). Since these auxiliary networks have human feedback embedded in them, given sufficient data and time they should help push reporters towards being DTs aligned with human values (a rough sketch of such an auxiliary loss is shown after this proposal’s counterexample).
- Counterexample: Introducing an auxiliary network increases the possibility of encoding non-semantic information in the reporter’s answers. This can be limited by imposing consistency constraints, which in turn increases computational complexity many-fold. The proposal claims an easy fix for extreme counterexamples: simply avoid asking the reporter obviously unrelated questions, in other words limiting the reporter’s degrees of freedom.
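A rough sketch of how such an auxiliary “usefulness” regularizer could look; the class, the reconstruction loss, and the `lambda_aux` weighting are our own assumptions, not the proposal’s exact construction.

```python
import torch
import torch.nn as nn


class AuxiliaryPredictor(nn.Module):
    """Tries to reconstruct the predictor's latent activations from the
    reporter's answers plus partial environment information (a sketch)."""
    def __init__(self, answer_dim: int, partial_obs_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(answer_dim + partial_obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, answers, partial_obs):
        return self.net(torch.cat([answers, partial_obs], dim=-1))


def usefulness_loss(aux, answers, partial_obs, predictor_latent):
    """The reporter is rewarded (loss decreases) only insofar as its answers
    actually help the auxiliary model recover the predictor's state."""
    return nn.functional.mse_loss(aux(answers, partial_obs), predictor_latent)


# Combined reporter objective (lambda_aux is a hypothetical weighting):
# total_loss = qa_loss + lambda_aux * usefulness_loss(aux, answers, partial_obs, latent)
```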
- Continuous reporters
- Here, the authors encourage reporters to model the environment in a continuous fashion, constantly observing it to detect every minute change. The difficulty is defining “minute change” concretely. Assuming reporters are monotonic, the most straightforward definition is to compare the predictor’s latent activations at times t and t-1. The proposal includes regularizers that penalize the reporter when its answers do not track such a “minute change” (or change when there is none), and the regularizer can also incorporate human feedback on which parts of the environment should be considered continuous (see the sketch after this proposal’s counterexample).
- Counterexample: The predictor and reporter may each map the environment internally in a completely different (non-continuous or non-monotonic) way. We can try to force both models to learn “continuous” representations, but the main issue is whether such continuous functions even exist for the optimization, and whether the environment has the properties those functions require.
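One possible reading of such a continuity penalty, as a sketch only; the function name, the ratio form, and the threshold are our assumptions.

```python
import torch


def continuity_penalty(reporter, latent_t, latent_prev, question, eps: float = 1e-6):
    """Penalize the reporter when its answers change faster than the
    predictor's latent state does between times t-1 and t (our reading of
    the proposal, not its exact regularizer)."""
    answer_change = torch.norm(
        reporter(latent_t, question) - reporter(latent_prev, question)
    )
    latent_change = torch.norm(latent_t - latent_prev)
    # Only answer changes that are large relative to the underlying
    # "minute change" in latent state incur a penalty.
    return torch.relu(answer_change / (latent_change + eps) - 1.0)
```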
- Sequential training
- This seems to be a very interesting proposal, where the authors train both the predictor and the reporter by increasing their complexity gradually (the models at iteration k and at iteration k+1 differ only in a limited sense). The authors claim that the more complex models higher up the hierarchy will be aligned as long as the initial models are aligned. The reporter in the initial iteration can be treated as a DT, since that is the easiest way to answer questions for a simple predictor, or it can easily be constrained by humans to follow human values; as the complexity of the reporter and predictor is gradually increased, this alignment property may be preserved (a sketch of such a training loop follows the counterexample below).
- Counterexample: In this proposal it is quite possible for a DT reporter to gradually turn into an HS. As the predictor’s complexity changes continuously, at some iteration it may switch modes, and the effect of this change gradually seeps into the reporter, pushing it from DT to HS.
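A hypothetical sketch of this sequential loop, growing only the reporter for brevity; the names, the distillation penalty toward the previous iteration, and the width schedule are all our own assumptions.

```python
import copy

import torch
import torch.nn as nn


def make_reporter(width: int, latent_dim: int = 16, question_dim: int = 8) -> nn.Module:
    """Reporter whose capacity grows with `width` across iterations."""
    return nn.Sequential(
        nn.Linear(latent_dim + question_dim, width), nn.ReLU(), nn.Linear(width, 1)
    )


def sequential_training(latents, questions, human_answers, widths=(8, 16, 32), lam=1.0):
    """Train reporters of increasing capacity; each new reporter is penalized
    for disagreeing with the previous (assumed-aligned) one. Expects latents
    of shape (N, 16), questions (N, 8), human_answers (N, 1) in [0, 1]."""
    prev = None
    inputs = torch.cat([latents, questions], dim=-1)
    for width in widths:
        reporter = make_reporter(width)
        opt = torch.optim.Adam(reporter.parameters(), lr=1e-3)
        for _ in range(200):
            pred = reporter(inputs)
            loss = nn.functional.binary_cross_entropy_with_logits(pred, human_answers)
            if prev is not None:
                # Stay close to the previous iteration's answers, hoping
                # alignment carries over from iteration k-1 to k.
                loss = loss + lam * nn.functional.mse_loss(pred, prev(inputs).detach())
            opt.zero_grad()
            loss.backward()
            opt.step()
        prev = copy.deepcopy(reporter)
    return prev
```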
- Conclusion: The idea of ELK is to find regularization methods that encourage reporters to be DTs, and we described a few examples above. This forces us to ask whether it is even possible to develop a DT reporter without any counterexamples; the answer may be no. We can always find counterexamples for any given framework, but whether those counterexamples make sense is a completely different story. We can instead consider a relaxed version of the alignment problem and develop an ELK solution for it, where the existence of meaningful counterexamples is minimized.