a reply to: neoholographic
People are not talking about a one-to-one correspondence with human intelligence. In some ways it will be better than human intelligence
because it will not have to deal with human problems.
I am useless compared to the world's best chess algorithms, but that doesn't mean I am afraid of those algorithms just because they are so much
better than me at a highly specialized problem.
Like I said, all it takes is a simple intelligent algorithm that can replicate itself, with each subsequent copy being more intelligent than the
initial algorithm. You could start with a simple intelligent algorithm with the intelligence of a 3rd grader and you would have an explosion of
intelligence as the algorithm replicates itself. It only needs one goal, and that's to replicate itself.
You're skipping the whole part about how it gets more intelligent with each generation and in what ways it gets more intelligent. If we had figured
out a way to do that, it would already have been done. Fundamentally you are correct: it should be possible to create an algorithm which can
self-replicate and evolve over time like any other species does, but the deep Q-network, or any other type of deep network, is probably not going
to be the way it works.
If you take a gander at the deep Q-network paper, it contains a list of the games they tested the algorithm on, and you will notice one game called
Montezuma's Revenge right at the bottom of the list, where the algorithm made 0% progress. I found a good video, "Why Montezuma Revenge doesnt work
in DeepMind Arcade Learning Environment", which I recommend watching to understand the limitations of this method.
The issue is once again related to the way the algorithm uses the game score to determine success and failure. The problem with this game is that
the score remains at 0 until you reach the key at the end of the level, and it's quite hard to actually reach the key in the first place. You have
to first move away from the key and then avoid an enemy while moving back towards it. A human can tell they are making progress even while moving
away from the key, and we would have absolutely no problem clearing the level.
The deep Q-network, however, basically never makes any progress on this level, because the score never increases and so it never knows whether it's
doing the right or the wrong thing. It's just like solving a complex real-world problem: there is generally no score telling you that you're getting
closer to the solution. After thinking about these limitations, though, there may be ways to improve the deep Q-network so it can beat games like
Montezuma's Revenge.
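To make the problem concrete, here is a rough sketch of the one-step target a deep Q-network is trained towards (the function and variable names are my own, not taken from the paper's code). When the score never changes, the reward is always 0 and the target collapses to the network's own near-zero estimates, so nothing ever pulls the agent towards the key:

```python
import numpy as np

# Rough sketch of the one-step DQN training target (names are my own).
def dqn_target(reward, q_next, done, gamma=0.99):
    if done:
        return reward                       # terminal step: target is just the reward
    return reward + gamma * np.max(q_next)  # otherwise bootstrap from the next screen

# In Montezuma's Revenge the score stays at 0 until you grab the key, so early in
# training every reward the agent ever sees is 0 and its value estimates stay near 0:
q_next = np.zeros(18)                        # one value estimate per Atari action
print(dqn_target(0.0, q_next, done=False))   # -> 0.0, i.e. no learning signal at all
```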
A core problem with the deep Q-network is that it lacks long-term memory, so it cannot plan for the future. Ideally, though, the problem would be
solved without actually changing the way the deep Q-network is designed. One possible solution would be to assign points simply for staying alive,
but then it would just stay in the same spot and never do anything. So the points for staying alive should decay when the agent is doing nothing,
and it should get points for making things on the screen change.
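As a very rough sketch of what I mean (every constant here, and the idea of measuring "doing something" by average frame difference, is just a guess of mine, not anything from the paper):

```python
import numpy as np

# Hypothetical reward shaping: a survival bonus that decays while the agent idles,
# plus a small bonus whenever the screen actually changes. All constants are guesses.
class ShapedReward:
    def __init__(self, survive_bonus=1.0, decay=0.95, change_bonus=0.5, threshold=10.0):
        self.survive_bonus = survive_bonus
        self.decay = decay
        self.change_bonus = change_bonus
        self.threshold = threshold
        self.current_bonus = survive_bonus
        self.prev_frame = None

    def step(self, frame):
        reward = self.current_bonus                      # points just for staying alive
        if self.prev_frame is not None:
            diff = np.abs(frame.astype(float) - self.prev_frame.astype(float)).mean()
            if diff > self.threshold:                    # something on screen changed
                reward += self.change_bonus
                self.current_bonus = self.survive_bonus  # reset the decay
            else:
                self.current_bonus *= self.decay         # doing nothing -> bonus decays
        self.prev_frame = frame
        return reward
```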
One problem with this approach, though, is that the concept of "staying alive" is just like the score: the agent cannot easily tell when it dies
just by looking at the screen pixels, so it needs to be directly told when it dies, just like it needs to be directly fed the score. The idea of
getting points for making pixels on the screen change could be on the right track, but it wouldn't be quite that simple, because the agent could
just run in circles without making any progress in order to collect points.
What we really want to reward is novel screen states; that is to say, the agent should get a lot of points when it manages to reach a point in the
game it has never reached before. This could perhaps be achieved by hashing screen states into point buckets, where the points decay depending on
how many times each screen state has been observed. So if the agent produces a screen state it has never seen before it gets the most points, and
it gets fewer points each time that same screen state is observed again.
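A minimal sketch of that idea, assuming the frames come in as uint8 pixel arrays and that downsampling them before hashing is good enough to merge near-identical screens (the downsampling factors and bonus size are made up):

```python
import hashlib
from collections import defaultdict

# Hypothetical novelty bonus: hash each (downsampled) screen into a bucket and
# pay out less each time the same bucket is seen again. Constants are guesses.
class NoveltyBonus:
    def __init__(self, max_bonus=1.0):
        self.max_bonus = max_bonus
        self.visit_counts = defaultdict(int)

    def bonus(self, frame):
        # Downsample and quantise so trivially different frames share a bucket.
        coarse = (frame[::8, ::8] // 32).tobytes()
        key = hashlib.md5(coarse).hexdigest()
        self.visit_counts[key] += 1
        # First visit pays the full bonus, repeats pay less and less.
        return self.max_bonus / self.visit_counts[key]
```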
This approach could perhaps allow the agent to figure out how to reach the key, because it will be rewarded for creating new screen states as it
gets closer to the key. If it keeps going the wrong way, it will eventually stop earning points for that route, because it will have observed those
losing game states so many times. It's probably not quite that simple, though, and it would still require the agent to be supplied with the game
score; I doubt encouraging novel screen states would work by itself.
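So in practice the novelty bonus would probably have to be mixed in with the real game score rather than replace it, something like this (the weighting is made up, and it reuses the hypothetical NoveltyBonus class sketched above):

```python
# Hypothetical combination: the real score change still drives learning, the
# novelty bonus just nudges the agent towards screens it hasn't seen before.
novelty = NoveltyBonus()

def combined_reward(score_delta, frame, beta=0.1):
    return score_delta + beta * novelty.bonus(frame)
```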