As I posted on Mattermost, one of the applications I’ve been working on for Transflower is the robotics task that Tianwei designed and gathered some data for using VR.
After a lot of bugs and problems with both the code and the data, and a fair amount of fixing and improving, the model seems to be starting to work and even generalizing somewhat!
What was done
Enhancements/tweaks:
- Simplified the observation space to include only the object to interact with, and kept only the tasks that involve a single object. This is a significant simplification, but it was worth trying! I plan to explore relaxing it.
- Since we only give the model the object it will interact with, I’ve omitted the object color information from the observation, as it isn’t necessary to accomplish the simplified task.
- Applied smoothing to the data and trimmed a bit of the beginning (where the user is reading the instruction and doing nothing); see the smoothing sketch after this list.
- Played with adding input dropout, as in MoGlow, to force the model to pay more attention to the instructions (a sketch of the idea is also included after this list).
- Filtered the data to include only my and Tianwei’s demonstrations. It turns out most other people didn’t have enough practice with the VR system to reliably produce good demonstrations. I could do more fine-grained filtering, because some of the other demonstrations are actually good, but I’ve done it this way for simplicity for now.
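To make the smoothing/trimming step concrete, here is a minimal sketch of what I mean. The window length, polynomial order, and number of trimmed frames are placeholders, not the values actually used in the pipeline:

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_and_trim(trajectory: np.ndarray, trim_frames: int = 100,
                    window: int = 15, polyorder: int = 3) -> np.ndarray:
    """Drop the idle beginning of a demonstration and smooth out VR jitter.

    trajectory: array of shape (T, D) -- T timesteps, D observation/action dims.
    trim_frames: initial frames to discard (user still reading the instruction).
    """
    trimmed = trajectory[trim_frames:]
    # Savitzky-Golay filtering along the time axis removes high-frequency
    # tracking noise while preserving the overall shape of the motion.
    return savgol_filter(trimmed, window_length=window, polyorder=polyorder, axis=0)
```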
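And here is a minimal sketch of the MoGlow-style input dropout idea, applied to the autoregressive context so the model can’t rely on it alone and has to attend to the instruction. The tensor shapes and dropout probability are assumptions:

```python
import torch

def drop_context(context: torch.Tensor, p_drop: float = 0.3,
                 training: bool = True) -> torch.Tensor:
    """Zero out the autoregressive context for a random fraction of samples.

    context: tensor of shape (batch, context_len, obs_dim).
    Masking the past observations for some samples forces the model to rely
    on the instruction conditioning instead of just continuing the motion.
    """
    if not training or p_drop == 0.0:
        return context
    keep = (torch.rand(context.shape[0], 1, 1, device=context.device) > p_drop)
    return context * keep.to(context.dtype)
```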
Fixes:
- I have some evidence from another application suggesting that TorchScript sometimes subtly breaks the model. I have therefore stopped using it, but I want to investigate further.
- Found a serious bug in the code that updates the context during inference, which would break any model with a context length > 1 (the corrected rolling update is sketched after this list).
- Probably other bugs I can’t remember.
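For reference, the context update during autoregressive inference should behave like the rolling buffer below. This is just a sketch with hypothetical names, not the actual code, but it is the behaviour the bug was violating for context lengths > 1:

```python
import torch

def update_context(context: torch.Tensor, new_obs: torch.Tensor) -> torch.Tensor:
    """Shift the context window: drop the oldest observation, append the newest.

    context: tensor of shape (context_len, obs_dim).
    new_obs: tensor of shape (obs_dim,).
    With context_len == 1 almost any update looks correct, which is why the
    bug only showed up for longer contexts.
    """
    return torch.cat([context[1:], new_obs.unsqueeze(0)], dim=0)
```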
More comments: a “plateau” learning-rate schedule (automatically decreasing the learning rate when the loss stops improving) seems to be a nice trick; a sketch is below.
So the main steps to make it work were: filtering out the bad data, dropping TorchScript (which may have been breaking things), and smoothing, which also helps a lot. It shouldn’t have taken me so long to figure these out.
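In PyTorch this is just ReduceLROnPlateau; something along these lines (the model, training function, factor and patience here are placeholders, not my actual settings):

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in for the actual model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate when the monitored loss hasn't improved for `patience` epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=10)

for epoch in range(100):
    loss = train_one_epoch(model, optimizer)  # hypothetical training function
    scheduler.step(loss)                      # pass the value the scheduler monitors
```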
Preliminary qualitative look
Here it is accomplishing a task outside the training set (the instruction appears in the training set, but not with this particular object position):
Paint black bird yellow:
Put blue bird on the shelf:
Those are a bit cherry-picked. Here are some comments on the current degree of generalization I observe in the model from manual tests:
- If the configuration is quite different from the training set (i.e. a new (object type, object position) pair), the model seems to have a tendency to paint the object some color, regardless of the given instruction. This may be because painting is the most common instruction in the training set (approximately half of the instructions are Paint instructions).
- If the configuration is similar to one in the training set, it will tend to do the task that was in the training set, especially if it starts from the same arm pose. Not surprising.
- Nevertheless, it will still generalize sometimes, achieving new combinations of (object type, object position, goal) that weren’t in the training set. As the model is probabilistic, it may only achieve this a certain fraction of the time, and only given a long enough episode (it may fumble around until it succeeds); measuring this for different tasks and configurations will be necessary to get a better grasp of the generalization.
Next steps
- Try the model with a bit more training and longer context
- Gather more data
- Make automatic evaluation. This would require creating a reward function, either hard-coded or learned, to test how often the model succeeds at the tasks (a rough sketch of a hard-coded success check is included after this list)
- Try different methods to improve the model (beyond gathering more demo data):
– PPO with hard-coded reward function
– GAIL (basically the same as above but with a learned reward function)
– Supervised learning on self-generated data with relabelling (like in Laetitia’s project), which is basically Levine’s GCSL algorithm, which in turn is a combination of HER and the self-imitation learning approach to RL (see https://arxiv.org/pdf/1912.06088.pdf)
– Some offline-RL algorithm. Actually, a question: is there a policy-based (not value-based) offline RL algorithm? I’m sure there are some, but I can’t remember one off the top of my head!
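For the automatic evaluation point above, a hard-coded success check could be as simple as the sketch below. Everything here is hypothetical (the state accessors, color encoding and tolerances are made up for illustration); the point is just to map each instruction type to a predicate on the final state, and then estimate the success rate by running the stochastic policy several times per task/configuration:

```python
import numpy as np

def task_succeeded(instruction, final_state) -> bool:
    """Rough success predicate for the simplified one-object tasks.

    `instruction` is assumed to carry the task type and its goal parameters;
    `final_state` is assumed to expose the target object's color and position.
    """
    if instruction.task == "paint":
        return final_state.object_color == instruction.goal_color
    if instruction.task == "place":
        # Success if the object ends up within some tolerance of the goal location.
        return float(np.linalg.norm(final_state.object_position
                                    - instruction.goal_position)) < 0.05
    return False
```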