Hi, I was wondering if I could get a pointer to where --cb k --cb_type dr is implemented in the source code? Basically, I am trying to understand the parameters that are learned at the end of off-policy CB training in VW. For example, I ran:
vw --cb 3 --cb_type ips -f cb.model -d train.txt --invert_hash readable_ips.model
vw --cb 3 --cb_type dm -f cb.model -d train.txt --invert_hash readable_dm.model
vw --cb 3 --cb_type dr -f cb.model -d train.txt --invert_hash readable_dr.model
and the dr model obviously contains parameters equal to ips + dm, but I want to know exactly what linear regression formula is being implemented in dr.
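For intuition, here is a minimal sketch of the textbook doubly robust (DR) cost estimate for one logged example. My understanding is that --cb_type dr fills in a cost for every action along these lines and then trains the underlying regressor on those costs, but treat this as the standard formula, not a transcription of VW's source. The function name dr_cost_estimates and the DM predictions in the example are hypothetical.

```python
def dr_cost_estimates(dm_pred, logged_action, logged_cost, logged_prob):
    """Doubly robust cost estimate for every action of one logged example.

    dm_pred: dict mapping action -> predicted cost from a regression model (the DM part).
    logged_action, logged_cost, logged_prob: the action:cost:probability label.

    For each action a:
        c_hat(a) = dm_pred[a] + 1[a == logged_action] * (logged_cost - dm_pred[a]) / logged_prob
    """
    estimates = {}
    for action, pred in dm_pred.items():
        correction = 0.0
        if action == logged_action:
            # IPS-weighted residual: only the logged action gets a correction term.
            correction = (logged_cost - pred) / logged_prob
        estimates[action] = pred + correction
    return estimates


# Label 2:10.02:0.5 with hypothetical DM predictions for 3 actions.
est = dr_cost_estimates({1: 9.0, 2: 9.5, 3: 11.0}, logged_action=2, logged_cost=10.02, logged_prob=0.5)

# With an all-zero DM model, the formula collapses to the IPS estimate:
# cost / prob for the logged action and 0 for every other action.
ips_like = dr_cost_estimates({1: 0.0, 2: 0.0, 3: 0.0}, logged_action=2, logged_cost=10.02, logged_prob=0.5)
print(est, ips_like)
```

Read this way, each DR target is a DM prediction plus an IPS-style term, which is one way to see why the learned weights look like a mix of the ips and dm models.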
What about --cats? Have you experimented with that at all? For CATS, I would try different combinations of the number of discrete actions used by the algorithm (passed to the --cats argument) and bandwidths (the bandwidth being a property of the continuous range). For example, I would try a grid of num_actions [8, 16, 32, 64, 128, 256, 1024] and bandwidths [1, 2, 4, 6, 8, 10, 14, 20]. Depending on the number of discrete actions, you might need more data for CATS to converge to something sensible. CATS label support in pyvw should be available in the next release (coming soon-ish; we don't want to wait another year for the next VW release). Let me know whether you get better results from CATS :)
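A small sketch for generating that grid of runs. I believe --bandwidth, --min_value and --max_value are the CATS options in VW, but check vw --help for your version. The data file train_cats.txt and the 0 to 100 action range are placeholders, not values from this thread.

```python
from itertools import product

# Grid values taken from the suggestion above.
NUM_ACTIONS = [8, 16, 32, 64, 128, 256, 1024]
BANDWIDTHS = [1, 2, 4, 6, 8, 10, 14, 20]


def cats_commands(data_file, min_value, max_value):
    """Build one vw command line per (num_actions, bandwidth) pair."""
    commands = []
    for n, bw in product(NUM_ACTIONS, BANDWIDTHS):
        commands.append(
            f"vw --cats {n} --bandwidth {bw} "
            f"--min_value {min_value} --max_value {max_value} "
            f"-d {data_file} -f cats_{n}_{bw}.model"
        )
    return commands


# Placeholder file name and continuous range; substitute your own.
cmds = cats_commands("train_cats.txt", 0, 100)
print(len(cmds))
print(cmds[0])
```

Each command writes its own model file, so the runs can be compared side by side afterwards.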
prob(new_policy)/prob(logging_policy), but isn't this only for when we use IPS? I think I'm missing something quite obvious here...
3. In order to use --explore_eval, I have to convert my data from cb format to cb_adf format, since the cb format is not supported with --explore_eval. For the example data with two arms below, are the two representations equivalent?
cb format:
2:10.02:0.5 | x0:0.47 x1:0.84 x2:0.29
1:8.90:0.5 | x0:0.51 x1:0.65 x2:0.67

cb_adf format:
shared | x0:0.47 x1:0.84 x2:0.29
| a1
0:10.02:0.5 | a2

shared | x0:0.51 x1:0.65 x2:0.67
0:8.90:0.5 | a1
| a2
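If it helps, here is a hypothetical helper that does the cb to cb_adf conversion shown above mechanically: the context becomes the shared line, each action gets its own line, and only the logged action carries the cost:probability label (the leading 0 follows the convention in the example above). The function name cb_to_adf and the action feature strings are assumptions for illustration.

```python
def cb_to_adf(cb_line, action_features):
    """Convert one single-line --cb example into a multiline --cb_adf example.

    cb_line: "action:cost:prob | features", with actions numbered from 1.
    action_features: feature string per action, in action order (action 1 first).
    """
    label, features = cb_line.split("|", 1)
    action, cost, prob = label.strip().split(":")
    lines = ["shared |" + features]
    for index, feats in enumerate(action_features, start=1):
        # Only the logged action carries a label; the leading 0 is a placeholder.
        prefix = f"0:{cost}:{prob} " if index == int(action) else ""
        lines.append(f"{prefix}| {feats}")
    # A trailing blank line terminates a multiline example in VW's text format.
    return "\n".join(lines) + "\n\n"


first = cb_to_adf("2:10.02:0.5 | x0:0.47 x1:0.84 x2:0.29", ["a1", "a2"])
second = cb_to_adf("1:8.90:0.5 | x0:0.51 x1:0.65 x2:0.67", ["a1", "a2"])
print(first + second, end="")
```

One thing to double-check (my understanding, not verified against the source): with --cb_adf the actions are distinguished only by their own features, so without an interaction between the shared and action features (for example -q over named namespaces) a linear model may rank the actions the same way for every context.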
Regarding trying --cb_type ips/dm/dr and choosing the one with the best reported loss: isn't that wrong, especially considering that DM is biased? --eval throws an error if you use DM.
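To make the IPS vs DM vs DR question concrete, here is a minimal sketch of the three standard off-policy estimators of a deterministic target policy's average cost over logged (context, action, cost, prob) tuples. The ratio prob(new_policy)/prob(logging_policy) appears in IPS and in DR's correction term, but not in DM, which is only as good as its cost model. All names, the toy log (built from the two examples above), the target policy and the constant cost model are hypothetical; this is not how VW's --eval or --explore_eval compute their numbers.

```python
def importance_weight(target_action, logged_action, logged_prob):
    # prob(new_policy) / prob(logging_policy) for a deterministic target policy.
    return (1.0 if target_action == logged_action else 0.0) / logged_prob


def estimate_value(logs, target, cost_model):
    """Return (ips, dm, dr) estimates of the target policy's average cost.

    logs: list of (context, logged_action, cost, prob) tuples.
    target: function context -> action chosen by the new policy.
    cost_model: function (context, action) -> predicted cost (the DM regressor).
    """
    ips = dm = dr = 0.0
    for x, a, cost, p in logs:
        chosen = target(x)
        w = importance_weight(chosen, a, p)
        ips += w * cost
        dm += cost_model(x, chosen)
        # DR: DM baseline plus the importance-weighted residual on the logged action.
        dr += cost_model(x, chosen) + w * (cost - cost_model(x, a))
    n = len(logs)
    return ips / n, dm / n, dr / n


logs = [
    ({"x0": 0.47, "x1": 0.84, "x2": 0.29}, 2, 10.02, 0.5),
    ({"x0": 0.51, "x1": 0.65, "x2": 0.67}, 1, 8.90, 0.5),
]
always_one = lambda x: 1            # hypothetical target policy
constant_model = lambda x, a: 9.0   # deliberately crude cost model
ips, dm, dr = estimate_value(logs, always_one, constant_model)
print(ips, dm, dr)
```

With a crude cost model, DM simply reports that model's guess, while IPS and DR are pulled toward the observed cost on the example where the target and logged actions agree.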