gomoku_rl.policy.ppo module

class gomoku_rl.policy.ppo.PPO(cfg: DictConfig, action_spec: DiscreteTensorSpec, observation_spec: TensorSpec, device: device | str | int | None = 'cuda')

Bases: Policy

eval(): Sets the policy to evaluation mode.

learn(data: TensorDict)

Updates the policy based on a batch of data.

Parameters: data (TensorDict) – A batch of data, typically including observations, actions, rewards, and next observations.
Returns: A dictionary containing information about the learning step, such as loss values.
Return type: Dict
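
The page does not show how learn() computes its loss values. As a rough illustration only, the sketch below implements the standard PPO clipped surrogate objective in plain Python and wraps the result in a dictionary, mirroring the documented return type. The function name clipped_surrogate_loss, the clip_eps default of 0.2, and the "policy_loss" key are assumptions made for this example; none of them is taken from gomoku_rl.

```python
import math

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate loss (a quantity to minimize) over a batch.

    Hypothetical helper for illustration; not part of gomoku_rl.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated policy and the data-collecting policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic choice: the smaller of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped * adv)
    # Negate because the surrogate objective is maximized.
    return -total / len(advantages)

# A learn()-style info dictionary; the key name is an assumption.
info = {
    "policy_loss": clipped_surrogate_loss(
        new_log_probs=[-0.5, -1.2],
        old_log_probs=[-0.7, -1.0],
        advantages=[1.0, -0.5],
    )
}
```

In the first sample the ratio exceeds 1 + clip_eps with a positive advantage, so the clipped term is used; in the second the ratio stays inside the clipping range and the unclipped term applies.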

load_state_dict(state_dict: Dict)

Loads the policy state from a dictionary.

Parameters: state_dict (Dict) – The state of the policy to load.

state_dict() → Dict

Returns the state of the policy as a dictionary.

Returns: The state of the policy.
Return type: Dict
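
state_dict() and load_state_dict() together support a checkpoint round trip. Constructing a real PPO instance needs a cfg and tensor specs that this page does not describe, so the sketch below uses a hypothetical stand-in class, StandInPolicy, that exposes only the two documented method names. Its internal "weights" layout is an assumption and says nothing about what the real state dictionary contains.

```python
import copy

class StandInPolicy:
    """Hypothetical stand-in with the documented state_dict()/load_state_dict() pair."""

    def __init__(self, weights):
        self.weights = dict(weights)

    def state_dict(self):
        # Return a deep copy so later training does not mutate the snapshot.
        return {"weights": copy.deepcopy(self.weights)}

    def load_state_dict(self, state_dict):
        # Copy on load so the policy does not alias the caller's dictionary.
        self.weights = copy.deepcopy(state_dict["weights"])

policy = StandInPolicy({"actor": [0.1, 0.2]})
checkpoint = policy.state_dict()      # snapshot the current state
policy.weights["actor"][0] = 9.9      # simulate further training

restored = StandInPolicy({})
restored.load_state_dict(checkpoint)  # restore the snapshot into a fresh policy
```

With the real class, the same pattern would presumably apply with the returned dictionary written to and read from disk between the two calls.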

train(): Sets the policy to training mode.
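
Because eval() and train() switch modes in place, evaluation code can leave a policy in the wrong mode if it raises partway through. One possible way to guard against that is a small context manager, sketched below. ModeTrackingPolicy and its training flag are hypothetical stand-ins for this example, and evaluation_mode is not a gomoku_rl function.

```python
from contextlib import contextmanager

class ModeTrackingPolicy:
    """Hypothetical stand-in exposing the documented train()/eval() methods."""

    def __init__(self):
        self.training = True  # assumed flag, used only to observe the mode

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

@contextmanager
def evaluation_mode(policy):
    """Switch to evaluation mode, then return to training mode on exit."""
    policy.eval()
    try:
        yield policy
    finally:
        # Runs even if the body raises, so the policy is never left in eval mode.
        policy.train()

policy = ModeTrackingPolicy()
with evaluation_mode(policy):
    mode_inside = policy.training  # False while evaluating
mode_after = policy.training       # True again after the block
```

This sketch always returns to training mode on exit, which suits a training loop that pauses for periodic evaluation; a policy that should stay in evaluation mode would call eval() directly instead.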