
## On-Policy SARSA

Like the MC control methods, we will again leverage GPI. We will use a TD-driven approach for the policy value estimation/prediction step and continue to use greedy maximization for policy improvement. Just as with MC, we need to explore enough and visit all the states an infinite number of times to find an optimal policy. Similar to the MC method, we can use an $\varepsilon$-greedy policy and slowly reduce the $\varepsilon$ value to zero, i.e., bring the exploration down to zero in the limit.
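A decaying-$\varepsilon$ action selector can be sketched as follows (a minimal illustration; the toy Q-table sizes and the decay schedule are assumptions, not from the book):

```python
import numpy as np

# Epsilon-greedy selection with a decaying epsilon (the Q table, sizes, and
# decay schedule below are illustrative assumptions).
def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: random action
    return int(np.argmax(Q[state]))           # exploit: current greedy action

rng = np.random.default_rng(0)
Q = np.zeros((5, 2))                          # toy table: 5 states, 2 actions
eps = 1.0
for episode in range(100):
    a = epsilon_greedy(Q, 0, 2, eps, rng)
    eps = max(0.01, eps * 0.99)               # slowly bring exploration toward zero
```

In practice the decay is usually floored at a small positive value during training; the theoretical guarantee, however, requires $\varepsilon \to 0$ in the limit.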
The TD setup is model-free; i.e., we have no prior comprehensive knowledge of the transitions. At the same time, to be able to maximize the return by choosing the right actions, we need to know the state-action values $Q(S, A)$. We can reformulate the TD estimation from equation (4.4) into equation (4.6), essentially replacing $V(s)$ with $Q(s, a)$. Both setups are Markov processes, with equation (4.4) focusing on state-to-state transitions and equation (4.6) on state-action to state-action transitions.

$$Q\left(S_{t}, A_{t}\right) \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha *\left[R_{t+1}+\gamma * Q\left(S_{t+1}, A_{t+1}\right)-Q\left(S_{t}, A_{t}\right)\right]$$
Similar to equation (4.5), the TD error is now given in the context of q-values:
$$\delta_{t}=R_{t+1}+\gamma * Q\left(S_{t+1}, A_{t+1}\right)-Q\left(S_{t}, A_{t}\right)$$
To carry out an update as per equation (4.6), we need all five values: $S_{t}$, $A_{t}$, $R_{t+1}$, $S_{t+1}$, and $A_{t+1}$. This is the reason the approach is called SARSA (state, action, reward, state, action). We will follow an $\varepsilon$-greedy policy to generate the samples, update the q-values using (4.6), and then, based on the updated q-values, create a new $\varepsilon$-greedy policy. The policy improvement theorem guarantees that the new policy will be better than the old policy unless the old policy was already optimal. Of course, for the guarantee to hold, we need to bring the exploration probability $\varepsilon$ down to zero in the limit.

Please also note that for all episodic policies, the terminal states have $Q(S, A)$ equal to zero; i.e., once in a terminal state, the agent cannot transition anywhere and will keep getting a reward of zero. This is another way of saying that the episode ends and $Q(S, A)$ is zero for all terminal states. Therefore, when $S_{t+1}$ is a terminal state, equation (4.6) will have $Q\left(S_{t+1}, A_{t+1}\right)=0$, and the update equation becomes:
$$Q\left(S_{t}, A_{t}\right) \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha *\left[R_{t+1}-Q\left(S_{t}, A_{t}\right)\right]$$
Let’s now look at the pseudocode of the SARSA algorithm; see Figure 4-13.
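The main loop can be sketched in Python as follows. The toy 1-D chain environment, the function names, and the hyperparameters are illustrative assumptions standing in for the generic MDP of the pseudocode, not the book's code:

```python
import numpy as np

# Toy 1-D chain MDP (hypothetical): states 0..4, actions 0 = left, 1 = right;
# reaching state 4 yields reward +1 and ends the episode, all else gives 0.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def sarsa(episodes=300, alpha=0.1, gamma=0.9, eps=0.3, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    def pick(s):  # epsilon-greedy behavior policy
        return rng.integers(N_ACTIONS) if rng.random() < eps else int(np.argmax(Q[s]))
    for _ in range(episodes):
        s, a, done = 0, pick(0), False
        for _ in range(10_000):           # step cap, in case an episode wanders
            s2, r, done = step(s, a)
            a2 = pick(s2)
            # Equation (4.6); Q of a terminal state is 0, so the target is just r.
            target = r if done else r + gamma * Q[s2, a2]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
            if done:
                break
    return Q

Q = sarsa()
```

Note that the next action $A_{t+1}$ is sampled from the same $\varepsilon$-greedy policy before the update, which is what makes SARSA on-policy.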

## An Off-Policy TD Control

In SARSA, we used samples with the values $S, A, R, S^{\prime}$, and $A^{\prime}$ that were generated by the policy being followed. The action $A^{\prime}$ from state $S^{\prime}$ was produced using the $\varepsilon$-greedy policy, the same policy that was then improved in the "improvement" step of GPI. However, instead of generating $A^{\prime}$ from the policy, what if we looked at all the $Q\left(S^{\prime}, A^{\prime}\right)$ values and chose the action $A^{\prime}$ that maximizes $Q\left(S^{\prime}, A^{\prime}\right)$ across the actions available in state $S^{\prime}$? We could continue to generate the samples $\left(S, A, R, S^{\prime}\right)$ (notice there is no $A^{\prime}$ as a fifth value in this tuple) using an exploratory policy like $\varepsilon$-greedy. However, we improve the policy by choosing $A^{\prime}$ to be $\operatorname{argmax}_{A^{\prime}} Q\left(S^{\prime}, A^{\prime}\right)$. This small change in the approach creates a new way to learn the optimal policy called $Q$-learning. It is no longer on-policy learning but rather an off-policy control method where the samples $\left(S, A, R, S^{\prime}\right)$ are generated by an exploratory policy, while we maximize $Q\left(S^{\prime}, A^{\prime}\right)$ to find a deterministic optimal target policy. We are using exploration with the $\varepsilon$-greedy policy to generate the samples $\left(S, A, R, S^{\prime}\right)$. At the same time, we are exploiting the existing knowledge by finding the $Q$-maximizing action $\operatorname{argmax}_{A^{\prime}} Q\left(S^{\prime}, A^{\prime}\right)$ in state $S^{\prime}$. We will have a lot more to say about these trade-offs between exploration and exploitation in Chapter 9.
The update rule for q-values is now defined as follows:
$$Q\left(S_{t}, A_{t}\right) \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha *\left[R_{t+1}+\gamma * \max_{A_{t+1}} Q\left(S_{t+1}, A_{t+1}\right)-Q\left(S_{t}, A_{t}\right)\right]$$
Comparing the previous equation with equation (4.8), you will notice the subtle difference between the two approaches and how that makes Q-learning an off-policy method. The off-policy behavior of Q-learning is handy, and it makes the method sample efficient. We will touch upon this in a later section when we talk about experience replay or the replay buffer. Figure 4-15 gives the pseudocode of Q-learning.
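The loop can be sketched on a toy problem as follows (the 1-D chain environment, names, and hyperparameters are illustrative assumptions standing in for the generic MDP of the pseudocode):

```python
import numpy as np

# Toy 1-D chain MDP (hypothetical): states 0..4, action 1 moves right,
# action 0 moves left; reaching state 4 gives +1 and ends the episode.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def q_learning(episodes=300, alpha=0.1, gamma=0.9, eps=0.3, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        s, done = 0, False
        for _ in range(10_000):           # step cap, in case an episode wanders
            # Behavior policy: epsilon-greedy (exploratory).
            a = rng.integers(N_ACTIONS) if rng.random() < eps else int(np.argmax(Q[s]))
            s2, r, done = step(s, a)
            # Target policy: greedy -- the max over actions, not a sampled A'.
            target = r if done else r + gamma * float(np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
            if done:
                break
    return Q

Q = q_learning()
```

The only change relative to SARSA is the target: `np.max(Q[s2])` replaces the q-value of the action the behavior policy actually samples next, which is exactly the off-policy step.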

## Maximization Bias and Double Learning

If you look back at equation (4.10), you will notice that we are maximizing over $A^{\prime}$ to get the max value $Q\left(S^{\prime}, A^{\prime}\right)$. Similarly, in SARSA, we find a new $\varepsilon$-greedy policy that also maximizes over $Q$ to get the action with the highest q-value. Further, these q-values are themselves estimates of the true state-action values. In summary, we are using a max over the q-estimates as an "estimate" of the maximum value. Such an approach of taking the "max of estimates" as an "estimate of the max" introduces a positive bias.

To see this, consider a scenario where the reward in some transition takes three values: $-5$, $0$, and $+5$, with an equal probability of $1/3$ for each value. The expected reward is zero, but the moment we see a $+5$, we take that as part of the maximization, and then it never comes down. So $+5$ becomes an estimate of the true reward that otherwise, in expectation, is $0$. This is a positive bias introduced by the maximization step.
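A quick Monte Carlo check makes the bias visible (a hypothetical setup: several actions, each with true expected reward 0 drawn from $\{-5, 0, +5\}$, estimated from a handful of noisy samples):

```python
import numpy as np

# Each of 5 actions has true expected reward 0, but individual rewards are
# noisy: -5, 0, or +5 with probability 1/3 each. Taking the max over the noisy
# sample-mean estimates is positively biased, even though the true max is 0.
rng = np.random.default_rng(0)
n_actions, n_samples, n_trials = 5, 10, 2000
biases = []
for _ in range(n_trials):
    rewards = rng.choice([-5.0, 0.0, 5.0], size=(n_actions, n_samples))
    estimates = rewards.mean(axis=1)     # one sample-mean estimate per action
    biases.append(estimates.max())       # "max of estimates" as "estimate of max"
avg_bias = float(np.mean(biases))        # comes out clearly above 0
```

The bias shrinks as the per-action estimates get more samples, but for any finite amount of data the max over noisy estimates overshoots the true maximum on average.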

One of the ways to remove the positive bias is to use a set of two q-values. One set of q-values is used to find the action that maximizes the q-value, and the other set is then used to find the q-value of that maximizing action. Mathematically, it can be represented as follows:
Replace $\max_{A} Q(S, A)$ with $Q_{1}\left(S, \operatorname{argmax}_{A} Q_{2}(S, A)\right)$.
We are using $Q_{2}$ to find the maximizing action $A$, and then $Q_{1}$ is used to find the maximum q-value. It can be shown that such an approach removes the positive (maximization) bias. We will revisit this concept when we talk about DQN.
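The substitution above can be sketched as a single tabular update (a minimal illustration; `double_q_update`, its defaults, and the coin flip that alternates the two tables are assumptions, not the book's code):

```python
import numpy as np

# Double-learning update for one transition (s, a, r, s2).
def double_q_update(Q1, Q2, s, a, r, s2, done, alpha=0.1, gamma=0.9, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < 0.5:        # with probability 1/2, swap the roles of Q1/Q2
        Q1, Q2 = Q2, Q1           # (both arrays are still updated in place)
    a_star = int(np.argmax(Q1[s2]))                       # one table picks the action...
    target = r if done else r + gamma * Q2[s2, a_star]    # ...the other evaluates it
    Q1[s, a] += alpha * (target - Q1[s, a])
```

Decoupling action selection from action evaluation is the key design choice: the evaluating table's noise is independent of the selecting table's noise, so the evaluated value is an unbiased estimate of the selected action's true value.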

