An overview of the applications of reinforcement learning to robot programming: discussion on the literature and the potentials

There has been remarkable progress in the field of robotics over the past few years, whether in stationary robots that perform dynamically changing tasks in the manufacturing sector or in automated guided vehicles (AGVs) for warehouse management and space exploration. The use of artificial intelligence (AI), especially reinforcement learning (RL), has contributed significantly to the success of various robotics tasks, demonstrating that the shift toward intelligent control paradigms is both feasible and effective. A fascinating aspect of RL is its ability to function simultaneously as a low-level controller and as a high-level decision-making tool. Consider a manipulator robot whose task is to guide itself through an environment with irregular and recurring obstacles. At the low level, RL can supply joint angles that joint-trajectory controllers execute as smooth motion; at the higher level, RL can plan complex paths designed to avoid obstacles and self-collisions. An important aspect of the successful operation of an AGV is the ability to make timely decisions. When Convolutional Neural Network (CNN) based perception is combined with RL, agents can direct AGVs to their destinations effectively, mitigating the risk of catastrophic collisions. Even though many of these challenges can be addressed with classical solutions, devising such solutions takes a great deal of time and effort, making the process quite expensive. With an eye on different categories of RL applications to robotics, this study provides an overview of the use of RL in robotic applications, examining the advantages and disadvantages of state-of-the-art applications. Additionally, we provide a targeted comparative analysis between classical robotics methods and RL-based robotics methods. Along with drawing conclusions from our analysis, an outline of the future possibilities and advancements that may accelerate the progress and autonomy of robotics is provided.


Introduction
With the advancements in the fields of AI and robotics, it is captivating to consider that many of the tasks humans currently perceive as complex will eventually be taken over by robots. The rapid advancements in this field indicate the swift growth of the industry. One driver of this development is the demand for increased production volume that accompanies population growth [1]. Humanoid robots and collaborative robots (cobots) are recent, promising advancements aimed at closing this production gap. Notably, many of these achievements are made possible with the help of AI. From behavior control, sensor fusion, and object recognition to adaptive control, path planning, and optimization, a multitude of tasks can be handled by implementing AI [7]. Though Machine Learning (ML) is a vast field and many of its approaches can be used to solve the aforementioned tasks, RL (a subfield of ML) has been emerging as a promising approach for dealing with them. Many experiments have been made with RL in robotics, and it has already managed to solve problems at both higher and lower levels. The bipedal walker trained using parametrized RL in [2] is a great example of low-level control: using a Hybrid Zero Dynamics (HZD) gait library, a set of walking patterns for the bipedal walker (20 joints) is interpreted and learned with the PPO algorithm. Path planning for AGVs using RL serves as a reasonable example of high-level planning: the intricacies of robot control (joint control) are managed by underlying systems such as Proportional-Integral-Derivative (PID) controllers, while the decisions on which path to take are made by the RL agent. In the experiment described in [3], an RL agent based on the Improved Dueling Double Deep Q-Network (ID3QN) algorithm was used to navigate a nuclear environment; the action space for this particular application is discrete.
Overall, it can be noted that interest in RL for robotics has been on the rise, and in this paper the latest research findings in this field are discussed. The section on comparative analysis between RL advancements and conventional approaches aims to shed light on whether RL methodologies can replace conventional approaches. Toward the end, drawing on the presented research, a discussion of further possibilities for using RL to handle many of the repetitive tasks currently performed by humans is also included.

Background in Reinforcement Learning
RL, as stated by R. S. Sutton in his book [4], is learning what to do: the agent (or learner) is not given explicit instructions on which action to take or how to learn; instead, it goes through a series of interactions with the environment to determine whether a specific action is sensible at a given time. Compared to other branches of ML such as supervised learning (models trained on labeled datasets with explicit input-output pairs, aiming to generalize patterns and make predictions) or unsupervised learning (which finds hidden patterns in a given dataset), RL makes complex decisions through dynamic interaction with the environment, relying on a feedback mechanism of rewards or penalties.
For a better understanding of the need for a reward, an analogy can be used. Consider a cyclist learning to ride a bicycle. In the initial attempts, the cyclist falls off the bicycle. Through successive attempts employing diverse pedaling techniques, the learner eventually identifies an effective approach, enabling them to propel the bicycle a notable distance. In this context, falling off the bicycle can be associated with a negative reward, while propelling the bicycle a certain distance corresponds to a favorable outcome, i.e., a positive reward. The ultimate idea behind RL is to find an optimal policy, a definite mapping from states to actions that leads to the maximum possible reward [4]. The complete process of interaction with an environment is formalized as a Markov Decision Process (MDP), denoted by the tuple $\langle S, A, P, R, \gamma \rangle$. At timestep $t$, the agent receives a state $S_t$ from the environment. Depending on $S_t$, the agent applies an action $A_t$ to the environment, which in turn yields a new state $S_{t+1}$ and a reward $R_{t+1}$ [28].
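To make this interaction loop concrete, the following minimal Python sketch steps through the MDP cycle described above; the `env` and `agent` interfaces are hypothetical stand-ins used for illustration, not a specific library API.

```python
# Minimal sketch of the agent-environment interaction loop of an MDP.
# `env` and `agent` are hypothetical interfaces, not a specific library API.

def run_episode(env, agent, max_steps=1000):
    state = env.reset()                # initial state S_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)      # A_t sampled from the policy pi(.|S_t)
        next_state, reward, done = env.step(action)  # yields S_{t+1}, R_{t+1}
        agent.observe(state, action, reward, next_state)
        total_reward += reward
        state = next_state
        if done:                       # terminal state reached
            break
    return total_reward
```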
The cumulative (discounted) reward formed through a series of interactions can be defined mathematically as

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \quad (1)$$

The discount rate $\gamma \in [0, 1]$ determines whether the agent perceives immediate rewards as more important ($\gamma$ is smaller) or distant rewards ($\gamma$ is larger).
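As a small illustration of equation (1), the following sketch computes the discounted return for every step of a recorded episode using the backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards through the episode: G_t = R_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: with gamma = 0.5, rewards [1, 1, 1] give returns [1.75, 1.5, 1.0].
```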
Building on this return, the value function $V^\pi(x) = \mathbb{E}_\pi[G_t \mid S_t = x]$ gives the expected cumulative reward (1) obtained by starting from a state $x \in X$ and following a specific policy $\pi$ thereafter. The optimal policy, which maximizes the value function, is denoted $\pi^*$.
From Bellman's optimality principle, the optimal value function is

$$V^*(x) = \max_{a \in A} \Big[ R(x, a) + \gamma \sum_{x' \in X} P(x' \mid x, a)\, V^*(x') \Big]$$

Though a variety of approaches exist in RL, one of the widely used methods is the actor-critic method, which is based on temporal-difference learning [5]. In this method there are two separate networks, the policy (actor) and the value network (critic), and they are learned simultaneously.
The procedure of the actor-critic (AC) method is as follows: at each time step $t$, the actor perceives the present system state $s_t$ and executes an action $a_t$ following a policy $\pi$. Consequently, a new system state $s_{t+1}$ is reached along with a numerical reward $r_{t+1}$ [5]. The temporal-difference (TD) error at that time step is then computed from this information as

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

The TD error is then used to update the estimate of the optimal value function. When the action $a_t$ yields a positive TD error, $\delta_t > 0$, the agent is guided to prefer that action in similar states, while an action that leads to a negative TD error, $\delta_t < 0$, is discouraged, thereby being avoided in similar states.
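The following minimal tabular sketch mirrors this procedure; the tabular value estimates, softmax action preferences, and learning rates are illustrative simplifications of the function approximators used in [5].

```python
import numpy as np

def actor_critic_step(V, prefs, s, a, r, s_next, gamma=0.99,
                      alpha_critic=0.1, alpha_actor=0.05):
    """One tabular actor-critic update driven by the TD error delta_t.

    V     : 1-D array of state-value estimates (the critic)
    prefs : 2-D array of action preferences; a softmax over each row
            gives the policy (the actor)
    """
    # TD error: delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
    delta = r + gamma * V[s_next] - V[s]
    # Critic: move V(s_t) toward the TD target
    V[s] += alpha_critic * delta
    # Actor: reinforce a_t when delta > 0, discourage it when delta < 0
    prefs[s, a] += alpha_actor * delta
    return delta
```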

Current advancements in RL with robotics
Robotics, when compared to other fields, is more complex to optimize due to the larger action spaces required for optimal control [6]. It is unrealistic to assume an observation space without a certain degree of noise in the system. Moreover, the complete state of the robot might not be observable, in which case it has to be modeled as a partially observable state [7]. Another problem faced by RL in robotics is sample efficiency: the robot has to go through a long series of iterations to gather a considerable amount of training data [27]. This is very expensive both in terms of the cost of running the experiments and the time needed to gather the data. The current fast-paced development in the field of simulation helps in dealing with this problem. The research in [8] presents several methods of sim-to-real transfer with RL for robotics; a variety of simulation methods are compared and many intricacies regarding the transfer are discussed, including domain randomization (see the sketch after this paragraph), domain adaptation, learning with disturbances, and the choice of simulator. Realistic simulators such as Nvidia Isaac Sim further narrow the gap toward the real environment, which can be credited to the realistic appearance of the simulation enabled by ray tracing. An agent trained inside such a simulation can be transferred to the real robot with a minimal sim-to-real gap. A few experiments with Nvidia Isaac Sim have already been made and proven efficient in carrying out the task; an example can be seen in [10]. Further, the experiments in [31], based on [10], focus on solving the task of visiting a set of consecutive points under various time constraints (staying at a point for a defined time). Various RL algorithms were tested, and TRPO was found to be the most promising for the defined task. While many applications of RL in robotics took off with the introduction of Proximal Policy Optimization (PPO) [11], which gave exceptional results in handling complex tasks with high-dimensional state and action spaces, multi-agent RL as well as imitation-learning-based RL for robotics are also being extensively researched.
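As an illustration of the domain-randomization idea discussed in [8], the sketch below resamples simulator parameters at every episode reset so the policy does not overfit to one configuration; the parameter names, ranges, and `sim` API are invented for illustration and would depend on the simulator in use.

```python
import random

# Hypothetical ranges over which physical parameters are resampled each episode.
RANDOMIZATION_RANGES = {
    "friction":     (0.5, 1.5),
    "mass_scale":   (0.9, 1.1),   # scale factor applied to nominal link masses
    "sensor_noise": (0.0, 0.02),  # std-dev of additive observation noise
}

def randomized_reset(sim):
    """Reset a (hypothetical) simulator with freshly sampled dynamics."""
    params = {name: random.uniform(lo, hi)
              for name, (lo, hi) in RANDOMIZATION_RANGES.items()}
    sim.set_parameters(params)    # assumed simulator API
    return sim.reset(), params
```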
The implementation provided by [12] is an intriguing use of multi-agent RL for controlling a redundant robot. In contrast to many other RL approaches, where all joints are controlled by a single agent, the authors defined an agent for each joint. The actor-critic implementation, with two value functions and a kinematics-learning approach, provided an online inverse-kinematics solution without controllability problems. The approach was also validated against Jacobian-based and neural-network methods.
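A minimal sketch of this one-agent-per-joint decomposition is given below; the `JointAgent` class is a hypothetical stand-in for the actor-critic agents of [12], shown only to illustrate how the full action vector is assembled.

```python
class JointAgent:
    """Hypothetical stand-in for one per-joint actor-critic agent."""
    def __init__(self, joint_index, policy):
        self.joint_index = joint_index
        self.policy = policy          # maps the shared state to one joint command

    def act(self, state):
        # Each agent observes the full manipulator state but commands
        # only its own joint.
        return self.policy(state, self.joint_index)

def assemble_action(agents, state):
    """The manipulator's action vector: one command per joint, one agent each."""
    return [agent.act(state) for agent in agents]
```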

Comparative Analysis: Reinforcement Learning vs. Conventional Approaches
A lot of the applications presently handled by robots rely on conventional methods. This applies to both manipulator robots and mobile robots. The problem with conventional methods is that devising such a method for solving a task takes a considerable amount of time and effort, which is a challenge in dynamic environments where the applications constantly change.
The problem worsens when the same robot has to be reused for a new task, which requires reprogramming the robot rather than deploying a ready-made, task-specific robot. In this section, the aim is to give an overview of conventional approaches compared to RL approaches for both mobile and manipulator robots. The methods are compared in terms of feasibility and the effectiveness of the application. An interesting property of RL is that it can also be used alongside conventional methods, which is discussed in a later section.
PID control is the most widely used feedback control mechanism for stabilizing a system. Its three components, proportional, integral, and derivative, work together to minimize the error between the desired output and the system's actual output (a textbook sketch is given after this paragraph). When mobile robots are examined, navigating a real environment involves various aspects of detecting and bypassing static and dynamic obstacles. One such example is [13], where the authors controlled robots by integrating fuzzy logic with a PID controller: the sensor outputs are passed into a fuzzy-logic controller, whose outputs are then fed to the PID controller for a more accurate result. Here as well, low-level control is emphasized through the combined fuzzy-logic/PID mechanism. Another interesting study is [14], where non-holonomic wheeled mobile robots are studied in simulation for trajectory tracking. A tracking controller based on Fractional-Order PID (FOPID) is used, and the FOPID parameters are obtained with a particle swarm optimization (PSO) algorithm. The objective is to reject the disturbances involved in trajectory tracking.
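For reference, a textbook discrete-time PID update of the kind these controllers build on; the gains and time step in the example are illustrative.

```python
class PID:
    """Textbook discrete-time PID: u = Kp*e + Ki*integral(e) + Kd*de/dt."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt                 # accumulate error
        derivative = (error - self.prev_error) / self.dt # rate of change
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example with illustrative gains and a 10 ms control loop:
# controller = PID(kp=1.2, ki=0.1, kd=0.05, dt=0.01)
```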
When RL approaches are investigated for similar problem statements, i.e., trajectory control of mobile robots, it is important to note that a lot of progress has been made in this field. The difficulty in designing controllers for mobile robots such as non-holonomic wheeled robots is that the system is highly non-linear, and environmental dynamics can greatly affect the control of the wheeled robot [15]. Within RL, various approaches have been proposed to tackle this problem. The main solution is online training with RL, which is capable of adapting to unprecedented disturbances encountered by the system. A notable example is [16], where the authors used an actor-critic RL implementation with concurrent continuous-time adaptation of both actor and critic networks. For tuning the networks, a novel algorithm was used with extra non-standard terms specifically for tuning the actor, which guarantees closed-loop dynamic stability. Among more advanced RL approaches, the authors in [17] performed mapless navigation with an agent based on a soft actor-critic (SAC) structure. The agent is provided with target information and laser scan data, and it outputs the linear and angular velocities necessary for autonomous navigation. Conventional autonomous navigation systems must handle localization, build a map of obstacles, and then find an obstacle-free path with search methods such as A-star and Rapidly-exploring Random Trees (RRT); they also require high-precision sensors for accurate estimation. This approach, though not online, sidesteps many of those problems. The authors used the SAC method to design a continuous control system that reaches the target position without collision. In the SAC implementation, a squashed Gaussian policy is adopted, which ensures that the actions generated by the policy remain within the effective action space by using a tanh activation function (sketched below). Comparing the above examples of conventional and DRL methods for mobile robot control, the RL-based methods prove to be considerably more efficient at solving the autonomous navigation task, which can be attributed to their greater adaptability to unknown environments and their efficiency in dealing with variable disturbances.
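The tanh-squashed Gaussian sampling step used in SAC-style policies such as the one in [17] can be sketched as follows; the network producing the mean and log standard deviation is assumed, and the log-probability correction is the standard change-of-variables term for the tanh transform.

```python
import numpy as np

def squashed_gaussian_action(mean, log_std, rng=None):
    """Sample a tanh-squashed Gaussian action, as in SAC-style policies.

    `mean` and `log_std` are assumed to come from the policy network;
    the tanh keeps actions inside the effective action space (-1, 1).
    """
    rng = rng or np.random.default_rng()
    std = np.exp(log_std)
    u = mean + std * rng.standard_normal(np.shape(mean))  # pre-squash sample
    action = np.tanh(u)
    # Gaussian log-density of u, summed over action dimensions
    log_prob = -0.5 * np.sum(((u - mean) / std) ** 2 + 2 * log_std
                             + np.log(2 * np.pi))
    # Change-of-variables correction for the tanh squashing
    log_prob -= np.sum(np.log(1 - action ** 2 + 1e-6))
    return action, log_prob
```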
The manipulator robot, since its inception, has been designed to replace humans in repetitive and labor-intensive tasks. Since the tasks vary, the types of robot manipulators are also varied; for example, robots used for space exploration differ from robots used for industrial purposes such as painting and welding. Manipulator robots can thus be categorized into serial manipulators [18], parallel manipulators [19], mobile manipulators [20], and cable manipulators [21], among others. The most widely used are serial manipulators, which consist of a series of links connected by motor-actuated joints, usually with a fixed base that extends to an end effector. A mobile manipulator usually has a freely moving platform with a serial manipulator fixed to it. Control strategies for manipulator robots have also been developed and put to use extensively over the years, and they often vary with the type of task involved. For example, a task such as pick-and-place needs a control strategy aimed at controlled movement of the end effector, i.e., regulated acceleration, velocity, and payload management, while a manipulator used for surgery requires a control strategy focused on high precision and tactile feedback. The common control method for these robots is PID control. Similar to [20] for a mobile robot, the authors in [22] built a system that tunes a PID controller with fuzzy logic to track the trajectory of a 4-DOF manipulator; the controller acts on camera input that determines the movement based on a series of specified colors. Another widely used approach is adaptive control, where the algorithm adjusts the control parameters to accommodate changes in the robot's dynamics or external conditions. Model Reference Adaptive Control is one of the most used adaptive control methods: a reference model is defined, and the controller adjusts itself to match that model [23]. The advantages of using RL for manipulator robots have increasingly motivated researchers to work on novel RL strategies for varying manipulation tasks. With the introduction of Deep Reinforcement Learning (DRL), especially policy gradient methods [24], the ability to handle continuous action spaces has led to various findings in this field. Interesting research was done in [25], where the authors designed a closed-loop controller based on the DRL algorithm Trust Region Policy Optimization (TRPO), with training taking place inside a simulation. The approximation of the robot's dynamic model was obtained using a Long Short-Term Memory (LSTM) network. The trained controller was able to execute different paths in the robot workspace at varying velocities. An interesting aspect of using RL in robotics is that it can be applied at various levels of control to get the task done. For a manipulator robot, an RL algorithm can directly output the joint positions at each timestep in a closed-loop configuration, receiving feedback signals from the sensors to carry out a specific task. Alternatively, RL can provide only the end-effector positions, dealing with the dynamics of the environment, such as avoiding various obstacles, while guiding the end effector to the goal positions. RL can also handle use cases where all of these contexts are
taken into consideration. In [26], the authors provide a novel Nonlinear Model Predictive Control (NMPC) based RL framework that outputs the pose of the robot for evading a set of obstacles. A parameterized NMPC structure is used to approximate the RL framework's action-value function, and the TD method of the Q-learning algorithm is used to optimize the NMPC parameters, thereby increasing the performance of the control scheme. The experiment was carried out on a robot manipulator with 6 degrees of freedom, and the simulation results proved effective in devising trajectories that avoid collisions between the robot and the obstacles.
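A hedged sketch of the parameter-update step in such an NMPC-as-Q-function scheme is given below; the `q_value` function abstracts the NMPC solve, and both the finite set of candidate actions and the finite-difference gradient are illustrative simplifications of the sensitivity analysis used in practice.

```python
import numpy as np

def td_update_nmpc_params(theta, q_value, s, a, r, s_next, actions,
                          gamma=0.99, lr=1e-3, eps=1e-5):
    """One Q-learning TD step on the parameters theta of an NMPC-based Q-function.

    `q_value(theta, state, action)` is a hypothetical function that solves the
    parameterized NMPC problem and returns its optimal cost as Q(state, action).
    """
    # TD target uses the greedy action in the next state
    q_next = max(q_value(theta, s_next, a2) for a2 in actions)
    q_sa = q_value(theta, s, a)
    delta = r + gamma * q_next - q_sa              # TD error
    # Finite-difference approximation of dQ/dtheta (a real implementation
    # would use NMPC sensitivity analysis instead)
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        theta_p = theta.copy()
        theta_p[i] += eps
        grad[i] = (q_value(theta_p, s, a) - q_sa) / eps
    # Semi-gradient Q-learning update: move Q(s, a) toward the TD target
    return theta + lr * delta * grad
```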
The table below provides a general comparison of conventional and RL-based approaches, inferred from various publications and our own analysis.

Future Prospects and Possibilities
The provided analysis shows that a strong base already exists for unparalleled advancements in the field of RL with robotics. With realistic simulators like Nvidia Isaac Sim, it will become far more feasible to experiment with highly complex settings and then transfer the trained agent to real robots; the experiment in [10] already gives an idea of what sim-to-real transfer looks like. An example of a highly complex setting would be a combination of tasks that need to be done in variable orders to achieve an overall larger task (such as building a car from scratch). For solving such tasks, Multi-Agent Reinforcement Learning (MARL), in which multiple agents collectively solve a task, can be of great use. In [32], a general overview of the state-of-the-art usage of MARL is provided, and it is inferred that with the help of MARL a central controller is generally not required to manage a multi-agent system. A lot of research in the field of MARL is underway, not limited to robotics. The paper [33] is a strong piece of work on using MARL in smart factories, where a multitude of tasks such as scheduling, transportation, human-robot collaboration, maintenance, and energy management were handled by MARL; the investigation found MARL to be a great candidate for solving multiple tasks efficiently. The imitation learning approach (learning to imitate expert behaviour) [34] is another RL method that can be incorporated to solve challenging tasks. It addresses the problem of sparse rewards, which can be a major issue in robotics, where many difficult behaviours can only be specified through a sparse reward configuration. A combination of imitation learning with MARL is yet another technique that can be used to deal with highly complex systems. The research in [35] applies such a technique: an underwater vehicle is trained to avoid obstacles with multi-agent generative adversarial imitation learning, and it was found that the agents were able to generalise to complex tasks once trained on simple ones. Similarly, it would certainly be interesting to test the same approach on manipulator robots working on multiple tasks in a cooperative manner, such as building a car: the expected behaviour of each agent could be imitated in the initial phase of training and then refined while staying close to that initial behaviour.
Overall, the future prospects and possibilities of RL in robotics are numerous, and with ongoing research on Artificial General Intelligence (AGI), many researchers believe that RL can be a pathway toward achieving AGI. This further broadens the scope of what RL could achieve in the sector of robotics.

Conclusion
In this article, the current advancements in the field of RL and robotics are highlighted. For both AGVs and redundant robots, multiple experiments were analysed, and RL's capability across various control strategies was shown to be successful. From the analysis it was also noted that realistic simulations such as Nvidia Isaac Sim further improve the efficiency and feasibility of working with RL and robotics. The comparison of conventional robotics approaches with RL-based approaches likewise states the advantages of using RL in complex robotics tasks. The major takeaway from the comparison is that RL-based approaches adapt more easily to environment changes, which helps drastically reduce development time. Toward the end, a discussion of the further possibilities of RL in robotics is emphasised.

Figure 2: Continuous control with SAC for mobile robot navigation [17]