Hi everybody, I’m Emma Brunskill. I’m an assistant professor in Computer Science, and welcome to CS234, which is a reinforcement learning class designed to be an entry-level introduction to reinforcement learning for master’s and PhD students. So, what we’re gonna do today is I’m gonna start with just a really brief overview of what reinforcement learning is. Then we’re gonna go through course logistics, and when I go through course logistics, I’ll also pause and ask for any questions about logistics. The website is now live, so that and Piazza will be the best sources of information about the class. I’ll stop when we get to that part to ask if there’s anything I didn’t go over that you have questions about, and if you have questions about the wait-list or anything relating to your own circumstances, feel free to come up to me at the end. And then the third part of the class is gonna be where we start to get into the technical content, an introduction to sequential decision making under uncertainty. Just so I have a sense before we get started, who here has taken a machine learning class? All right. Who here has taken AI? Okay. So, a little bit less, but most people. All right. Great. So, probably everybody here has seen a little bit about reinforcement learning. It varies a little bit depending on where you’ve been. We will be covering stuff starting from the beginning as if you don’t know any reinforcement learning, but then we’ll rapidly be getting to other content that’s beyond anything that’s covered in at least other Stanford-related classes. So, reinforcement learning is concerned with this really foundational issue of how an intelligent agent can learn to make a good sequence of decisions. That’s sort of a single sentence that summarizes what reinforcement learning is.
It’s also what we’ll be covering during this class. But it actually encodes a lot of really important ideas. So the first thing is that we’re really concerned now with sequences of decisions. So, in contrast to a lot of what is covered in machine learning, we’re gonna be thinking about agents, an intelligent agent in general that might or might not be human or biological, and how it can make not just one decision but a whole sequence of decisions. The second thing is that we’re gonna be concerned with goodness. In other words, how do we learn to make good decisions, and what we mean by good here is some notion of optimality. We have some utility measure over the decisions that are being made. And the final critical aspect of reinforcement learning is the learning: the agent doesn’t know in advance how its decisions are gonna affect the world or what decisions might be associated with good outcomes, and instead it has to acquire that information through experience. So, when we think about this, this is really something that we do all the time. We’ve done it since we were babies. We try to figure out how to achieve high reward in the world, and there’s a lot of really exciting work going on in neuroscience and psychology that’s trying to think about this same fundamental issue from the perspective of human intelligent agents. And so I think that if we wanna be able to solve AI, or make significant progress, we have to be able to make significant progress in creating agents that do reinforcement learning. So, where does this come up? There’s this nice example from Yael Niv, who’s an amazing psychologist and neuroscience researcher over at Princeton. She gives an example of a sort of primitive creature which develops as follows during its lifetime.
So, when it’s a baby, it has a primitive brain and one eye, and it swims around and it attaches to a rock. And then when it’s an adult, it digests its brain and it sits there. And so maybe this is some indication that the point of intelligence, or the point of having a brain, at least in part, is helping to guide decisions, and so once all the decisions in the agent’s life have been completed, maybe we no longer need a brain. So, this is one example of a biological creature, but I think it’s a useful reminder to think about why an agent would need to be intelligent, and whether that is somehow fundamentally related to the fact that it has to make decisions. Now of course, there’s been really a paradigm shift in reinforcement learning. Around 2015, at the NeurIPS conference, which is one of the main machine learning conferences, David Silver came to a workshop and presented these incredible results of using reinforcement learning to directly control Atari games. Now, these are important whether you like video games or not. Video games are a really interesting example of complex tasks that often take human players a while to learn. We don’t know how to do them in advance. It takes us at least a little bit of experience. And the really incredible thing about this example, this is Breakout, is that the agent learns to play directly from pixel input. So, from the agent’s perspective, it’s just seeing these colored pixels coming in, and it’s having to learn what the right decisions are in order to play the game well, and in fact even better than people. So, it was really incredible that this was possible. When I first started doing reinforcement learning, a lot of the work was really focused on very artificial toy problems. A lot of the foundations were there, but these larger-scale applications were really lacking.
And so I think in the last five years, we’ve seen really a huge improvement in the types of techniques that are being used in reinforcement learning and in the scale of the problems that can be tackled. Now, it’s not just in video game playing. It’s also in things like robotics, and in particular some of my colleagues up at the University of California, Berkeley have been doing some really incredible work on robotics, using reinforcement learning in these types of scenarios to try to have the agents do grasping, fold clothes, things like that. Now, those are some of the examples that, if you’ve looked at reinforcement learning before, you’ve probably heard about. You’ve probably heard about things like video games or robotics. But one of the things that I think is really exciting is that reinforcement learning is actually applicable to a huge number of domains, which is both an opportunity and a responsibility. So, in particular, I direct the AI for Human Impact Lab here at Stanford, and one of the things that we’re really interested in is how we use artificial intelligence to help amplify human potential. One way you could imagine doing that is through something like educational games, where the goal is to figure out how to quickly and effectively teach people material such as fractions. Another really important application area is health care. This is a snapshot from some work on seizures that’s been done by Joelle Pineau up at McGill University, and I think there’s also a lot of excitement right now about how we can use AI, and in particular reinforcement learning, to do things like interact with electronic medical records systems and use them to inform patient treatment.
There’s also a lot of recent excitement about how we can use reinforcement learning in lots of other applications as a kind of optimization technique for when it’s really hard to solve optimization problems. This is arising in things like natural language processing, in vision, and in a number of other areas. So, I think if we have to think about what the key aspects of reinforcement learning are, they probably boil down to the following four, and these are things that are gonna distinguish it from other aspects of AI and machine learning. So, reinforcement learning, from my sentence about learning to make good decisions under uncertainty, fundamentally involves optimization, delayed consequences, exploration, and generalization. Optimization naturally comes up because we’re interested in good decisions. There’s some notion of the relative quality of the different decisions that we can make, and we want to be able to find decisions that are good. The second aspect is delayed consequences. This is the challenge that, for the decisions that are made now, you might not realize whether or not they were good decisions until much later. So, you eat the chocolate sundae now and you don’t realize until an hour later that it was a bad idea to eat all two quarts of ice cream, or, in the case of video games like Montezuma’s Revenge, you have to pick up a key and then much later you realize that’s helpful, or you study really hard now on Friday night and then three weeks later you do well on the midterm. So, one of the challenges here is that because you don’t necessarily receive immediate outcome feedback, it can be hard to solve what is known as the credit assignment problem, which is how you figure out the causal relationship between the decisions you made in the past and the outcomes in the future. And that’s a really different problem than we tend to see in most of machine learning.
So, one of the things that comes up when we start to think about this is how we do exploration. In much of reinforcement learning, the agent is fundamentally trying to figure out how the world works through experience, and so we think about the agent as really kinda being a scientist, trying things out in the world, like an agent that tries to ride a bicycle and learns about physics and how balancing on a bike works by falling. And one of the really big challenges here is that data is censored, and what we mean by censoring in this case is that you only get to learn about what you try to do. So, all of you are here at Stanford; clearly that was the optimal choice. But you don’t actually get to figure out what it would have been like if you’d gone to MIT. It’s possible that would’ve been a good choice as well, but you can’t experience that, because you only get to live one life and so you only get to see the particular choice you made at this particular time. So, one question you might wonder about is policies; we’re gonna talk a lot about policies. A policy, a decision policy, is gonna be some mapping from experiences to a decision. And you might ask why this needs to be learned. So, think about something like DeepMind’s Atari-playing agent. What it was learning from was pixels. So, it was essentially learning, from the space of images, what to do next. And if you wanted to write that down as a program, a series of if-then statements, it would be absolutely enormous. This is not tractable. So, this is why we need some form of generalization, and why it may be much better for us to learn from data directly, as well as to have some high-level representation of the task, so that even if we then run into a particular configuration of pixels we’ve never seen before, our agent can still know what to do.
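To get a rough feel for why a hand-written table of if-then rules over raw pixels is not tractable, the short Python sketch below counts how many distinct screens such a table would have to cover. The frame size (210 by 160 pixels) and the 128-color palette are assumptions about the Atari setting that are not stated in this lecture; only the order of magnitude matters here.

```python
import math

# Assumed Atari frame parameters (not given in the lecture).
HEIGHT, WIDTH = 210, 160
NUM_COLORS = 128


def num_digits_of_frame_count(height: int, width: int, colors: int) -> int:
    """Return the number of decimal digits in colors ** (height * width).

    Logarithms are used so the huge integer itself is never built.
    """
    pixels = height * width
    return math.floor(pixels * math.log10(colors)) + 1


digits = num_digits_of_frame_count(HEIGHT, WIDTH, NUM_COLORS)
print(f"Distinct possible frames: a number with {digits} decimal digits")
```

Even if almost all of these frames never occur in play, a rule table keyed on exact pixel configurations cannot cover the ones that do, which is the motivation for a policy that generalizes.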
So, these are the four things that really make up reinforcement learning, at least online reinforcement learning, and why is that different from some other types of AI and machine learning? So, another thing that comes up a lot in artificial intelligence is planning. For example, the game of Go can be thought of as a planning problem. So, what does planning involve? It involves optimization, often generalization, and delayed consequences. You might make a move in Go early on and it might not be immediately obvious whether that was a good move until many steps later. But it doesn’t involve exploration. The idea in planning is that you’re given a model of how the world works. So, you’re given the rules of the game, for example, and you know what the reward is, and the hard part is computing what you should do given the model of the world. So, it doesn’t require exploration. Next, supervised machine learning versus reinforcement learning. Supervised learning often involves optimization and generalization, but frequently it doesn’t involve either exploration or delayed consequences. It doesn’t tend to involve exploration because typically in supervised learning you’re given a data set. So, your agent isn’t collecting its own experience or data about the world; instead it’s given experience, and it has to use that to, say, infer whether an image is a face or not. Similarly, it’s typically making essentially one decision, like whether this image is a face or not, instead of having to think about making decisions now and then only learning later whether or not those were the right decisions. Unsupervised machine learning also involves optimization and generalization, but generally does not involve exploration or delayed consequences, and typically you have no labels about the world. So, in supervised learning, you often get the exact label for the world, like whether this image really contains a face or not.
In unsupervised learning you normally get no labels about the world, and in RL you typically get something kind of halfway in between those, in which you get a utility for the label you put down. So, for example, you might decide that there’s a face in here and it might say, “Okay, yeah, we’ll give you partial credit for that,” because maybe there’s something that looks sort of like a face. But you don’t get the true label of the world. Or maybe you decide to go to Stanford, and then you don’t know. You’re like, okay, that was a really great experience, but I don’t know if it was “the right experience.” Imitation learning, which is something that we’ll probably touch on briefly in this class and is becoming very important, is similar but a little bit different. It involves optimization, generalization, and often delayed consequences, but the idea is that we’re going to be learning from the experience of others. So, instead of our intelligent agent getting to take experiences from the world and make its own decisions, it might watch another intelligent agent, which might be a person, make decisions and observe outcomes, and then use that experience to figure out how it wants to act. There can be a lot of benefits to doing this, but it’s a little bit different because it doesn’t have to directly think about the exploration problem. I just want to spend a little bit more time on imitation learning because it’s become increasingly important. So, to my knowledge, it was first really popularized by Andrew Ng, who’s a former professor here, through some of his helicopter work, where he was looking at expert flights together with Pieter Abbeel, who’s a professor over at Berkeley, to see how you could very quickly imitate experts flying toy helicopters. And that was one of the first major application successes of imitation learning. It can be very effective.
There can be some challenges to it, because essentially, if you get to observe one trajectory, let’s imagine it’s a helicopter flying in a circle, and your agent learns something that isn’t exactly the same as what the expert was doing, then you can start to go off that path and venture into territory where you really don’t know what the right thing to do is. So, there’s been a lot of work that combines imitation learning and reinforcement learning, and that ends up being very promising. So, in terms of how we think about trying to do reinforcement learning, we can build on a lot of these different types of techniques, and then also think about some of the challenges that are unique to reinforcement learning, which involves all four of these challenges. And so these RL agents really need to explore the world and then use that exploration to guide their future decisions. We’ll talk more about this throughout the course. A really important question that comes up is where these rewards come from: where is this information that the agents are using to try to judge whether or not their decisions are good, who is providing it, and what happens if it’s wrong? And we’ll talk a lot more about that. We won’t talk very much about multi-agent reinforcement learning systems, but that’s also a really important case, as well as thinking about game-theoretic aspects. All right. So, that’s just a really short overview of some of the aspects of reinforcement learning and why it’s different from some of the other classes that you might have taken. And now we’re gonna go briefly through course logistics and then start on more of the content, and I’ll pause after course logistics to answer any questions. In terms of prerequisites, we expect that everybody here has taken either an AI class or a machine learning class, either here at Stanford or the equivalent at another institution.
And if you’re not sure whether or not you have the right background for the class, feel free to reach out to us on Piazza and we will respond. If you’ve done extensive work in related areas, it will probably be sufficient. In general, we expect that you have basic Python proficiency and that you’re familiar with probability, statistics, and multivariable calculus. Things like gradient descent, losses, and derivatives should all be very familiar to you. And I expect that most people have probably heard of MDPs before, but it’s not totally critical. So, this is a long list [LAUGHTER] but I’ll go through it slowly because I think it’s pretty important. So, these are the goals for the class, the learning objectives. These are the things that we expect you should be able to do by the time you finish this class, and it’s our role to help you understand how to do these things. So, the first thing is that it’s important to be able to define the key features of reinforcement learning that distinguish it from other types of AI and machine learning problem framings. That’s what I was doing a little bit of so far in this class: figuring out how RL distinguishes itself from other types of problems. So, related to that, most of you will probably not end up being academics, and most of you will go into industry. And so, one of the big challenges when you do that is, when you’re faced with a particular problem from your boss, or when you’re giving a problem to one of your supervisees, to think about whether or not it should be framed as a reinforcement learning problem, and what techniques are applicable to it.
So, I think it’s very important that by the end of this class, if you’re given a real-world problem like web advertising or patient treatment or a robotics problem, you have a sense of whether or not it is useful to formulate it as a reinforcement learning problem, how to write it down in that framework, and what algorithms are relevant. During the class, we’ll also be introducing you to a number of reinforcement learning algorithms, and you will have the chance to implement those in code, including for deep reinforcement learning problems. Another really important aspect, if you’re trying to decide what tools to use for a particular, say, robotics problem or health care problem, is to understand which of the algorithms is likely to be beneficial and why. And so, in addition to things like empirical performance, I think it’s really important to understand, generally, how we evaluate algorithms, and whether we can use theoretical tools like regret and sample complexity, as well as things like computational complexity, to decide which algorithms are suitable for particular tasks. And then the final thing is that one really important aspect of reinforcement learning is exploration versus exploitation. This is the issue that arises when the agents have to figure out what decisions they wanna make and what they’re gonna learn about the environment by making those decisions. And so, by the end of the class, you should also be able to compare different types of techniques for doing exploration versus exploitation, and say what the strengths and limitations of these are. Does anyone have any questions about what these learning objectives are? Okay. So, we’ll have three main assignments for the class, and we’ll also have a midterm. We’ll have a quiz at the end of the class, as well as a final project. The quiz is a little bit unusual, so I just want to spend a little bit of time talking about it right now.
The quiz is done both individually and in groups. The reason that we do this is because we want a low-stakes way to have people practice with the material that they learn in the second half of the course, in a way that’s fun and engaging and really tries to get you to think about it and also learn from your peers. We did it last year, and I think a number of people were a little bit nervous beforehand about how it would go and then ended up really enjoying it. So, the way that the quiz works is that it’s a multiple-choice quiz. At the beginning, everybody does it by themselves, and then after everybody has submitted their answers, we do it again in groups that are pre-assigned by us. And the goal is that you have to get everyone to agree on what the right answer is before you scratch off and see what the correct answer is. And then we grade it according to whether or not you scratched off the right answer first. You can’t do worse than your individual grade, so doing it in a group can only help you. SCPD students don’t do it in groups, so they just write down justifications for their answers. Again, it’s a pretty lightweight way to do assessment. The goal is that you have to be able to articulate why you believe the answers are what they are, discuss them in small groups, and then use that to figure out what the correct answer is. The final project is pretty similar to other projects that you’ve done in other classes. It’s an open-ended project. It’s a chance to reason about and think about reinforcement learning in more depth. We will also be offering a default project that will be announced over the next couple of weeks, before the first milestone is due.
If you choose to do the default project, your grade breakdown, because you will not need to do a proposal or milestone, will be based on the project presentation and your assignment write-up. Since we believe that you are all each other’s best resource, we use Piazza, and that should be used for pretty much all class communication unless it’s something that’s a private or sensitive matter, in which case of course please feel free to reach out to the course staff directly. For things like lectures, homework, and project questions, pretty much all of that should go through Piazza. For the late day policy, we have six late days; for details you can see the webpage, and for collaboration please also see the webpage for the details. So before we go on to the next part, does anyone have any questions about logistics for the class? Okay, let’s get started. So, we’re now going to do an introduction to sequential decision making under uncertainty. A number of you have seen some of this content before. We will be going into it in probably more depth than you’ve seen for some of this material, including some theory, not theory today but in other lectures, and then we’ll also be moving on to content that will be new to all of you later in the class. So, sequential decision making under uncertainty. The fundamental thing that we think about in these settings is an interactive closed-loop process, where we have some agent, an intelligent agent hopefully, that is taking actions that affect the state of the world, and then the world is giving back an observation and a reward. The key goal is that the agent is trying to maximize the total expected future reward.
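The closed loop just described can be sketched in a few lines of Python. This is only an illustrative skeleton, not course code: the class names `CoinFlipWorld` and `AlwaysOneAgent`, the payoff probabilities, and the `run` helper are all hypothetical, chosen to show the action, observation, reward cycle and the accumulation of total reward.

```python
import random


class CoinFlipWorld:
    """Toy stochastic world (hypothetical): action 1 pays off more often than action 0."""

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def step(self, action: int):
        """Apply the action and return (observation, reward)."""
        p_success = 0.8 if action == 1 else 0.2
        reward = 1.0 if self.rng.random() < p_success else 0.0
        observation = reward  # in this toy world the agent just observes the outcome
        return observation, reward


class AlwaysOneAgent:
    """Placeholder agent with a fixed decision rule (no learning yet)."""

    def act(self, observation) -> int:
        return 1


def run(agent, world, num_steps: int):
    """Run the agent-world loop and return (total reward, history of triples)."""
    history = []        # (action, observation, reward) triples
    total_reward = 0.0
    observation = None  # nothing has been observed before the first action
    for _ in range(num_steps):
        action = agent.act(observation)
        observation, reward = world.step(action)
        history.append((action, observation, reward))
        total_reward += reward
    return total_reward, history


total, history = run(AlwaysOneAgent(), CoinFlipWorld(seed=0), num_steps=100)
print(total, len(history))
```

Because the world is stochastic, a single run gives one sample of the total reward; the objective in the lecture is the expectation of that quantity over the randomness.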
Now, this expected aspect is going to be important because sometimes the world itself will be stochastic, and so the agent is going to be maximizing things in expectation. This may not always be the right criterion. It is what has been focused on for the majority of reinforcement learning, but there’s now some interest in thinking about distributional RL and some other aspects. One of the key challenges here is that it can require balancing between immediate and long-term rewards, and it might require strategic behavior in order to achieve those high rewards, meaning that you might have to sacrifice higher initial rewards in order to achieve better rewards over the long term. So as an example, in something like web advertising, you might have an agent that is running the website and has to choose which web ad to show to a customer. The customer gives you back an observation, such as how long they spent on the web page, and you also get some information about whether or not they clicked on an ad, and the goal is to get people to click on ads as much as possible. So you have to pick which ad to show people so that they’re going to click on ads. Another example is a robot that’s unloading a dishwasher. In this case the action space of the agent might be joint movements. The information the agent gets back might be a camera image of the kitchen, and it might get a plus one reward if there are no dishes on the counter. So in this case it would generally be a delayed reward. For a long time there are going to be dishes on the counter, unless it can just sweep all of them off and have them crash onto the floor, which may or may not be the intended goal of the person who’s writing the system. And so, it may have to make a sequence of decisions where it can’t get any reward for a long time.
Another example is something like blood pressure control, where the actions might be things like prescribing exercise or prescribing medication, and we get an observation back of what the blood pressure of the individual is. Then the reward might be plus one if the blood pressure is in a healthy range, maybe a small negative reward if medication is prescribed, due to side effects, and maybe zero reward otherwise. So, let’s think about another case, like some of the cases that I think about in my lab, like having an artificial tutor. So now what you could have is a teaching agent, and what it gets to do is pick an activity, so pick a teaching activity. Let’s say it only has two different types of teaching activities to give: it’s going to give either an addition activity or a subtraction activity, and it gives this to a student. Then the student gets the problem either right or wrong. And let’s say the student initially does not know addition or subtraction. So, it’s a kindergartner, the student doesn’t know anything about math, and we’re trying to figure out how to teach the student math. And the reward structure for the teaching agent is that it gets a plus one every time the student gets something right and a minus one if the student gets it wrong. So, I’d like you to just take a minute, turn to somebody nearby, and describe what you think an agent that’s trying to learn to maximize its expected rewards would do in this type of case, what type of problems it would give to the student, and whether or not that is doing the right thing.
Let me just clarify here: let’s assume that for most students addition is easier than subtraction, so that, like what it says here, even though the student doesn’t know either of these things, the skill of addition is simpler for a new student to learn than subtraction. So what might happen in that case? Does anybody want to raise their hand and tell me what they and somebody nearby were thinking might happen for an agent in this scenario? The agent would give them really easy addition problems, that’s correct. That’s exactly what actually happened. There’s a nice paper from approximately 2000 with Bev Woolf, which is one of the earliest ones that I know of, where they’re using reinforcement learning to create an intelligent tutoring system, and the reward was for the agent to give problems to the student in order to get them correct. Because, you know, if the student’s getting things correct, then they’ve learned them. But the problem here is that with that reward specification, what the agent learns to do is to give really easy problems. Maybe the student doesn’t know how to do those initially, but then they quickly learn how, and then there’s no incentive to give hard problems. So this is just a small example of what is known as reward hacking, [LAUGHTER] which is that your agent is gonna learn to do exactly what you tell it to do in terms of the reward function that you specify, and yet in reinforcement learning, we often spend very little of our time thinking carefully about what that reward function is. So, whenever you go out and try this in the real world, this is the really, really critical part. Normally, it is the designer that gets to pick what the reward function is; the agent does not have an intrinsic internal reward, and so depending on how you specify it, the agent will learn to do different things. Yeah, was there a question in the back?
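The reward hacking in the tutoring example can be reproduced in a toy calculation. The Python sketch below is not the system from the paper mentioned above; it is a minimal model under stated assumptions: the student's chance of answering correctly starts at zero for both skills, practicing a skill moves that chance toward one, and the `LEARN_RATE` values (addition learned faster than subtraction) are made up for illustration. It compares the expected return of three fixed teaching policies under the plus one / minus one reward.

```python
# Assumed learning rates: addition is easier to pick up than subtraction.
LEARN_RATE = {"addition": 0.5, "subtraction": 0.1}


def expected_return(policy, horizon=20):
    """Expected total reward of a fixed teaching policy, plus the student's final skills.

    policy: function mapping the time step t to an activity name.
    Reward is +1 for a correct answer and -1 for a wrong one, as in the lecture.
    """
    skill = {"addition": 0.0, "subtraction": 0.0}  # probability of a correct answer
    total = 0.0
    for t in range(horizon):
        activity = policy(t)
        p = skill[activity]
        total += p * 1.0 + (1.0 - p) * (-1.0)  # expected reward for this step
        skill[activity] = p + LEARN_RATE[activity] * (1.0 - p)  # practice improves skill
    return total, skill


policies = {
    "always addition": lambda t: "addition",
    "alternate": lambda t: "addition" if t % 2 == 0 else "subtraction",
    "always subtraction": lambda t: "subtraction",
}

results = {name: expected_return(policy) for name, policy in policies.items()}
for name, (ret, skill) in results.items():
    print(f"{name:>18}: expected return {ret:6.2f}, "
          f"final subtraction skill {skill['subtraction']:.2f}")
```

Under these assumptions the policy that only ever gives addition problems earns the highest expected return while leaving the subtraction skill untouched, which is the mismatch between the specified reward and the designer's actual goal.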
In this case, it seems like the student would also be an RL agent, and in real life the student might ask for harder questions. So are there techniques to approach that, or is it okay that we ignore that part? So, the question was to say, well, you know, we also think that people are probably reinforcement learning agents as well, and that’s exactly correct, and maybe they would start to say, “Hey, I need to get harder questions,” or be interactive in this process. For most of this class we’re going to ignore the fact that the world that we interact with might itself also be an RL agent. In reality it’s really critical. Sometimes this is considered in an adversarial way, like in game theory. I think one of the most exciting things to me is when we think about it in a cooperative way. So, who here has heard about the sub-discipline of machine teaching? Nobody yet. So, it’s a really interesting new area that’s been around for maybe 5-10 years, some of it a little bit beyond that. One of the ideas there is, what happens if you have two intelligent agents that are interacting with each other, where each knows that the other is trying to help them? So there’s a really nice classic example. Sorry for those of you that aren’t so familiar with machine learning, but imagine that you’re trying to learn a classifier to decide where along this line things are either positive or negative. So in general you’re going to need some number of samples, say n, where that’s sort of the number of points on the line for which you have to get positive or negative labels. If you’re in an active learning setting, generally I think you can reduce that to roughly log n by being strategic about asking people to label particular points on the line. One of the really cool things for machine teaching is that, if I know you are trying to teach me where to divide this line, you’ll only need one point, or at most two points, essentially constant, right?
Because if I’m trying to teach you, there’s no way I’m just going to randomly label things. I’m just going to label for you a single plus and a single minus, and that’s going to tell you exactly where the line goes. So that’s one of the reasons why, if the agent knows that the other agent is trying to teach it something, it can actually be enormously more efficient than what we normally think of for learning. And so I think there’s a lot of potential for machine teaching to be really effective. But all that said, we’re going to ignore most of that for the course. If it’s something you want to explore in your project, you’re very welcome to. There are a lot of connections with reinforcement learning. Okay. So, if we think about this process in general, if we think of a sequential decision making process, we have this agent. We’re going to think almost always about there being discrete time. So, the agent is going to make a decision, it’s going to affect the world in some way, and the world is going to give back some new observation and a reward. The agent receives those and uses them to make another decision. So, in this case, when we think about a history, what we mean by history is simply the sequence of previous actions that the agent took, and the observations and rewards it received. Then the second thing that’s really important is to define a state space. Again, often when this is first discussed, it’s sort of thought about as some immutable thing. But whenever you’re in a real application, this is exactly what you have to define: how to write down the representation of the world. What we’re going to assume in this class is that the state is a function of the history. So, there might be other sensory information that the agent would like to have access to in order to make its decision, but it’s going to be constrained to the observations it has received so far, the actions it has taken, and the rewards it has observed.
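To make "the state is a function of the history" concrete, here is a minimal Python sketch. The function names and the wall-sensor observation strings are made up for illustration and are not from the lecture. It contrasts two possible choices of state function: one that keeps only the most recent observation and one that keeps the whole history.

```python
from typing import List, Optional, Tuple

# One time step of interaction: (action taken, observation received, reward received).
Step = Tuple[str, str, float]
History = List[Step]


def observation_state(history: History) -> Optional[str]:
    """State = the most recent observation only (compact, but may not be Markov)."""
    return history[-1][1] if history else None


def history_state(history: History) -> tuple:
    """State = the entire history (always a valid choice, but it grows over time)."""
    return tuple(history)


# Two hypothetical histories of different length that end with the same sensor reading.
history_a: History = [
    ("right", "walls_up_down", 0.0),
    ("down", "walls_left_right", 0.0),
]
history_b: History = history_a + [("down", "walls_left_right", 0.0)]

same_local_state = observation_state(history_a) == observation_state(history_b)
same_full_state = history_state(history_a) == history_state(history_b)
print(same_local_state, same_full_state)
```

Both functions use only the actions, observations, and rewards seen so far, so both are legitimate states in the sense above. They differ in how much of the history they retain.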
Now, there’s also going to be some real-world state. So, that’s the real world. The agent doesn’t necessarily have access to the real world. It may have access only to a small subset of the real world. So, for example, as a human, right now, I have eyes that allow me to look forward, you know, roughly 180 degrees. But I can’t see behind my head. But behind my head is still part of the world state. So, the world state is the real world, and then the agent has its own state space that it uses to try to make decisions. So, in general, we’re going to assume that it is some function of the history. Now, one assumption that we’re going to use a lot in this class, which you guys have probably seen before, is the Markov assumption. The Markov assumption simply says that we’re going to assume that the state used by the agent is a sufficient statistic of the history, and that in order to predict the future, you only need to know the current state of the environment. So, it basically indicates that the future is independent of the past given the present, if in the present you have the right aggregate statistic. So, as a couple of examples of this. Yeah, question? Would you just explain, maybe with an example, the difference again between the state and the history? I’m having trouble differentiating. Yeah. So, the state, if we think about something like a robot. So let’s say you have a robot that is walking down a long corridor. Okay. Let’s say there are two long corridors. Okay. So, this is where your robot starts, and it tries to go right, right, and then it goes down, down, down. Okay. Let’s say its sensors are just that it can observe whether there is a wall on any of its sides. So, the observation space of the robot is simply: is there a wall on each of its four sides? I’m sorry, it’s probably a little bit small in the back.
But the agent basically has, you know, some sort of local sensing via a laser range finder or something like that. So, it knows whether or not there’s a wall immediately around its square, and nothing else. So, in this case, what the agent would see is that initially the walls look like this, and then like this, and then like this, and then like this. The history would include all of this. But its local state is just this. So, the local state could just be the current observation. That starts to be important when you’re going down here, because there are many places that look like that. So, if you keep track of the whole history, the agent can figure out where it is. But if it only keeps track of where it is locally, then a lot of partial aliasing can occur. So, I put up a couple of examples here. In something like hypertension control, you can imagine the state is just the current blood pressure, and your action is whether to take medication or not. So, current blood pressure meaning, you know, every second, for example, what is your blood pressure? So, do you think this sort of system is Markov? I see some people shaking their heads. Almost definitely not. Almost definitely there are other features that have to do with, you know, maybe whether or not you’re exercising, whether or not you just ate a meal, whether it’s hot outside, whether you just got on an airplane. All these other features probably affect whether or not your next blood pressure is going to be high or low, particularly in response to some medication. Similarly, in something like website shopping, you can imagine the state is just what product you’re looking at right now. So, like, I open up Amazon, I’m looking at some, you know, computer, and that’s up on my webpage right now, and the action is what other products to recommend. Do you think that system is Markov? The system is not Markov? Do you mean the system generally?
Or that the assumption is Markov and it doesn’t fit? The question is whether the system generally is not Markov, or whether the assumption just doesn’t fit. Let me give some more detail on this. What I mean here is whether this particular choice of representing the system is Markov. So, there’s the real world going on, and then there’s the model of the world that the agent can use. What I’m arguing here is that these particular models of the world are not Markov. There might be other models of the world that are. But if we choose this particular observation, say just the current blood pressure, as our state, that is probably not really a Markov state. Now, it doesn’t mean that we can’t use algorithms that treat it as if it is. It is just that we should be aware that we might be violating some of those assumptions. Yeah? I’m wondering, if you include enough history in a state, can you make the problem Markov? Okay. It’s a great question. So, why is it so popular? Can you always make something Markov? Generally, yes. If you include all the history, then you can always make the system Markov. In practice, often you can get away with just using the most recent observation, or maybe the last four observations, as a reasonably sufficient statistic. It depends a lot on the domain. There are certainly domains, maybe like the navigation world I put up there, where it’s really important to either use the whole history as the state or think about the partial observability, and other cases where, you know, maybe the most recent observation is completely sufficient. Now, one of the challenges here is you might not want to use the whole history, because that’s a lot of information [LAUGHTER] and you have to keep track of it over time. And so it’s much nicer to have a sufficient statistic.
Of course, some of these things are changing a little bit with LSTMs and other things like that. So, some of our prior assumptions about how things scale with the size of the state space are changing a little bit right now with deep learning. But historically, certainly, there have been advantages to having a smaller state space. And again, historically, there have been a lot of implications for things like computational complexity, the data required, and the resulting performance, depending on the size of the state space. So, just to give some intuition for why that might be: if you made your state everything that’s ever happened to you in your life, that would give you a really, really rich representation. But you’d only have one data point for every state. There would be no repeating. So, it’s really hard to learn, because all states are different. And in general, if we want to learn how to do something, we’re going to need either some form of generalization or some form of clustering or aggregation, so that we can compare experiences and learn from prior similar experience in order to decide what to do. So, if we assume that your observation is your state, so the most recent observation that the agent gets, we’re going to treat that as the state, then the agent is modeling the world as a Markov decision process. So, it is thinking of taking an action, getting an observation and a reward, and it’s setting the environment state it’s using to be the observation. If it is treating the world as partially observable, then the agent state is not the same as the world state, and it uses things like the history, or beliefs about the world state, to aggregate the sequence of previous actions taken and observations received, and uses that to make its decisions. For example, in something like poker, you get to see your own cards.
Other people have cards that are clearly affecting the course of the game, but you don’t actually know what those are. You can see which cards are discarded. And so that’s somewhere where it’s naturally partially observable. And so you can maintain a belief state over what the other players’ cards are, and you can use that information to make your decisions. And similarly, often in health care there’s a whole bunch of really complicated physiological processes going on, but you can monitor parts of them, for things like, you know, blood pressure or temperature, et cetera, and then use that in order to make decisions. So, in terms of types of sequential decision making processes, one of them is bandits. We’ll talk more about this later in the term. Bandits are sort of a really simple version of a Markov decision process, in the sense that the idea is that the actions that are taken have no influence over the next observation. So, when might this be reasonable? Let’s imagine that you have a series of customers coming to your website and you show each of them an ad. And then they either click on it or not, and then you get another customer logging into your website. So, in this case, the ad that you show to customer one generally doesn’t affect which customer two comes along. Now, it could, maybe in really complicated ways. Maybe customer one goes to Facebook and says, “I really, really loved this ad, you should go watch it.” But most of the time, whatever ad you show to customer one does not at all affect who next logs into your website. And so the decisions you make only affect the local, the first customer, and then customer two is totally independent. Bandits have been really, really important for at least 50 years. People have thought about them for things like clinical trials, how to allocate people to clinical trials. People think of them for websites and a whole bunch of other applications.
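As a rough illustration of the ad example, here is a small Python sketch of a bandit-style interaction. The function name, the click probabilities, and the uniformly random ad choice are all assumptions made for illustration. The point it tries to show is structural: nothing is carried over from one customer to the next, so the ad shown to one customer has no effect on what happens with the next one.

```python
import random


def run_ad_bandit(click_probs, num_customers, seed=0):
    """Show one ad per customer and record clicks.

    click_probs[i] is the (hypothetical) chance that a customer clicks ad i.
    No state is carried between customers: each round is independent.
    """
    rng = random.Random(seed)
    shows = [0] * len(click_probs)
    clicks = [0] * len(click_probs)
    for _ in range(num_customers):
        ad = rng.randrange(len(click_probs))       # placeholder rule: pick an ad at random
        clicked = rng.random() < click_probs[ad]   # outcome depends only on this ad
        shows[ad] += 1
        clicks[ad] += int(clicked)
    # Empirical click-through rate per ad (0.0 if an ad was never shown).
    rates = [c / s if s else 0.0 for c, s in zip(clicks, shows)]
    return shows, clicks, rates


shows, clicks, rates = run_ad_bandit([0.05, 0.10, 0.02], num_customers=1000)
print(shows, rates)
```

In an MDP, by contrast, the loop would also have to track a state that the chosen action can change.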
MDPs and POMDPs say, no, wait, the actions that you take can affect the state of the world. They often affect the next observation you get, as well as the reward. And you have to think about this closed-loop system of the actions that you’re taking changing the state of the world. So, the product that I recommend to my customer might affect what the customer’s opinion is on the next time step. In fact, you hope it will. And so in these cases we think about the actions actually affecting the state of the world. So, another important question is how the world changes. One idea is that it changes deterministically. So, when you take an action in a particular state, you go to a different state, but the state you go to is deterministic. There’s only one. And this is a pretty common assumption in a lot of robotics and controls. I remember Tomás Lozano-Pérez, who’s a professor over at MIT, once suggesting to me that if you flip a coin, it’s actually a deterministic process. We’re just modeling it as stochastic. We don’t have good enough models. So, there are many processes where, if you could write down a sufficiently perfect model of the world, it would actually look deterministic. But in many cases it may be hard to write down those models, and so we’re going to approximate them as stochastic. And the idea is that when we take an action, there are many possible outcomes. So, you could show an ad to someone and they may or may not click on it, and we may just want to represent that with a stochastic model. So, let’s think about a particular example. If we think about something like a Mars rover, when we deploy rovers or robots on really far-off planets, it’s hard to do communication back and forth. So, it’d be nice to be able to make these sorts of robots more autonomous. Let’s imagine that we have a very simple Mars rover that’s thinking about a seven-state system. So, it’s just landed.
It’s got a particular location, and it can either try to go left or try to go right. I write down “try left” or “try right” meaning that that’s what it’s going to try to do, but maybe it’ll succeed or fail. Let’s imagine that there are different sorts of scientific information to be discovered, and so over in S1 there’s a little bit of useful scientific information, but over at S7 there’s an incredibly rich place where there might be water. And then there’s zero in all other states. So, we’ll go through that as a little bit of an example as I start to talk about different common components of an RL agent. So, one common component is a model. A model is simply going to be a representation the agent has for what happens in the world as it takes its actions, and what rewards it might get. So, in the case of a Markov decision process, it’s simply a model that says, if I start in this state and I take this action A, what is the distribution over next states I might reach? And it is also going to have a reward model that predicts the expected reward of taking an action in a certain state. So, in this case, let’s imagine that the reward model of the agent is that it thinks there’s zero reward everywhere. And let’s imagine that it thinks its motor control is very bad, and so it estimates that whenever it tries to move, with 50% probability it stays in the same place and with 50% probability it actually moves. Now, the model can be wrong. So, if you remember what I put up here, the actual reward is that in state S1 you get plus one, in state S7 you get 10, and everywhere else you get zero. And the reward I just wrote down here is that it’s zero everywhere. So, this is a totally reasonable reward model the agent could have. It just happens to be wrong. And in many cases the model will be wrong, but often it can still be used by the agent in useful ways.
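The agent's model from this example can be written down in a few lines of Python. This is a minimal sketch with some assumptions that the lecture does not state: the names are made up, states S1 to S7 are stored as indices 0 to 6, and trying to move off either end of the row is assumed to leave the rover where it is.

```python
NUM_STATES = 7                                   # S1..S7 stored as indices 0..6
ACTIONS = {"try_left": -1, "try_right": +1}


def build_transition_model(move_prob=0.5):
    """Agent's estimated dynamics: model[(s, a)][s_next] = P(s_next | s, a).

    With probability move_prob the rover moves one step in the intended
    direction, otherwise it stays put. Assumption: the ends act as walls.
    """
    model = {}
    for s in range(NUM_STATES):
        for action, step in ACTIONS.items():
            target = min(max(s + step, 0), NUM_STATES - 1)
            dist = [0.0] * NUM_STATES
            dist[target] += move_prob            # the move succeeds
            dist[s] += 1.0 - move_prob           # the rover stays in place
            model[(s, action)] = dist
    return model


transition_model = build_transition_model()

# The agent's (wrong) reward model versus the true rewards from the example.
agent_reward_model = [0.0] * NUM_STATES
true_reward = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0]

print(transition_model[(3, "try_right")])        # distribution over next states from S4
```

The transition model and the reward model are separate objects, so one can be reasonable while the other is wrong, as in the lecture's example.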
So, the next important component that is always needed by an RL agent is a policy. A policy, or decision policy, is simply how we make decisions. Now, because we’re thinking about Markov decision processes here, we’re going to think about policies as being mappings from states to actions. A deterministic policy simply means there’s one action per state. And stochastic means you can have a distribution over the actions you might take. So, maybe every time you drive to the airport, you flip a coin to decide whether you’re going to take the back roads or whether you’re going to take the highway. So, as a quick check, imagine that in every single state we do the action try right. Is this a deterministic policy or a stochastic policy? Deterministic, great. We’ll talk more about why deterministic policies are useful and when stochastic policies are useful shortly. Now, the value function is the expected discounted sum of future rewards under a particular policy. So, it’s a weighting. It’s saying how much reward do I think I’m going to get, both now and in the future, weighted by how much I care about immediate versus long-term rewards. The discount factor gamma is going to be between zero and one. And so the value function allows us to say how good or bad different states are. So, again, in the case of the Mars rover, let’s imagine that our discount factor is zero and our policy is to try to go right. And in this case, say this is our value function. It says that the value of being in state S1 is plus one, the value of being in S7 is 10, and everything else is zero. Again, this might or might not be the correct value function. It depends also on the true dynamics model, but this is a value function that the agent could have for this policy. It simply tells us what is the expected discounted sum of rewards you’d get if you follow this policy starting in this state, where you weigh each reward by gamma to the power of the number of time steps at which you reach it.
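The weighting described above, each reward multiplied by gamma raised to the number of time steps until it is received, can be sketched in Python as follows. The function name and the reward sequence are illustrative assumptions. The value function is the expectation of this quantity over the trajectories the policy can generate, while the sketch computes it for one given sequence of rewards.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a single sequence of rewards r_0, r_1, ..."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Hypothetical rewards seen by the rover: start in S6 (reward 0), then reach
# S7 and stay there for two steps (reward 10 each step).
rewards_from_s6 = [0.0, 10.0, 10.0]

myopic = discounted_return(rewards_from_s6, gamma=0.0)     # only the first reward counts
farsighted = discounted_return(rewards_from_s6, gamma=0.5)
print(myopic, farsighted)
```

With gamma equal to zero only the immediate reward survives, which is why the value function in the example matches the per-state rewards. With a larger gamma, the reward at S7 also contributes to the value of states that lead to it.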
So, when we think about, yeah? So, if we wanted to extend the discount factor in this example, would there be, like, an increasing value or decreasing value to the reward depending on how far it went? Yes. The question was, if gamma was not 0 here, and gamma being 0 here indicates that essentially we just care about immediate rewards, whether or not, if I understood correctly, you’d start to see rewards flow into other states, and the answer is yes. So, we’ll see more of that next time, but if the discount factor is non-zero, then it basically says you care about not just the immediate reward you get, you’re not just myopic, you care about the reward you’re going to get in the future. So, in terms of common types of reinforcement learning agents, some of them are model-based, which means they maintain in their representation a direct model of how the world works, like a transition model and a reward model. And they may or may not have a policy or a value function. They always have to compute a policy. They have to figure out what to do. But they may or may not have an explicit representation for what they would do in any state. Model-free approaches have an explicit value function and/or a policy function, and no model. Yeah? Going back to the earlier slide, I’m confused about when the value function is evaluated in this setting. So, why is it not S6 that has a value of 10, because if you try right at S6 you get to S7? You’re asking, when do we think of the rewards happening? We’ll talk more about that next time. Really, there are many different ways people think of where the rewards happen. Some people think of it as the reward happening for the current state you’re in. Some people think of it as the reward for the state you’re in and the action you take.
And another common definition is r(s, a, s’), meaning that you don’t see what reward you get until you transition. In this particular definition that I’m using here, we’re assuming that rewards happen when you’re in that state. All of them are basically isomorphic, but we’ll try to be careful about which one we’re using. The most common one we’ll use in the class is r(s, a), which says that when you’re in a state and you choose a particular action, then you get a reward, and then you transition to your next state. Great question. Okay. So, when we think about reinforcement learning agents, and whether or not they’re maintaining these models and these values and these policies, we get a lot of intersection. So, I really like this figure from David Silver, where he thinks about RL algorithms or agents as mostly falling into these three different classes. They either have a model, or an explicit policy, or an explicit value function. And then there’s a whole bunch of algorithms that are in the intersection of these. So, things like actor-critic often have an explicit value function and policy. And what do I mean by explicit? I mean that often they have a way so that if you give me a state, I could tell you what the value is, and if I give you a state, you could tell me immediately what the policy is, without additional computation. So, actor-critic combines value functions and policies. There are a lot of algorithms that are also in the intersection of all of these different ones. And often in practice it’s just very helpful to maintain many of these, and they have different strengths and weaknesses.
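One way to see why the reward conventions mentioned above are close to interchangeable: a reward written as r(s, a, s') can be turned into an r(s, a) reward by averaging over the next state. Here is a minimal Python sketch of that conversion. The dictionary layout, the function name, and the numbers are assumptions made for illustration, not something given in the lecture.

```python
def expected_reward(transition, reward_sas, state, action):
    """r(s, a) = sum over s' of P(s' | s, a) * r(s, a, s')."""
    return sum(
        prob * reward_sas.get((state, action, next_state), 0.0)
        for next_state, prob in transition[(state, action)].items()
    )


# Hypothetical numbers: from S6, "try_right" reaches S7 half of the time,
# and the reward of 10 is only paid on the transition into S7.
transition = {("S6", "try_right"): {"S6": 0.5, "S7": 0.5}}
reward_sas = {("S6", "try_right", "S7"): 10.0}

r_sa = expected_reward(transition, reward_sas, "S6", "try_right")
print(r_sa)
```

The averaged r(s, a) gives the same expected reward at every step, so expected discounted sums are unchanged. What does change is which state the reward is credited to, which is the point of the student's question about S6 and S7.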
For those of you that are interested in the theoretical aspects of learning theory, there’s some really cool recent work that explicitly looks at what the formal foundational differences are between model-based and model-free RL. It just came out of MSR, Microsoft Research, in New York, and it indicates that there may be a fundamental gap between model-based and model-free methods, which on the deep learning side has been very unclear. So, feel free to come ask me about that. So, what are the challenges in learning to make good decisions in this sort of framework? One is this issue of planning that we talked about a little bit before, which is that even once I’ve got a model of how the world works, I have to use it to figure out what decisions I should make, in a way that I think is going to allow me to achieve high reward. And in this case, if you’re given a model, you could do this planning without any interaction with the real world. So, if someone says, here’s your transition model and here’s your reward model, you can go off and do a bunch of computations, on your computer or on paper, and decide what the optimal action is, and then go back to the real world and take that action. It doesn’t require any additional experience to compute that. But in reinforcement learning, we have this other additional issue, that we might want to think about not just what I think is the best thing for me to do given the information I have so far, but what is the way I should act so that I can get the information I need in order to make good decisions in the future. So, it’s like, you know, let’s say you move to a new town, there’s only one restaurant, you go there the first day, and they have five different dishes. You’re going to be there for a long time, and you want to optimize and find the best dish.
And so maybe the first day you try dish one, and the second day you try dish two, and then the third day dish three, et cetera, so that you can try everything, and then use that to figure out which one is best, so that over the long term you pick something that is really delicious. So, in this case the agent has to think explicitly about what decisions it should take so it can get the information it needs, so that in the future it can make good decisions. So, in the case of planning, and the fact that this is already a hard problem: if you think about something like solitaire, you could already know the rules of the game. This is also true for things like Go or chess or many other scenarios. And you could know, if you take an action, what the probability distribution over the next state would be, and you can use this to compute a potential score. And so, using things like tree search or dynamic programming, and we’ll talk a lot more about these, particularly the dynamic programming aspect, you can decide, given a model of the world, what is the right decision to make. But reinforcement learning itself is a little bit more like solitaire without a rule book, where you’re just playing things and observing what is happening, and you’re trying to get larger reward. And you might use your experience to explicitly compute a model and then plan in that model, or you might not, and you might directly compute a policy or a value function. Now, I just want to reemphasize here this issue of exploration and exploitation. So, in the case of the Mars rover, it’s only going to learn about how the world works for the actions it tries. So, in state S2, if it tries to go left, it can see what happens there. And then from there it can decide the right next action.
Now, this is obvious, but it can lead to a dilemma, because the agent has to be able to balance between things that seem like they might be good based on its prior experience, and things that might be good in the future, where perhaps it just got unlucky before. So, in exploration we’re often interested in trying things that we’ve never tried before, or trying things that so far might have looked bad, but that we think in the future might be good. But in exploitation we’re trying things that are expected to be good given the past experience. So, here are three examples of this. In the case of movies, exploitation is like watching your favorite movie, and exploration is watching a new movie that might be good or might be awful. In advertising, exploitation is showing the ad that has yielded the highest click-through rate so far, and exploration is showing a different ad. And in driving, exploitation is trying the fastest route given your prior experience, and exploration is driving a different route. [inaudible]. Great question, which is: imagine, for that example that I gave, that you’re only going to be in town for five days. Would the policy that you would compute in that case, if you’re in a finite horizon setting, be the same or different as one where you know you’re going to live in this town for all of infinite time? We’ll talk a little bit more about this next time, but very different. And in particular, normally the policy if you only have a finite horizon is non-stationary, which means that the decision you will make depends on the time step as well as the state. In the infinite horizon case, the assumption is that the optimal policy in the Markov setting is stationary, which means that if you’re in the same state, whether you’re there on time step three or time step 3,000, you will always do the same thing. But in the finite horizon case that’s not true. And as a critical example of that: so, why do we explore?
We explore in order to learn information that we can use in the future. So, if you’re in a finite horizon setting and it’s your last day in Hollywood and you’re trying to decide what to do, you’re not going to explore, because there’s no benefit from exploration for the future, because you’re not making any more decisions. So in that case you will always exploit. It’s always optimal to exploit. So, in the finite horizon case, the decisions you make have to depend on the value of the information you gain to change your decisions, and on the remaining horizon. And this often comes up in real cases. Yeah? How much more complicated is it if there’s a finite horizon but you don’t know where it is? It’s just something I remember from game theory, that this tends to be very complicated. How does this [inaudible]? The question is about what I would call indefinite horizon problems, where there is a finite horizon but you don’t know what it is. That can get very tricky. One way to model it is as an infinite horizon problem with termination states. So, there are some states which are essentially sink states. Once you get there, the process ends. This often happens in games. You don’t know when the game will end, but it’s going to be finite. And that’s one way to put it into the formalism, but it is tricky. In those cases we tend to model it as infinite horizon and look at the probability of reaching different termination states. [inaudible] do you mix exploitation and exploration, essentially as subproblems? I guess particularly for driving, it seems like it would be better to kind of exploit paths you know are really good and maybe explore on some [inaudible] you don’t know are as good, rather than trying, like, a completely brand new route. The question is about how this mix of exploration and exploitation happens, and maybe in the case of cars, maybe you would, sort of, not try things totally randomly.
You might need some evidence that they might be good. It’s a great question. Generally it is better to intermix exploration and exploitation. In some cases it is optimal, or at least equivalent, to do all your exploration early and then use all of that information for later, but it depends on the decision process. And we’ll spend a significant chunk of the course after the midterm thinking about exploration and exploitation. It’s definitely a really critical part of reinforcement learning, particularly in high-stakes domains. What do I mean by high-stakes domains? I mean domains that affect people. So, whether it’s customers or patients or students, that’s where the decisions we make actually affect real people, and so we want to try to learn as quickly as possible and make good decisions as quickly as we can. Any other questions about this? If you’re in a state that you haven’t seen before, do you have any better option than to just take a random action to get out of there? Or can you use your previous experience even though you’ve never been there before? The question is great. It’s saying, if you’re in a new state you’ve never been in before, what do you do? Can you do anything better than random? Or can you somehow use your prior experience? One of the really great things about doing generalization is that we’re going to use state features, either learned by deep learning or from some other representation, to try to share information. So, even though the state might not be one you’ve ever exactly visited before, you can share prior information to try to inform what might be a good action to do. Of course, if you share in the wrong direction, you can make the wrong decision. So, if you overgeneralize, you could overfit to your prior experience when in fact there’s a better action to do in the new scenario. Any questions on this? Okay.
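As one concrete way to intermix exploration and exploitation, here is a short Python sketch of an epsilon-greedy rule applied to the five-dish restaurant example. Epsilon-greedy is a common simple strategy that this lecture does not name, and it is not claimed to be optimal. The function name, the dish qualities, and the noise model are made up for illustration.

```python
import random


def epsilon_greedy_dishes(true_means, num_days, epsilon=0.1, seed=0):
    """Pick one dish per day: explore with probability epsilon, else exploit.

    true_means[i] is the hypothetical average enjoyment of dish i, which is
    unknown to the agent. The agent only sees noisy rewards for dishes it tries.
    """
    rng = random.Random(seed)
    num_dishes = len(true_means)
    counts = [0] * num_dishes          # how many times each dish was tried
    estimates = [0.0] * num_dishes     # running average reward for each dish
    total_reward = 0.0
    for _ in range(num_days):
        if rng.random() < epsilon:
            dish = rng.randrange(num_dishes)                           # explore
        else:
            dish = max(range(num_dishes), key=lambda i: estimates[i])  # exploit
        reward = rng.gauss(true_means[dish], 1.0)                      # noisy experience
        counts[dish] += 1
        estimates[dish] += (reward - estimates[dish]) / counts[dish]   # update the average
        total_reward += reward
    return counts, estimates, total_reward


counts, estimates, total_reward = epsilon_greedy_dishes(
    true_means=[1.0, 2.0, 3.0, 2.5, 0.5], num_days=200
)
print(counts)
```

Setting epsilon to zero on the final day would mirror the finite horizon point from earlier: when no decisions remain, there is nothing left to gain from exploring.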
So, one of the things we’re going to be talking about over the next few lectures is these two really fundamental problems, which are evaluation and control. So, evaluation is the problem of saying, if someone gives you a policy, if they’re like, hey, this is what you should do, or this is what your agent should do, this is how your robot should act in the world, evaluate how good it is. So, we want to be able to figure out, you know, your manager says, “Oh, I think this is the right way we should show ads to customers, can you tell me how good it is? What’s the click [inaudible]?” So one really important question is evaluation, and, you know, you might not have a model of the world. So, you might have to go out and gather data to try to evaluate this policy. It’d be useful to know how good it is. You’re not trying to make a new policy yet, you’re just trying to see how good this current one is. And then the control problem is optimization. It’s saying, let’s try to find a really good policy. This typically involves evaluation as a sub-component, because often we’re going to need to know, what does best mean? Best means a really good policy. How do we know how good the policy is? We need to do evaluation. Now, one of the really cool aspects of reinforcement learning is that often we can do this evaluation off policy, which means we can use data gathered from other policies to evaluate the counterfactual of what different policies might do. This is really helpful because it means we don’t have to try out all policies exhaustively. So, in terms of what these questions look like, if we go back to our Mars rover example, for policy evaluation it would be if someone says, your policy is this: in all of your states, the action you should take is try right. This is the discount factor I care about. Please compute for me, or evaluate for me, what is the value of this policy? In the control case, they would say, I don’t know what the policy should be.
I just want you to give me whatever policy has the highest expected discounted sum of rewards, and there’s actually a key question here, which is: okay, expected discounted sum of rewards from what? So, they might care about a particular starting state. They might say, I want you to figure out the best policy assuming I’m starting from S4. They might say, I want you to compute the best policy from all starting states, or sort of some average. So, in terms of the rest of the course, where we get... yeah? I was just wondering if it’s possible to learn the optimal policy and the reward function simultaneously? For example, if I had some belief about what the reward would be for some action in a state, and that turned out to be wrong, would we have to start over and train again to find the optimal policy, or could I use what I’ve learned so far, together with a revised belief about the rewards that [inaudible]? Great question, which is: okay, let’s say I have a policy to start with and I’m evaluating it, and I don’t know what the reward function is and I don’t know what the optimal policy is, and it turns out this [inaudible] isn’t very good. Do I need to sort of restart, or can I use that prior experience to inform what’s the next policy I try, or perhaps a whole suite of different policies? In general you can use the prior experience to inform what the next policy is that you try, or the next suite of policies. There’s a little bit of a caveat there, which is that you need to have some stochasticity in the actions you take. So, if you only ever take the same one action in a state, you can’t really learn about any other actions you could take. So, you need to assume some sort of generalization or some sort of stochasticity in your policy in order for that information to be useful for evaluating other policies. This is a really important issue.
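To make the evaluation and control questions a bit more concrete, here is a minimal sketch. It is not something worked through in the lecture; it assumes a seven-state chain standing in for the Mars Rover, with deterministic moves, a reward of +1 in S1 and +10 in S7, and a discount factor of 0.5. All of those numbers, and the names `evaluate_policy` and `find_best_policy`, are illustrative assumptions. The control half uses value iteration, which is one standard way to do the optimization when the model is known, not necessarily the method the course will emphasize.

```python
# Hypothetical stand-in for the Mars Rover: states S1..S7 are indices 0..6.
N_STATES = 7
LEFT, RIGHT = 0, 1
GAMMA = 0.5  # assumed discount factor
# Assumed rewards, received for being in a state: +1 in S1, +10 in S7.
REWARD = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0]


def next_state(s, a):
    """Assumed deterministic dynamics: move one cell, with walls at both ends."""
    return max(s - 1, 0) if a == LEFT else min(s + 1, N_STATES - 1)


def evaluate_policy(policy, gamma=GAMMA, tol=1e-10):
    """Evaluation: value of a fixed policy from every starting state."""
    v = [0.0] * N_STATES
    while True:
        new_v = [REWARD[s] + gamma * v[next_state(s, policy[s])]
                 for s in range(N_STATES)]
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v


def find_best_policy(gamma=GAMMA, tol=1e-10):
    """Control: value iteration, returning a greedy policy and its values."""
    v = [0.0] * N_STATES
    while True:
        # q[s][a] is the value of taking action a in s, then acting greedily.
        q = [[REWARD[s] + gamma * v[next_state(s, a)] for a in (LEFT, RIGHT)]
             for s in range(N_STATES)]
        new_v = [max(row) for row in q]
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return [row.index(max(row)) for row in q], new_v
        v = new_v


# Evaluation question: how good is "always try right"?
always_right = [RIGHT] * N_STATES
values = evaluate_policy(always_right)
print("Value of always-right starting from S4:", values[3])

# Control question: which policy has the highest discounted sum of rewards?
best_policy, best_values = find_best_policy()
print("Best action per state (0 = left, 1 = right):", best_policy)
```

Under these assumed numbers, "always try right" is worth about 2.5 starting from S4, and the best policy tries left in S1 and S2 and right everywhere else. Because both functions return a value for every state, the "from what?" question becomes a matter of which entry you read off, or how you average over entries.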
Reusing old data like this is the issue of counterfactual reasoning: how do we use our old data to figure out how we should act in the future, if the old policies may not be the optimal ones? So, in general we can, and we’ll talk a lot about that. It’s a really important issue. So, we’re first going to start off talking about Markov decision processes and planning, and talking about how we do this evaluation both when we know how the world works, meaning that we are given a transition model and reward model, and when we’re not. Then we’re also going to talk about model-free policy evaluation and then model-free control. We’re going to then spend some time on deep reinforcement learning, and reinforcement learning in general with function approximation, which is a hugely growing area right now. I thought about making a plot of how many papers are coming out in this area right now; it’s pretty incredible. And then we’re going to talk a lot about policy search, which I think in practice, particularly in robotics, is one of the most influential methods right now, and we’re going to spend quite a lot of time on exploration, as well as have a few advanced topics. So, just to summarize, what we’ve done today is talk a little bit about reinforcement learning and how it differs from other aspects of AI and machine learning. We went through course logistics and started to talk about sequential decision making under uncertainty. Just as a quick note for next time, we will try to post the lecture slides two days in advance, or by the end of, you know, the evening of two days in advance, so that you can print them out if you want to have them in class. And I’ll see you guys on Wednesday.