0:00:09.469,0:00:11.309 Hello, my name is Marshall Kirk McKusick 0:00:11.309,0:00:15.389 and I've been around since dinosaurs and mainframes ruled the world, 0:00:15.389,0:00:18.429 which is to say the sixties and seventies. 0:00:18.429,0:00:22.460 However, by the 1970s a new breed of mammals had begun to show up on the scene, 0:00:22.460,0:00:24.240 known as minicomputers. 0:00:24.240,0:00:28.230 Although they were just toys in the 1970s, they would soon grow 0:00:28.230,0:00:31.689 and take over most of the computing market. 0:00:31.689,0:00:33.150 In 1970, 0:00:33.150,0:00:37.910 at AT&T Bell Laboratories, two researchers, Ken Thompson and Dennis Ritchie, began developing the 0:00:37.910,0:00:39.900 UNIX operating system. 0:00:39.900,0:00:42.040 Ken Thompson, who was an alumnus of Berkeley, 0:00:42.040,0:00:46.100 came back on a sabbatical in 1975, bringing UNIX with him. 0:00:46.100,0:00:47.539 In the year that he was there 0:00:47.539,0:00:51.330 he managed to get a number of graduate students interested in UNIX, 0:00:51.330,0:00:53.940 and by the time he left in 1976 0:00:53.940,0:00:56.829 Bill Joy had taken over running the UNIX system 0:00:56.829,0:01:00.470 and was in fact continuing to develop software for it. 0:01:00.470,0:01:04.339 Bill began packaging up the software that had been developed under Berkeley UNIX 0:01:04.339,0:01:05.779 and distributing it 0:01:05.779,0:01:08.040 as the Berkeley Software Distribution, 0:01:08.040,0:01:12.310 whose name was quickly shortened to simply BSD. 0:01:12.310,0:01:16.330 BSD continued to be distributed in yearly releases for almost fifteen 0:01:16.330,0:01:17.490 years, 0:01:17.490,0:01:21.920 initially under Bill Joy and later under others, including yours truly.
0:01:21.920,0:01:24.860 By the late 1980s interest had begun to grow 0:01:24.860,0:01:27.400 in freely redistributable software, 0:01:27.400,0:01:30.170 so a number of us at Berkeley began separating out 0:01:30.170,0:01:32.649 the AT&T proprietary bits of BSD 0:01:32.649,0:01:35.710 from those parts that were freely redistributable. 0:01:35.710,0:01:40.590 By the time of the final distribution of BSD in 1992, 0:01:40.590,0:01:43.620 the entire distribution was freely redistributable. 0:01:43.620,0:01:45.909 I've given you a capsule history here, 0:01:45.909,0:01:48.009 but if you're interested in the entire story, 0:01:48.009,0:01:50.789 I have this three-and-a-half-hour epic, 0:01:50.789,0:01:54.590 which is available from my website www.mckusick.com, 0:01:54.590,0:01:58.200 that gives the entire history of Berkeley. 0:01:58.200,0:02:00.239 Following the final distribution from Berkeley, 0:02:00.239,0:02:01.450 two groups sprang up 0:02:01.450,0:02:03.600 to continue supporting BSD. 0:02:03.600,0:02:08.080 The first of these was NetBSD, whose primary goal was to support 0:02:08.080,0:02:10.459 as many different architectures as possible, 0:02:10.459,0:02:14.769 everything from your microwave oven all the way up to your Cray X-MP. 0:02:14.769,0:02:19.409 In fact, today NetBSD supports nearly sixty architectures. 0:02:19.409,0:02:22.419 The other group that sprang up was FreeBSD. 0:02:22.419,0:02:28.239 Their goal was to bring up BSD and support as wide a set of devices as possible on the 0:02:28.239,0:02:29.719 PC architecture. 0:02:29.719,0:02:36.549 They also had a goal of trying to make the system as easy to install as possible, to 0:02:36.549,0:02:39.309 attract a wide group of developers. 0:02:39.309,0:02:42.319 I chose to work primarily with the FreeBSD group, 0:02:42.319,0:02:43.740 both doing software 0:02:43.740,0:02:46.140 and also, together with George Neville-Neil, 0:02:46.140,0:02:51.069 writing this book, "The Design and Implementation of the FreeBSD Operating System".
0:02:51.069,0:02:52.060 Together with this book 0:02:52.060,0:02:53.959 I developed a course, 0:02:53.959,0:02:56.500 which runs for twelve chapters 0:02:56.500,0:02:58.179 and thirty hours. 0:02:58.179,0:02:59.749 The purpose of this video 0:02:59.749,0:03:01.089 is to give you a taste 0:03:01.089,0:03:02.819 of that course. 0:03:02.819,0:03:07.249 What follows are excerpts from the first lecture of the course, 0:03:07.249,0:03:11.139 which of course you can also get from my website www.mckusick.com. 0:03:11.139,0:03:13.069 0:03:13.069,0:03:17.739 Enjoy. 0:03:17.739,0:03:22.239 This class is nominally about FreeBSD because, well, 0:03:22.239,0:03:26.379 that's what I know best and that's what the textbook is organized around, 0:03:26.379,0:03:29.979 but the fact of the matter is that it's really 0:03:29.979,0:03:32.339 a class about UNIX, and that 0:03:32.339,0:03:36.539 really covers sort of the broad range of things in the open source arena, such as FreeBSD 0:03:36.539,0:03:37.689 and Linux, 0:03:37.689,0:03:38.899 which of course 0:03:38.899,0:03:41.159 you use a lot, 0:03:41.159,0:03:41.550 and 0:03:41.550,0:03:44.349 it also covers the commercial systems: 0:03:44.349,0:03:46.950 Solaris, HP-UX, 0:03:46.950,0:03:49.279 AIX and so on. 0:03:49.279,0:03:52.419 I am going to tend more towards the open side, 0:03:52.419,0:03:56.389 the open source side of things. So it's really going to be more FreeBSD and Linux than it's 0:03:56.389,0:03:57.579 going to be 0:03:57.579,0:04:00.849 Solaris and HP-UX and so on.
0:04:00.849,0:04:06.959 For the most part, at the level of this course we're dealing with the interfaces to the system, 0:04:06.959,0:04:07.329 and 0:04:07.329,0:04:11.599 the fact of the matter is that those interfaces are highly standardized at this point, 0:04:11.599,0:04:12.060 and 0:04:12.060,0:04:15.280 whether it's FreeBSD or Linux or Solaris or whatever, 0:04:15.280,0:04:19.460 the socket system call has to do the same thing: it has to take the same arguments 0:04:19.460,0:04:20.150 and 0:04:20.150,0:04:23.909 it has to have the same effect, 0:04:23.909,0:04:27.319 and so until you get down to the really nitty-gritty details 0:04:27.319,0:04:29.600 of how they actually go about implementing that, 0:04:29.600,0:04:31.960 the differences are relatively minor. 0:04:31.960,0:04:35.830 So I would say that sixty to seventy percent of the material that I'm covering 0:04:35.830,0:04:40.779 is just as true for FreeBSD as it would be for Linux 0:04:40.779,0:04:42.580 or for Solaris. 0:04:42.580,0:04:44.659 AIX is a little bit 0:04:44.659,0:04:45.629 sort of off in the weeds, 0:04:45.629,0:04:48.709 as is HP-UX, 0:04:48.709,0:04:51.099 but luckily we don't have to worry too much about that. 0:04:51.099,0:04:54.569 Okay, so 0:04:54.569,0:04:59.279 the other thing is that I'm going to assume that all of you have used the system. I get 0:04:59.279,0:05:00.910 really sort of worried when people 0:05:00.910,0:05:04.249 raise their hands and say "Hey, what's a shell?" 0:05:04.249,0:05:07.990 I don't put a lot of code up, but I put up one piece of code and someone asked "Why 0:05:07.990,0:05:11.819 are there two pipe symbols in the middle of that if statement?" 0:05:11.819,0:05:15.740 No, we're not programming the shell, we're programming in C. 0:05:15.740,0:05:19.970 So hopefully you can tell the difference between shell scripts and C code. 0:05:19.970,0:05:21.990 But I am going to assume 0:05:21.990,0:05:24.610 you haven't really looked inside the system.
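As a hedged illustration of both points above, here is a minimal C sketch that assumes only POSIX headers. The socket call takes the same three arguments on FreeBSD, Linux and Solaris (some older Solaris releases additionally need `-lsocket` at link time), and the two pipe symbols in the `if` statement are C's short-circuit logical OR, not a shell pipe. The helper name `open_tcp_socket` is hypothetical and not taken from the course.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open a TCP socket using only POSIX interfaces.
 * Returns a file descriptor, or -1 on failure. */
int open_tcp_socket(void)
{
    /* AF_INET: IPv4 addresses; SOCK_STREAM: reliable byte stream;
     * 0: let the kernel pick the default protocol (TCP). */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int type = 0;
    socklen_t len = sizeof(type);

    /* "||" is C's logical OR: getsockopt() is evaluated only if
     * socket() succeeded, so it never sees an invalid descriptor. */
    if (fd < 0 || getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) < 0
        || type != SOCK_STREAM) {
        if (fd >= 0)
            close(fd);
        return -1;
    }
    return fd;
}
```

The caller owns the returned descriptor and releases it with `close()`.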
0:05:24.610,0:05:28.289 So I'm going to start everything at a very high level. 0:05:28.289,0:05:32.969 The problem is, as I have already discovered, you come from a lot of different sort of 0:05:32.969,0:05:33.819 backgrounds 0:05:33.819,0:05:35.180 and 0:05:35.180,0:05:36.280 levels of knowledge, 0:05:36.280,0:05:37.900 and so 0:05:37.900,0:05:42.620 the way that I find works best to be useful to everybody is a three-pass 0:05:42.620,0:05:43.860 algorithm. 0:05:43.860,0:05:49.060 What I will do is start with a first pass that is a very broad-brush, high-level 0:05:49.060,0:05:50.569 description of what's going on, 0:05:50.569,0:05:54.719 and then I will go back and go through the same material again, but at a lower level of 0:05:54.719,0:05:55.300 detail, 0:05:55.300,0:05:59.939 and then I finally go back and go through it at a very nitty-gritty low level of detail. 0:05:59.939,0:06:04.649 The fact of this is, if you are learning new stuff as I'm doing the high-level pass, 0:06:04.649,0:06:08.649 you are going to be utterly lost by the time I get to the low-level niggly details, 0:06:08.649,0:06:10.699 but since I'm going to do it topic by topic, 0:06:10.699,0:06:14.190 when I get to the end of one of those really low-level niggly details 0:06:14.190,0:06:17.900 I'll give you a cue: I will say "Brain reset, I'm starting a new topic," so even if 0:06:17.900,0:06:19.330 you're completely lost 0:06:19.330,0:06:23.530 you can start listening again, because I'm going to bring the broad brush back out. 0:06:23.530,0:06:27.059 Okay, and for those of you that know a lot of this stuff already, 0:06:27.059,0:06:31.770 you'll probably find the broad brush rather boring, 0:06:31.770,0:06:35.759 but by the time we get down to the really low-level details I think you'll actually 0:06:35.759,0:06:37.860 pick up some things that you will find 0:06:37.860,0:06:39.710 useful and interesting.
0:06:39.710,0:06:43.759 So in this way hopefully everybody will get some 0:06:43.759,0:06:47.699 useful percentage of material out of the course. 0:06:47.699,0:06:49.599 I am going to start out by just 0:06:49.599,0:06:53.089 walking through and giving you the 0:06:53.089,0:06:56.919 outline of what we're going to try and do here. 0:06:56.919,0:07:01.169 As I said, we're going to go roughly 0:07:01.169,0:07:03.270 just about two-and-a-half hours of lecture, 0:07:03.270,0:07:04.729 about two hours forty minutes, 0:07:04.729,0:07:06.499 per week, 0:07:06.499,0:07:07.619 and 0:07:07.619,0:07:11.770 so we will start off this week with an introduction. 0:07:11.770,0:07:13.860 As I said, we're going to start from the top 0:07:13.860,0:07:15.749 and then just start working our way down, 0:07:15.749,0:07:19.350 so the general thing I'm going to do is to talk about the interface, 0:07:19.350,0:07:21.439 which is something that you 0:07:21.439,0:07:25.319 are presumably fairly familiar with since you've worked with the system, 0:07:25.319,0:07:27.249 and then 0:07:27.249,0:07:29.739 I have to sort of lay out terminology: 0:07:29.739,0:07:32.080 although we use normal English words, 0:07:32.080,0:07:34.419 they have 0:07:34.419,0:07:38.580 sometimes rather bizarre meanings compared to their common usage, 0:07:38.580,0:07:39.220 and 0:07:39.220,0:07:42.330 so I will just sort of lay out the terminology and lay out 0:07:42.330,0:07:45.750 the way we talk about how the system is structured, 0:07:45.750,0:07:50.780 and this week we will also talk about the basic services: "What is it that the kernel is 0:07:50.780,0:07:52.929 providing for us?" 0:07:52.929,0:07:54.060 And then of course 0:07:54.060,0:07:58.499 we'll proceed to dive down in and see how that is done. 0:07:58.499,0:07:59.970 So here in 0:07:59.970,0:08:01.400 Week number 2 0:08:01.400,0:08:05.450 we're going to look at the system from the perspective of 0:08:05.450,0:08:07.039 something that
0:08:07.039,0:08:08.720 manages processes. 0:08:08.720,0:08:12.170 One way of looking at the kernel is that it's really just 0:08:12.170,0:08:16.440 a resource manager, and the resources that it's managing are things having to do with processes. 0:08:16.440,0:08:19.460 So we'll look at a process, what the structure of it is, 0:08:19.460,0:08:20.649 and 0:08:20.649,0:08:23.559 talk about the different ways that processes can be structured. 0:08:23.559,0:08:28.379 A process can, for example, be an address space that has one thread running in it, or it can have 0:08:28.379,0:08:29.749 multiple threads running in it, 0:08:29.749,0:08:34.620 so we'll talk about the different ways that we think of a process. 0:08:34.620,0:08:38.480 We will look at the management of those processes: 0:08:38.480,0:08:39.239 we've got 0:08:39.239,0:08:42.020 to lay out the bits and pieces that need to be managed 0:08:42.020,0:08:44.660 and then talk about 0:08:44.660,0:08:47.190 how we do that. 0:08:47.190,0:08:51.740 We'll talk about jails. This is something that you currently find only in FreeBSD; 0:08:51.740,0:08:55.060 it hasn't made it into 0:08:55.060,0:08:56.320 Linux yet, although 0:08:56.320,0:09:01.630 the concept is being actively worked on, so my guess is that you'll see that 0:09:01.630,0:09:03.500 fairly soon. 0:09:03.500,0:09:06.360 We'll also then talk about scheduling, 0:09:06.360,0:09:10.579 which is in essence how we decide what gets to run, when it gets to run, how long it gets 0:09:10.579,0:09:13.500 to run, etc. 0:09:13.500,0:09:14.330 Okay. 0:09:14.330,0:09:19.020 The week after that we will go into virtual memory. 0:09:19.020,0:09:23.800 Signals aren't really part of virtual memory, but they didn't fit into next week's 0:09:23.800,0:09:26.400 material, so I just dropped them in at the beginning, 0:09:26.400,0:09:29.850 but the bulk of Week 3 is going to be 0:09:29.850,0:09:32.019 the management of virtual memory.
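To make "one address space with multiple threads running in it" concrete, here is a minimal sketch assuming POSIX threads (compile with `-pthread`). Every thread increments the same global counter because all of them share the process's single address space; a mutex serializes the updates. The names `NTHREADS`, `ITERATIONS`, `worker` and `run_threads` are hypothetical, chosen only for illustration.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS   4
#define ITERATIONS 1000

/* One copy of this variable exists in the process's address space,
 * and every thread in the process sees that same copy. */
static long counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&counter_lock);   /* serialize the shared update */
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Run NTHREADS threads inside this single process and return the final
 * counter value, or -1 if any thread could not be created. */
long run_threads(void)
{
    pthread_t tids[NTHREADS];
    int created = 0;

    counter = 0;
    while (created < NTHREADS &&
           pthread_create(&tids[created], NULL, worker, NULL) == 0)
        created++;
    /* Wait for every thread that was actually started. */
    for (int i = 0; i < created; i++)
        pthread_join(tids[i], NULL);
    return created == NTHREADS ? counter : -1;
}
```

With these constants a successful run returns 4 × 1000 = 4000; separate processes created with `fork()` would each get their own copy of `counter` instead.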
So we've got 0:09:32.019,0:09:35.119 a bunch of physical memory and a bunch of processes that are 0:09:35.119,0:09:37.940 trying to use their address spaces, 0:09:37.940,0:09:39.590 and we will talk about 0:09:39.590,0:09:41.410 essentially how we make that all work. 0:09:41.410,0:09:43.510 It's called virtual memory because it's 0:09:43.510,0:09:47.420 sort of a cheat. We promise you the world and then we deliver you 0:09:47.420,0:09:51.480 as small a number of pages as we think we can get away with. 0:09:51.480,0:09:56.420 Okay. So the first three weeks then essentially get us through 0:09:56.420,0:09:58.340 looking at the world as if it was 0:09:58.340,0:10:00.560 all about processes. 0:10:00.560,0:10:03.880 Then in Week 4 we change gears. We say, okay, well, you know, 0:10:03.880,0:10:07.570 the kernel isn't just all about processes. You can sort of look at it orthogonally and you can 0:10:07.570,0:10:10.000 say it's really just a giant I/O switch; 0:10:10.000,0:10:12.910 it's just like a traffic cop that's managing these 0:10:12.910,0:10:14.860 I/O streams, 0:10:14.860,0:10:15.450 and 0:10:15.450,0:10:18.610 so let's look at it from that perspective. 0:10:18.610,0:10:19.310 And 0:10:19.310,0:10:24.740 we'll start with special files. Again, this is sort of the interface: when you talk about UNIX 0:10:24.740,0:10:25.880 systems, when you talk about 0:10:25.880,0:10:27.950 what's normally the /dev 0:10:27.950,0:10:34.170 interface that gets you access to the various I/O streams that are available, 0:10:34.170,0:10:37.220 and we'll look at how that's organized and the structure of it, 0:10:37.220,0:10:41.840 which used to be fairly simple but in the last decade has gotten 0:10:41.840,0:10:43.670 incredibly complicated.
0:10:43.670,0:10:48.540 We will also talk about pseudo-terminals and job control. 0:10:48.540,0:10:53.330 This is about as interesting as watching grass grow, but unfortunately it's 0:10:53.330,0:10:55.490 a major component of the system, 0:10:55.490,0:10:59.520 and especially people that deal with system administration have to know far more about 0:10:59.520,0:11:06.520 this than they probably ever thought they wanted to. 0:11:06.900,0:11:11.430 Okay, we will then continue in Week 5 with the kernel I/O structure. 0:11:11.430,0:11:16.090 We will start with multiplexing of I/O. The kernel of course has 0:11:16.090,0:11:17.360 always done this, 0:11:17.360,0:11:22.110 but we're really talking more about how we export I/O multiplexing to 0:11:22.110,0:11:25.970 user applications. 0:11:25.970,0:11:29.250 We will then move into the autoconfiguration strategy. 0:11:29.250,0:11:31.370 Autoconfiguration 0:11:31.370,0:11:32.770 is what happens, 0:11:32.770,0:11:36.619 typically, or historically I guess you could say, as the system boots, 0:11:36.619,0:11:39.500 so all that stuff that comes out about 0:11:39.500,0:11:40.810 what 0:11:40.810,0:11:43.550 hardware is on the machine and how it's all interconnected, 0:11:43.550,0:11:47.350 all of that is tied up in autoconfiguration, 0:11:47.350,0:11:50.040 and that used to happen just once, when the system booted, 0:11:50.040,0:11:52.000 but in modern systems today 0:11:52.000,0:11:55.839 it's an ongoing process. It happens at boot, but it also happens 0:11:55.839,0:12:00.550 any time you plug in a new I/O device such as a PCMCIA card, 0:12:00.550,0:12:03.680 remove a disk, put in a new disk, 0:12:03.680,0:12:07.010 or do any sort of activity that changes the I/O 0:12:07.010,0:12:08.360 structure of the machine; 0:12:08.360,0:12:10.870 autoconfiguration has to get fired back up 0:12:10.870,0:12:13.050 and figure out what's disappeared 0:12:13.050,0:12:18.330 and clean up, and figure out what new has arrived so it can configure it in.
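One way the kernel exports I/O multiplexing to user applications is the POSIX `poll()` call (`select()` is the older interface for the same job). The following is a minimal sketch with a hypothetical helper `first_readable` and an arbitrary limit of 16 descriptors. A single call waits on several descriptors at once and reports which one is ready.

```c
#include <poll.h>
#include <unistd.h>

#define MAX_FDS 16

/* Hypothetical helper: wait up to timeout_ms for any of nfds descriptors
 * to become readable. Returns the index of the first readable one,
 * -1 on timeout, or -2 on error or bad arguments. */
int first_readable(const int *fds, int nfds, int timeout_ms)
{
    struct pollfd pfds[MAX_FDS];

    if (nfds <= 0 || nfds > MAX_FDS)
        return -2;
    for (int i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;   /* interested in "data to read" */
        pfds[i].revents = 0;
    }
    /* One system call sleeps on all the descriptors at once. */
    if (poll(pfds, (nfds_t)nfds, timeout_ms) < 0)
        return -2;
    for (int i = 0; i < nfds; i++)
        if (pfds[i].revents & POLLIN)
            return i;
    return -1;
}
```

A usage sketch: create two pipes, pass both read ends in, and write a byte into the second pipe. The call then returns index 1 without the application having to block on either descriptor individually.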
0:12:18.330,0:12:19.320 And then we'll talk 0:12:19.320,0:12:23.870 a little bit about the configuration of the device driver. 0:12:23.870,0:12:27.390 This actually gets into an area that 0:12:27.390,0:12:28.660 is... 0:12:28.660,0:12:33.440 well, let me just give this as a bit of advice to the class, especially those of 0:12:33.440,0:12:36.780 you who work in system administration. 0:12:36.780,0:12:42.010 You really want to be careful that you don't learn too much about device drivers, 0:12:42.010,0:12:44.670 because there are really three things that 0:12:44.670,0:12:48.580 it's not good to learn about, and if you do learn about them it's really good to keep it 0:12:48.580,0:12:49.740 to yourself, 0:12:49.740,0:12:51.949 because if you become an expert, or are 0:12:51.949,0:12:54.960 viewed as an expert, in any of these areas 0:12:54.960,0:12:59.370 you will become the designated stuckee for that at your site, and you'll never get to do 0:12:59.370,0:13:01.760 anything 0:13:01.760,0:13:02.610 but that. 0:13:02.610,0:13:07.360 So the three things that I highly recommend not learning very much about are 0:13:07.360,0:13:09.060 device drivers, 0:13:09.060,0:13:12.320 sendmail configuration files, 0:13:12.320,0:13:13.970 and anything having to do 0:13:13.970,0:13:19.350 with LDAP or anything in that general domain, 0:13:19.350,0:13:22.660 because, as I say, 0:13:22.660,0:13:24.900 that will become your life's work, 0:13:24.900,0:13:25.920 and 0:13:25.920,0:13:32.920 there are other things that you might find more interesting.
"Do you have a question?" 0:13:33.870,0:13:36.659 So one of my students empathizes with my point. 0:13:36.659,0:13:39.640 I believe you said you worked on that mail system, 0:13:39.640,0:13:43.120 so you might know something about sendmail configuration files, but you don't 0:13:43.120,0:13:47.850 have to answer that. 0:13:47.850,0:13:52.100 Okay, so we're going to talk about what a device driver does and really just sort of the entry 0:13:52.100,0:13:53.170 points to it, 0:13:53.170,0:13:57.180 but we're not going to talk about how you write such a thing, how you debug such a thing, 0:13:57.180,0:14:01.490 or much of anything about it. I actually used to teach an entire class, believe it or not, 0:14:01.490,0:14:02.720 about device drivers, 0:14:02.720,0:14:05.849 but then I realized the error of my ways and I have since 0:14:05.849,0:14:12.580 gone through and made a point of forgetting every slide in that talk. 0:14:12.580,0:14:16.860 Okay, so then we will move on to the filesystem, 0:14:16.860,0:14:21.540 and as always we'll start at the high level and talk about the interface: what is it that is 0:14:21.540,0:14:23.020 exported out of the system? 0:14:23.020,0:14:27.840 And then we will start diving down into the C code and how we go about implementing that. 0:14:27.840,0:14:29.010 So 0:14:29.010,0:14:31.010 we'll start with the 0:14:31.010,0:14:32.560 so-called 0:14:32.560,0:14:33.680 block I/O system. 0:14:33.680,0:14:36.140 It's historically been called the buffer cache, 0:14:36.140,0:14:38.590 and you still hear it called that periodically, 0:14:38.590,0:14:42.720 but the fact of the matter is that there isn't really a buffer cache anymore; there is just one big 0:14:42.720,0:14:44.620 cache, and it's the VM cache. 0:14:44.620,0:14:47.810 The filesystem has a view into it and 0:14:47.810,0:14:50.829 the processes have a view into it, but at the end of the day 0:14:50.829,0:14:54.660 you really don't want the same information on two different 0:14:54.660,0:14:56.030 pages of
memory, 0:14:56.030,0:14:59.390 because that just leads to trouble. 0:14:59.390,0:15:03.390 But filesystems think they have buffers, and so there's this maneuver where we make 0:15:03.390,0:15:06.149 these things that look like what historically were buffers 0:15:06.149,0:15:08.830 but really just map into the VM system, 0:15:08.830,0:15:11.720 and they're still managed in the way that they have been 0:15:11.720,0:15:15.020 managed historically. 0:15:15.020,0:15:20.670 Okay, we will then get down into filesystem implementation, the local filesystem if you will, 0:15:20.670,0:15:23.400 and also into 0:15:23.400,0:15:25.730 soft updates and snapshots. 0:15:25.730,0:15:26.440 This, 0:15:26.440,0:15:31.100 for the time being, is something that you see only in FreeBSD. 0:15:31.100,0:15:35.310 The alternative to soft updates is journaling, which is more commonly used; 0:15:35.310,0:15:39.630 for example, it's what is used by ext3, 0:15:39.630,0:15:41.179 and so I'll go through soft updates, and 0:15:41.179,0:15:45.260 a lot of the issues in soft updates are the same issues that you have to deal with in journaling: 0:15:45.260,0:15:48.370 what is it that we're protecting and how do we go about doing that? 0:15:48.370,0:15:51.150 The difference is in the details. 0:15:51.150,0:15:54.630 There is actually a paper in the back of your notes, if this is something that interests 0:15:54.630,0:15:55.240 you; 0:15:55.240,0:15:59.930 it's a comparison of journaling versus soft updates that was done 0:15:59.930,0:16:02.120 about five or eight years ago.
0:16:02.120,0:16:08.460 And, not to spoil the punch line, but the answer is that they both work about the same. 0:16:08.460,0:16:12.500 Okay, snapshots again: if 0:16:12.500,0:16:15.920 you've worked with things like the Network Appliance box, you're probably quite 0:16:15.920,0:16:19.640 aware of what snapshots are and how they do or don't work for you. 0:16:19.640,0:16:21.959 This is the same functionality 0:16:21.959,0:16:27.380 in the filesystem, implemented in a somewhat different way. 0:16:27.380,0:16:28.449 Okay, so this 0:16:28.449,0:16:31.940 Week 6 is really going to be the local filesystem, 0:16:31.940,0:16:34.750 the disk connected to the machine that we are dealing with. 0:16:34.750,0:16:39.140 In Week 7 we then get into multiple filesystem support: how do we abstract out that 0:16:39.140,0:16:41.190 filesystem layer 0:16:41.190,0:16:46.430 and support multiple filesystems at the same time? So, for example, in FreeBSD 0:16:46.430,0:16:50.199 you can of course run with the traditional Fast Filesystem, 0:16:50.199,0:16:54.540 but if you happen to like the Linux filesystem better, or you have to share a disk 0:16:54.540,0:16:55.690 with a Linux machine, 0:16:55.690,0:16:58.310 you can run ext2 or ext3 0:16:58.310,0:17:01.020 and it will perfectly happily do that, 0:17:01.020,0:17:01.620 so 0:17:01.620,0:17:05.589 we will have to look at how we provide an interface so that we can plug in all these different 0:17:05.589,0:17:09.260 filesystems that we want to support. 0:17:09.260,0:17:12.250 Another area in which there's been a great 0:17:12.250,0:17:15.309 deal of growth, at least in code complexity, 0:17:15.309,0:17:17.840 is so-called volume management. 0:17:17.840,0:17:19.370 In the 0:17:19.370,0:17:24.480 good old days a filesystem lived on a disk or piece of disk and that was that, 0:17:24.480,0:17:26.130 but in this day and age 0:17:26.130,0:17:31.150 that won't do any more, so we aggregate disks together by striping them or RAID
0:17:31.150,0:17:31.980 arraying them 0:17:31.980,0:17:33.380 or various other things, 0:17:33.380,0:17:39.210 and we need a whole layer in the system just to manage those disks. 0:17:39.210,0:17:44.280 Then, as an example of an alternative filesystem, we're going to talk about the 0:17:44.280,0:17:46.530 Network Filesystem, or NFS, 0:17:46.530,0:17:48.500 but that's not because this is 0:17:48.500,0:17:51.090 the world's best remote filesystem, 0:17:51.090,0:17:55.240 or the cleanest design, or any of the properties you might hope that 0:17:55.240,0:17:57.049 an example in a class such as this one would have; 0:17:57.049,0:17:58.600 it's because it's ubiquitous, 0:17:58.600,0:18:00.210 very widely used, 0:18:00.210,0:18:01.350 and 0:18:01.350,0:18:06.850 so we're going to talk about that one. 0:18:06.850,0:18:07.740 Okay, we'll 0:18:07.740,0:18:10.970 then once again switch gears in Week 8 0:18:10.970,0:18:17.120 and turn our attention to networking and interprocess communication, 0:18:17.120,0:18:18.200 and 0:18:18.200,0:18:23.210 again we'll start from the very top, so we'll go through the concepts and the terminology 0:18:23.210,0:18:24.450 that gets used, 0:18:24.450,0:18:30.230 and what's the difference between domain-based addressing and an address domain, you know. 0:18:30.230,0:18:30.910 We'll go through 0:18:30.910,0:18:34.910 what the basic IPC services are, 0:18:34.910,0:18:39.080 essentially what are all the system calls that have anything to do with networking, 0:18:39.080,0:18:40.590 and 0:18:40.590,0:18:43.720 just sort of describe what each of them is, and I'm going to go through 0:18:43.720,0:18:45.830 a somewhat contrived example 0:18:45.830,0:18:49.840 that makes use of every one of those interfaces, 0:18:49.840,0:18:52.860 just to sort of show how they all connect together. 0:18:52.860,0:18:54.169 And for those of you that work 0:18:54.169,0:18:57.400 in networking or have done any kind of network programming, 0:18:57.400,0:19:00.480 if you're looking for a
week to miss, Week 8 is the one to miss, because that's 0:19:00.480,0:19:02.780 the most basic 0:19:02.780,0:19:04.210 lecture that I'm going to give. 0:19:04.210,0:19:07.910 If you are not sure whether or not you need to go through that, there is 0:19:07.910,0:19:09.540 one of the papers in the back, 0:19:09.540,0:19:12.620 an introduction to interprocess communication; 0:19:12.620,0:19:18.279 read that paper, and if you say "yeah, yeah, yeah, yeah, yeah," you are done with Week 8. 0:19:18.279,0:19:20.590 On the other hand, if you don't come to Week 8 0:19:20.590,0:19:22.790 and then in Week 9 0:19:22.790,0:19:26.860 I call on you and say, "All right, what is it 0:19:26.860,0:19:30.560 that the listen system call does?" and you can't tell me, 0:19:30.560,0:19:32.610 you're going to get a demerit. 0:19:32.610,0:19:34.340 Okay, 0:19:34.340,0:19:37.770 then in Week 9 we will get into the actual 0:19:37.770,0:19:41.419 networking implementation itself. We'll go through the system layers as we did 0:19:41.419,0:19:43.310 in all the other areas, 0:19:43.310,0:19:44.130 and 0:19:44.130,0:19:48.330 we will spend a significant portion of that class talking about routing. 0:19:48.330,0:19:50.230 Routing, 0:19:50.230,0:19:53.610 for those of you that haven't had the pleasure of dealing with it, 0:19:53.610,0:19:55.540 is a black art, 0:19:55.540,0:19:58.050 or at least a dark science, 0:19:58.050,0:19:59.170 and 0:19:59.170,0:19:59.930 so 0:19:59.930,0:20:02.490 we'll talk about it 0:20:02.490,0:20:06.270 from the perspective, first of all, of what we do locally within the machine, 0:20:06.270,0:20:10.090 and then what are some of the bigger strategies that we can use for doing routing: 0:20:10.090,0:20:11.910 enterprise 0:20:11.910,0:20:14.840 wide routing, or 0:20:14.840,0:20:20.190 area-wide routing, something like throughout the state of California or throughout the US or whatever. 0:20:20.190,0:20:25.379 This again, like device drivers, is really just sort of a nickel 0:20:25.379,0:20:26.480
tour through the 0:20:27.800,0:20:31.820 what the choices are and what the basic strategies are that are used. 0:20:31.820,0:20:33.989 If you're thinking you're going to walk out of here 0:20:33.989,0:20:36.110 knowing how to set up routing, well, sorry, 0:20:36.110,0:20:38.430 we are not going to get that far, 0:20:38.430,0:20:41.559 but you should at least have a pretty good idea of what the issues are 0:20:41.559,0:20:44.430 and what the general solutions are. 0:20:44.430,0:20:48.950 Okay, then finally in Week 10, well, not finally, but next, 0:20:48.950,0:20:52.380 we will go through the Internet protocols, 0:20:52.380,0:20:54.320 primarily TCP/IP, 0:20:54.320,0:20:56.560 and this is 0:20:56.560,0:20:58.809 about the algorithms that are used, 0:20:58.809,0:21:01.030 and I'm putting a particular emphasis, 0:21:01.030,0:21:03.050 for this particular class, 0:21:03.050,0:21:05.080 on 0:21:05.080,0:21:07.730 changes that have been made in the protocols 0:21:07.730,0:21:14.310 to deal with a lot of the sort of attacks that we've been seeing, such as SYN attacks and 0:21:14.310,0:21:16.880 that sort of thing, 0:21:16.880,0:21:19.440 rather than just a straight 0:21:19.440,0:21:22.440 reiteration of what the actual protocols are. 0:21:22.440,0:21:24.940 I'll talk primarily about IPv4, 0:21:24.940,0:21:31.940 but I will also try and talk a bit about IPv6 as well. 0:21:33.510,0:21:35.850 All right, so the first ten weeks are 0:21:35.850,0:21:38.100 sort of the kernel course; 0:21:38.100,0:21:40.800 now we tack on two weeks at the end 0:21:40.800,0:21:42.010 to talk about 0:21:42.010,0:21:43.990 sort of the bigger picture of 0:21:43.990,0:21:48.240 system tuning, crash dump analysis, that level of thing. 0:21:48.240,0:21:52.940 The idea is to really consolidate what we figured out or talked about in the first 0:21:52.940,0:21:54.710 ten weeks and 0:21:54.710,0:21:58.760 how that applies to tools that we have available to us to 0:21:58.760,0:22:00.760 look at what the system is
doing, 0:22:00.760,0:22:02.649 analyze what the system is doing, 0:22:02.649,0:22:03.650 and hopefully 0:22:03.650,0:22:04.720 improve 0:22:04.720,0:22:07.130 the performance of what the system is doing, 0:22:07.130,0:22:07.750 and 0:22:07.750,0:22:12.169 for the most part the kind of tuning that I'm talking about is not 0:22:12.169,0:22:14.740 going in and hack-hack-hacking your kernel, 0:22:14.740,0:22:16.510 because the fact of the matter is 0:22:16.510,0:22:18.600 most of the time you can't do that anyway, 0:22:18.600,0:22:22.340 so it's more looking at it from the perspective of saying: 0:22:22.340,0:22:26.390 is this system running badly because it doesn't have enough memory on it? 0:22:26.390,0:22:29.470 Or is it running badly because there isn't enough I/O capacity? 0:22:29.470,0:22:33.549 Or is it running badly because it's got enough I/O capacity but 0:22:33.549,0:22:35.940 certain drives are being overloaded? 0:22:35.940,0:22:37.309 Or is it 0:22:37.309,0:22:42.220 being overrun because we're simply trying to do too much on this machine? And so on.
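Returning to the Week 9 question of what the listen system call does, here is a hedged C sketch of the usual server-side sequence, assuming POSIX sockets. `socket()` creates the endpoint, `bind()` gives it a local address, and `listen()` marks it as passive so that the kernel queues incoming connections (`backlog` is a hint for that queue's length) for a later `accept()`. The helper name `make_listener`, the loopback address, and the use of port 0 (let the kernel choose an ephemeral port) are illustrative assumptions.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a listening TCP socket on 127.0.0.1 with a
 * kernel-chosen port. Stores the port in *port_out and returns the
 * descriptor, or returns -1 on failure. */
int make_listener(int backlog, unsigned short *port_out)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);            /* 0: kernel picks a free port */

    /* bind() names the socket; listen() turns it into a passive socket
     * that accepts connections; getsockname() reports the chosen port. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }
    *port_out = ntohs(addr.sin_port);
    return fd;
}
```

A client would then `connect()` to that port, and the server would pick the queued connection off with `accept()` on the returned descriptor.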
0:22:42.220,0:22:45.440 So that's the sort of level of thing that we're looking at, 0:22:45.440,0:22:47.080 but tied into 0:22:47.080,0:22:52.130 a lot of concepts that we talked about before, so we can talk about active virtual memory 0:22:52.130,0:22:53.710 and what that means, 0:22:53.710,0:22:55.120 and 0:22:55.120,0:22:58.750 essentially measure what it is, and hopefully then you will understand, in the context of what 0:22:58.750,0:23:00.690 we talked about in the VM section, 0:23:00.690,0:23:03.990 what that really means. 0:23:03.990,0:23:07.460 Crash dump analysis is one of these topics that 0:23:07.460,0:23:08.730 you are going to love or hate. 0:23:08.730,0:23:12.530 If you actually have to deal with crash dumps, 0:23:12.530,0:23:13.679 you'll find it invaluable, 0:23:13.679,0:23:15.580 and if you don't have to deal with crash dumps, 0:23:15.580,0:23:18.790 it's an incredible mass of boring detail. 0:23:18.790,0:23:23.240 The only good part of it is that the whole session is only about an hour long. 0:23:23.240,0:23:25.529 If it interests you, listen closely, 0:23:25.529,0:23:28.950 and if it bores you, well, it's only an hour long. 0:23:28.950,0:23:32.880 Okay, lastly we'll talk a little bit about security issues. 0:23:32.880,0:23:36.250 Again, this is really more about the tools that are available 0:23:36.250,0:23:40.750 to deal with security stuff, as opposed to a complete tutorial on 0:23:40.750,0:23:45.120 how to implement security, so for those of you that deal with security 0:23:45.120,0:23:48.400 this is just going to be sort of Security 101. 0:23:48.400,0:23:50.029 For those of you 0:23:50.029,0:23:51.500 that 0:23:51.500,0:23:54.399 have to deal with it but haven't really thought about it, 0:23:54.399,0:23:58.549 it'll probably scare you to death, and you'll wonder how to keep the machines from 0:23:58.549,0:24:02.840 being hijacked every day. 0:24:02.840,0:24:08.030 Okay, so that's in essence what we're going to try and do here. 0:24:08.030,0:24:15.030
Anybody have any comments, questions, thoughts? No? All right, well, 0:24:16.130,0:24:17.840 let's get started. 0:24:17.840,0:24:22.180 We will begin on page fifteen with an overview of the kernel. 0:24:22.180,0:24:26.040 Hopefully nobody's lost yet. 0:24:26.040,0:24:29.310 What's a kernel? All right, 0:24:29.310,0:24:31.370 so starting at the very top, 0:24:31.370,0:24:33.070 the big broad brush, 0:24:33.070,0:24:35.140 what we have is 0:24:35.140,0:24:38.330 a UNIX virtual machine, and 0:24:38.330,0:24:41.660 virtual machines are actually something that has been around 0:24:41.660,0:24:44.539 as a concept since the sixties. 0:24:44.539,0:24:48.919 The difference is really just the level of the interface that people have dealt with 0:24:48.919,0:24:51.360 when they talk about virtual machines. 0:24:51.360,0:24:53.610 In the 1960s, 0:24:53.610,0:24:56.770 computers were these enormous things. 0:24:56.770,0:24:58.870 Your computer room would be something that'd be 0:24:58.870,0:25:01.909 three times the size of this conference room, if you had 0:25:01.909,0:25:03.230 a computer. 0:25:03.230,0:25:05.530 The computer itself was 0:25:05.530,0:25:07.840 as tall as a refrigerator-freezer; 0:25:07.840,0:25:08.950 imagine 0:25:08.950,0:25:13.909 five or eight or ten of these units side by side, and that itself made up the computer. 0:25:13.909,0:25:16.080 There would be one big one 0:25:16.080,0:25:20.030 for the core processor, and one which would be the floating point unit, and several 0:25:20.030,0:25:24.080 of them that would be the memory, the core memory, literally the core memory, 0:25:24.080,0:25:29.110 and then there'd be other rows of these disk drives, which were about the size of a washing 0:25:29.110,0:25:29.660 machine, 0:25:29.660,0:25:34.169 and then behind that, since you couldn't store everything on disks, 0:25:34.169,0:25:36.300 you had rows of tape drives, 0:25:36.300,0:25:37.880 and then you had this little 0:25:37.880,0:25:39.610 set of sort of
0:25:39.610,0:25:43.330 munchkins that would run around and tend to the machine, and they'd mount tapes and take 0:25:43.330,0:25:46.710 off tapes and mount disk packs and remove disk packs, because 0:25:46.710,0:25:49.760 the drives themselves were very expensive, and so 0:25:49.760,0:25:53.110 you wouldn't, as we do today, have 0:25:53.110,0:25:56.090 one spindle that was dedicated just to one set of platters; 0:25:56.090,0:25:57.130 you could take out a 0:25:57.130,0:25:59.460 set of platters and put in another 0:25:59.460,0:26:02.540 hundred-megabyte set of platters, and these are platters that are 0:26:02.540,0:26:05.280 this big around, and there's like six or eight of them, and 0:26:05.280,0:26:09.140 giant head assemblies that come rumbling in and out. 0:26:09.140,0:26:12.440 Anyway, one of these giant, giant machines 0:26:12.440,0:26:17.380 that cost many millions of dollars would run at about ten 0:26:17.380,0:26:21.120 million instructions per second, 10 MIPS, 0:26:21.120,0:26:21.630 and 10 MIPS 0:26:21.630,0:26:28.330 was more computing power than anybody could possibly imagine using in a single application. 0:26:28.330,0:26:28.880 Just 0:26:28.880,0:26:31.050 by contrast, you know, this 0:26:31.050,0:26:34.070 four-year-old laptop here is probably on the order of 0:26:34.070,0:26:36.440 one or two hundred MIPS. 0:26:36.440,0:26:37.140 But anyway, 0:26:37.140,0:26:40.760 people couldn't really envision what we would do with a lot of computing power, 0:26:40.760,0:26:44.640 and the other thing was that you didn't have a notion of an operating system that had 0:26:44.640,0:26:45.890 applications running on it, 0:26:45.890,0:26:46.760 because 0:26:46.760,0:26:50.160 everybody wanted to write straight to the raw hardware. 0:26:50.160,0:26:51.750 And so 0:26:51.750,0:26:55.900 what IBM, who was a big manufacturer of machines in those days, 0:26:55.900,0:26:59.060 did was come up with this thing called VM, 0:26:59.060,0:27:00.770 and this was a little thing
0:27:00.770,0:27:02.549 you'd hardly call an operating system, really, 0:27:02.549,0:27:05.130 but what it did is it cloned 0:27:05.130,0:27:09.270 independent copies of the machine that worked just like the original machine, so you could boot 0:27:09.270,0:27:11.769 something that you thought was an operating system 0:27:11.769,0:27:13.380 on top of VM. 0:27:13.380,0:27:16.750 So you'd take one of these ten-MIPS machines, and it would clone 0:27:16.750,0:27:20.050 six identical one-MIPS copies, 0:27:20.050,0:27:22.030 and then you could boot 0:27:22.030,0:27:24.700 whatever you wanted on each one of those machines. So 0:27:24.700,0:27:29.510 if you were doing database stuff, you would boot your database, because databases ran on the raw hardware, 0:27:29.510,0:27:32.920 or if you're doing payroll, you would boot up the payroll program, 0:27:32.920,0:27:37.950 or if you actually tried to service users, you could boot a batch system 0:27:37.950,0:27:40.790 that would read card images and print stuff out, 0:27:40.790,0:27:44.460 or they even had TSO, the Time Sharing Option, where you could interactively sit 0:27:44.460,0:27:45.559 and type and send 0:27:45.559,0:27:47.560 stuff in and get answers back, 0:27:47.560,0:27:48.570 and 0:27:48.570,0:27:51.429 you could also boot TSO. So whatever set of 0:27:51.429,0:27:52.219 0:27:52.219,0:27:55.339 things you needed, you could boot them, and they ran independently as if they were running on their 0:27:55.339,0:27:56.470 own machine, 0:27:56.470,0:28:03.150 but all VM did was give you an exact raw copy of the hardware. 0:28:03.150,0:28:04.529 So when UNIX came along, 0:28:04.529,0:28:07.350 they sort of liked the notion of 0:28:07.350,0:28:11.509 providing independent environments that you could operate in, 0:28:11.509,0:28:13.610 but they wanted it at a higher level, 0:28:13.610,0:28:15.610 so they were really looking to do it, 0:28:15.610,0:28:17.480 instead of at the raw hardware level, 0:28:17.480,0:28:19.679 at a
process level, 0:28:19.679,0:28:23.799 and the idea then was that the interface you would program to would be what we think of as 0:28:23.799,0:28:26.090 the system call interface today, 0:28:26.090,0:28:27.849 and the idea then was that 0:28:27.849,0:28:30.740 you would be given a process or set of processes, 0:28:30.740,0:28:34.990 and those were independent. Your process couldn't affect 0:28:34.990,0:28:38.830 the address space of another process. You couldn't reach over and mess around with their addresses, 0:28:38.830,0:28:41.030 you couldn't mess around with their I/O channels. 0:28:41.030,0:28:43.179 You could slow them down by 0:28:43.179,0:28:44.299 being a pig, but 0:28:44.299,0:28:47.980 that was about the only way that you could affect other processes. 0:28:47.980,0:28:48.480 And 0:28:48.480,0:28:49.830 so 0:28:49.830,0:28:52.669 the interface that they had there 0:28:52.669,0:28:58.660 was one that had these characteristics. It had a paged virtual address space, 0:28:58.660,0:29:02.980 so you didn't have to know, as in the old days, how much physical memory is on the machine and make your application 0:29:02.980,0:29:04.740 fit into that amount of memory. 0:29:04.740,0:29:07.950 You just had what looked like a large 0:29:07.950,0:29:11.710 uniform address space; even if the underlying hardware had segments or some other 0:29:11.710,0:29:13.580 hardware brain damage, 0:29:13.580,0:29:17.390 it looked to you like you just had a big uniform address space, and 0:29:17.390,0:29:21.070 the size of your address space was independent of the amount of memory that was on your machine. 0:29:21.070,0:29:23.900 Your address space could be bigger than the amount of physical memory, 0:29:23.900,0:29:26.499 'cause we sort of move pages around underneath 0:29:26.499,0:29:29.320 whatever part of the address space was actually active, 0:29:29.320,0:29:34.260 and there are obviously limits to this. If you are trying to run a 1-gigabyte 0:29:34.260,0:29:35.630 application on top of
0:29:35.630,0:29:37.240 ten megabytes of memory, 0:29:37.240,0:29:40.880 it's probably going to bring new meaning to same-day service, 0:29:40.880,0:29:45.519 but if you're willing to wait long enough, it will eventually move the pages around and you will 0:29:45.519,0:29:49.740 progress through getting your application run. 0:29:49.740,0:29:53.890 Another thing was dealing with software interrupts. 0:29:53.890,0:29:55.789 In the old days 0:29:55.789,0:29:58.749 you had to understand how the hardware worked 0:29:58.749,0:30:03.900 in order to deal with exceptional conditions. So for example, if you did a divide by zero, 0:30:03.900,0:30:08.170 the hardware would jump through some vector location or 0:30:08.170,0:30:08.630 something, 0:30:08.630,0:30:12.799 and you had to know how that worked and make sure that you had your program, 0:30:12.799,0:30:16.510 usually some little bit of assembly language, set up to deal with that. 0:30:16.510,0:30:19.870 And UNIX said, let's get away from the hardware here, 0:30:19.870,0:30:22.080 and so they did this thing called signals. 0:30:22.080,0:30:25.700 They just defined a set of signals, so that if you do a divide by zero, 0:30:25.700,0:30:29.529 you simply register a routine you want to have called. You don't have to know 0:30:29.529,0:30:31.220 how the hardware figured it out; 0:30:31.220,0:30:36.740 you just know that that routine is going to get called, and you can deal with it at that point. 0:30:36.740,0:30:40.960 Well, we've got a set of timers and counters to keep track of what we're doing. This is really more 0:30:40.960,0:30:43.490 for accounting than anything else, but 0:30:43.490,0:30:46.970 applications may want to have access to that. 0:30:46.970,0:30:51.720 We have a set of identifiers that we're going to use for things like accounting, 0:30:51.720,0:30:54.830 protection, scheduling and so on, 0:30:54.830,0:30:55.820 and one of 0:30:55.820,0:31:00.320 the early philosophies of UNIX was to try and keep it simple.
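The signal mechanism described above can be sketched in a few lines. This is a minimal, hypothetical example (the function deliver_fpe_to_handler is not from the lecture) using Python's signal module, which wraps the UNIX signal interface, and it assumes a POSIX system and Python 3.8 or later. It registers a routine for SIGFPE, the signal traditionally delivered for an arithmetic fault such as an integer divide by zero, and then sends that signal to its own process with signal.raise_signal. It deliberately does not divide by zero, because Python reports 1 / 0 as a ZeroDivisionError exception without the process ever receiving SIGFPE.

```python
import signal


def deliver_fpe_to_handler():
    """Register a SIGFPE handler, send SIGFPE to this process, and return
    the list of signal numbers the handler was called with."""
    seen = []

    def on_fpe(signum, frame):
        # The routine only learns which signal arrived; it does not need
        # to know how the hardware detected the condition.
        seen.append(signum)

    previous = signal.signal(signal.SIGFPE, on_fpe)  # register the routine
    if previous is None:  # old handler was not installed from Python
        previous = signal.SIG_DFL
    try:
        signal.raise_signal(signal.SIGFPE)  # deliver SIGFPE to ourselves
        # CPython runs Python-level handlers between bytecodes, so give the
        # interpreter a chance to call on_fpe before restoring the old handler.
        for _ in range(1000):
            if seen:
                break
    finally:
        signal.signal(signal.SIGFPE, previous)  # put the old disposition back
    return seen
```

Calling deliver_fpe_to_handler() returns a one-element list holding the number of SIGFPE. The handler is registered once and then simply gets called; nothing in it depends on how the condition was raised.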
0:31:00.320,0:31:02.630 Operating systems had gotten very baroque; 0:31:02.630,0:31:04.490 in particular, the thing that 0:31:04.490,0:31:07.350 predated UNIX was a thing called Multics. 0:31:07.350,0:31:12.820 Multics was a joint project between Honeywell, a big computer manufacturer of the 0:31:12.820,0:31:15.740 time, 0:31:15.740,0:31:17.129 AT&T Bell Laboratories, 0:31:17.129,0:31:19.750 the big industrial laboratory at that time, 0:31:19.750,0:31:21.380 and MIT, 0:31:21.380,0:31:23.430 a big university then and 0:31:23.430,0:31:24.690 still today, 0:31:24.690,0:31:29.259 and those three organizations got together to try and build this 0:31:29.259,0:31:31.400 time-sharing operating system, 0:31:31.400,0:31:32.280 and 0:31:32.280,0:31:33.770 it just got bigger and 0:31:33.770,0:31:37.160 more grandiose and more complex and never finished, 0:31:37.160,0:31:38.979 because as soon as they'd sort of see, 0:31:38.979,0:31:42.709 oh, we know how to do that, but we could do this other thing too, then they would tear it 0:31:42.709,0:31:43.429 apart, and 0:31:43.429,0:31:46.440 they never really got to something that 0:31:46.440,0:31:48.210 could be put into production. 0:31:48.210,0:31:49.919 And so 0:31:49.919,0:31:50.570 AT&T 0:31:50.570,0:31:54.340 Bell Laboratories decided to pull out of that project, 0:31:54.340,0:31:55.940 and 0:31:55.940,0:32:00.000 two of the people that had been working on that project, Ken Thompson and Dennis Ritchie, 0:32:00.000,0:32:04.390 were sort of bummed, because they were now back to typing cards and putting them through 0:32:04.390,0:32:05.259 card readers, and 0:32:05.259,0:32:07.960 they had gotten used to the idea that you could actually 0:32:07.960,0:32:11.559 sit at an ASR-33 Teletype and interact with your computer. 0:32:11.559,0:32:13.440 And so 0:32:13.440,0:32:18.230 they found an old PDP-7 sitting off in the corner that had been abandoned 0:32:18.230,0:32:22.120 and started working on this little tiny operating system
which they called UNIX, 0:32:22.120,0:32:26.549 which eventually moved to the PDP-11 and became what we have today. 0:32:26.549,0:32:28.050 But because 0:32:28.050,0:32:32.120 they were coming, first of all, from Multics, where everything had been done 0:32:32.120,0:32:34.110 in great grandiose detail, 0:32:34.110,0:32:37.549 and because there were fundamentally two of them working on it and they wanted to get something 0:32:37.549,0:32:38.370 done 0:32:38.370,0:32:40.130 within a year or so, 0:32:40.130,0:32:41.529 one of their philosophies was: 0:32:41.529,0:32:44.099 let's find the one way of doing things. 0:32:44.099,0:32:48.180 Let's not have eight ways from Sunday; let's just get the one way, 0:32:48.180,0:32:53.860 and that's what we will provide. So what is the core set of things that we need? 0:32:53.860,0:32:58.620 Well, the first thing is, when it comes to identifiers, let's not have, you know, 0:32:58.620,0:33:00.430 eighty thousand different identifiers. 0:33:00.430,0:33:03.140 So they came up with process identifiers, 0:33:03.140,0:33:09.620 user identifiers and, at that time, a single group identifier, later expanded, 0:33:09.620,0:33:14.200 and they used those identifiers for everything. So they're used for accounting, used for making 0:33:14.200,0:33:17.410 protection decisions, used for scheduling decisions, 0:33:17.410,0:33:19.470 and 0:33:19.470,0:33:24.279 again it was simplicity which was driving their decisions. 0:33:24.279,0:33:28.840 But there were really two key ideas that they had 0:33:28.840,0:33:30.880 that really made the difference, 0:33:30.880,0:33:32.539 that set them apart 0:33:32.539,0:33:34.749 from what everybody else had done before them, 0:33:34.749,0:33:35.450 and which, 0:33:35.450,0:33:39.740 in retrospect, have been pervasive more or less ever since. 0:33:39.740,0:33:41.869 The first of these was the notion 0:33:41.869,0:33:44.840 that we have a uniform descriptor space,
0:33:44.840,0:33:46.289 that is, 0:33:46.289,0:33:51.250 a descriptor can reference any I/O device, 0:33:51.250,0:33:53.650 or even any kind of I/O channel. 0:33:53.650,0:33:58.270 So you can have a descriptor for a terminal, or a descriptor for a file, or a descriptor for 0:33:58.270,0:34:02.240 a disk, or a descriptor for a pipe, or a descriptor for a socket, 0:34:02.240,0:34:03.500 and 0:34:03.500,0:34:04.790 you don't need to know 0:34:04.790,0:34:07.940 what it references in order to be able to read and write that thing. 0:34:07.940,0:34:11.290 So if I hand you a descriptor, you can read from that descriptor or you can write 0:34:11.290,0:34:13.259 to that descriptor, 0:34:13.259,0:34:15.189 and 0:34:15.189,0:34:17.359 the correct thing will happen. 0:34:17.359,0:34:19.089 And you'd say, well, 0:34:19.089,0:34:23.629 that's so obvious, I mean, how else could you possibly think of doing it? 0:34:23.629,0:34:25.179 Well, predating UNIX, 0:34:25.179,0:34:28.059 everything was done with 0:34:28.059,0:34:29.379 separate little subsystems: one 0:34:29.379,0:34:33.419 that would open a file, read a file, write a file, close a file, 0:34:33.419,0:34:37.429 and there was another set of system calls which would open a terminal, read a terminal, write a terminal, 0:34:37.429,0:34:38.089 close a terminal, 0:34:38.089,0:34:39.210 and yet another one, 0:34:39.210,0:34:42.409 which was create a pipe, read a pipe, write a pipe, and so on.
0:34:42.409,0:34:47.699 So if you are just a drop-dead stupid program like, say, cat, 0:34:47.699,0:34:51.579 you would have to have code in there that says: was my input a terminal, in which case I need to 0:34:51.579,0:34:53.159 use read terminal, 0:34:53.159,0:34:57.419 or is it a file, in which case I need to use read file, or is it a pipe, in which case 0:34:57.419,0:34:59.189 I need to use read pipe? 0:34:59.189,0:35:01.860 And so the program itself had to have all this 0:35:01.860,0:35:02.859 code in it, 0:35:02.859,0:35:04.409 whereas when they went to 0:35:04.409,0:35:07.159 the uniform descriptor space, 0:35:07.159,0:35:09.630 cat doesn't know, it doesn't need to know; it just says 0:35:09.630,0:35:10.819 read my input, 0:35:10.819,0:35:13.979 write the output, 0:35:13.979,0:35:17.059 and it works. And if we add a new type of descriptor, 0:35:17.059,0:35:17.600 then 0:35:17.600,0:35:21.700 cat just continues to work just as it always did. 0:35:21.700,0:35:24.199 So this proved to be a very powerful construct, 0:35:24.199,0:35:27.019 and pretty much every operating system after UNIX 0:35:27.019,0:35:28.659 did that. There's 0:35:28.659,0:35:30.210 one exception, a 0:35:30.210,0:35:32.549 large company in the Pacific Northwest 0:35:32.549,0:35:35.830 that still doesn't have quite a uniform descriptor space, 0:35:35.830,0:35:38.380 but that's part of their legacy, and really 0:35:38.380,0:35:39.900 they're working on that. 0:35:39.900,0:35:42.009 Longhorn will be here.
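As a rough sketch of the uniform descriptor idea, here is a cat-like copy loop written against Python's os.read and os.write, which are thin wrappers over the read and write system calls. The function name cat and the 4096-byte buffer are illustrative assumptions, not anything taken from the lecture or from the real cat source. Nothing in the loop asks what kind of object a descriptor refers to.

```python
import os


def cat(in_fd, out_fd, bufsize=4096):
    """Copy everything from in_fd to out_fd using only read and write.

    Like cat, this neither knows nor cares whether a descriptor refers
    to a file, a pipe, a terminal, or a socket."""
    while True:
        chunk = os.read(in_fd, bufsize)  # b"" means end of input
        if not chunk:
            return
        while chunk:  # a write may accept only part of the data
            written = os.write(out_fd, chunk)
            chunk = chunk[written:]
```

Calling cat(0, 1) would copy standard input to standard output whether either one is a terminal, a file, or a pipe; a new kind of descriptor needs no change to this code.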
0:35:42.009,0:35:43.939 And anyway, 0:35:43.939,0:35:46.190 this set of facilities, then, 0:35:46.190,0:35:50.150 makes up the UNIX virtual machine, 0:35:50.150,0:35:51.559 and 0:35:51.559,0:35:55.559 in some sense we still see virtual machines being used today. In fact, we're seeing sort 0:35:55.559,0:35:56.749 of a reversion 0:35:56.749,0:36:01.429 back to some of the IBM stuff in things like VMware, 0:36:01.429,0:36:03.079 which 0:36:03.079,0:36:07.029 essentially allows you to go back to booting native operating systems again. So it's sort of 0:36:07.029,0:36:08.280 interesting to watch 0:36:08.280,0:36:09.060 the 0:36:09.060,0:36:12.919 pendulum going back and forth on what's the correct layer 0:36:12.919,0:36:14.609 for doing 0:36:14.609,0:36:18.890 virtual machines. 0:36:18.890,0:36:22.499 Okay? So far so good? 0:36:22.499,0:36:24.719 All right, so I said that there were 0:36:24.719,0:36:27.160 two key ideas that UNIX had, 0:36:27.160,0:36:30.279 the first of these being the uniform descriptor space. 0:36:30.279,0:36:35.819 The second one, which was really critical, was this notion of processes as a commodity 0:36:35.819,0:36:37.309 item. 0:36:37.309,0:36:40.220 So here on page 17 I've tried to lay out 0:36:40.220,0:36:41.090 the 0:36:41.090,0:36:44.159 components that make up a process, 0:36:44.159,0:36:45.759 and 0:36:45.759,0:36:50.359 what do I really mean when I say a process is a commodity item? 0:36:50.359,0:36:53.650 Okay, leading up to 0:36:53.650,0:36:54.689 UNIX, 0:36:54.689,0:36:56.800 in the systems that predated it, 0:36:56.800,0:36:59.200 processes were these very large, 0:36:59.200,0:37:02.169 heavyweight, expensive things, 0:37:02.169,0:37:02.779 and 0:37:02.779,0:37:04.539 if you look at 0:37:04.539,0:37:08.629 MVS, which was the operating system that ran on IBM machines for doing multiprocessing, 0:37:08.629,0:37:10.509 then 0:37:10.509,0:37:13.799 the system administrator would decide at boot time 0:37:13.799,0:37:17.019 what degree
of multiprocessing they wished to support. 0:37:17.019,0:37:18.140 So they'd say, 0:37:18.140,0:37:20.739 well, we'll let up to six things happen at once, 0:37:20.739,0:37:22.490 and so as part of booting up 0:37:22.490,0:37:24.419 they would create six 0:37:24.419,0:37:25.349 processes. 0:37:25.349,0:37:30.059 And now you as a user, if you wanted to do something, let's say you wanted to 0:37:30.059,0:37:32.009 compile and run a program, 0:37:32.009,0:37:34.960 you would be given a process, 0:37:34.960,0:37:36.019 and it was up to you 0:37:36.019,0:37:39.369 to figure out how to stage what you needed done, 0:37:39.369,0:37:39.819 and 0:37:39.819,0:37:43.930 this was often fairly complex, 0:37:43.930,0:37:47.880 and so you would have to write out all the steps that you wanted 0:37:47.880,0:37:50.300 in this wonderful thing called JCL, 0:37:50.300,0:37:52.259 Job Control Language. 0:37:52.259,0:37:56.650 Job Control Language was the sendmail configuration file of the sixties. 0:37:56.650,0:38:00.679 There were people whose sole job at the company was knowing how to put this stuff together, 'cause 0:38:00.679,0:38:04.189 all you had to do was get one extra space or a missing comma 0:38:04.189,0:38:05.000 or something in there, 0:38:05.000,0:38:08.630 and the whole thing would just blow up. It would just sort of spit the card deck back at 0:38:08.630,0:38:09.799 you and say, well, 0:38:09.799,0:38:13.500 somewhere in there is a mistake that's sort of in the general area of this card, 0:38:13.500,0:38:15.549 and I can't deal with it. Fix it.
0:38:15.549,0:38:16.489 And of course, 0:38:16.489,0:38:20.550 in those days it wasn't just a matter of, you know, hitting carriage return; you had to 0:38:20.550,0:38:25.239 get your deck, pull out the card, type the new one, put it back in and resubmit it, 0:38:25.239,0:38:28.729 and heaven forbid you touch that card reader, you know; it had to be done by 0:38:28.729,0:38:29.970 an operator. 0:38:29.970,0:38:32.869 So the card deck would be read through, it would disappear, and, 0:38:32.869,0:38:36.800 you know, if you're lucky a few minutes later, if you were not lucky a few hours later, 0:38:36.800,0:38:37.849 you would get 0:38:37.849,0:38:39.570 a printout 0:38:39.570,0:38:43.419 of what had happened, and then you could look at it and, you know, 0:38:43.419,0:38:47.209 I put a comma in the wrong place, I guess I get to do it all again. 0:38:47.209,0:38:49.930 So 0:38:49.930,0:38:54.940 the thing you would need to do there for compiling and running a program 0:38:54.940,0:38:59.579 was break it into these steps.
Well, I need to run the preprocessor, 0:38:59.579,0:39:04.670 and so clean out whatever gunk was left over in that process from the previous user, 0:39:04.670,0:39:06.240 put the preprocessor in there, 0:39:06.240,0:39:10.530 and then read from this file here. Let's say I gotta put the output somewhere, so create a 0:39:10.530,0:39:12.510 scratch file over on this disk, and 0:39:12.510,0:39:17.299 it was excruciating detail, like how many cylinders and how many tracks and this and that, 0:39:17.299,0:39:19.139 blocks, blah blah blah, 0:39:19.139,0:39:23.119 and don't forget any of those parameters, 'cause it'll spit it out if you do. 0:39:23.119,0:39:26.890 And so then it would run the first step, and if it's successful, then you'd have sitting 0:39:26.890,0:39:28.899 in this scratch file that you had created 0:39:28.899,0:39:33.100 the output of the preprocessor, and then you'd load the first pass of the compiler, 0:39:33.100,0:39:36.930 and you say, now read from that scratch file and create this other scratch file over here, and 0:39:36.930,0:39:39.450 when that's successful, we need to delete that one, 0:39:39.450,0:39:43.830 and then load the second pass, put that into another scratch file, and then we run the 0:39:43.830,0:39:45.950 assembler, and the optimizer, then the 0:39:45.950,0:39:47.750 loader, this and that, 0:39:47.750,0:39:49.410 finally run the program, 0:39:49.410,0:39:50.900 and if all goes well, 0:39:50.900,0:39:57.029 you know, at step sixteen out comes the answer: 0:39:57.029,0:39:58.129 forty-two. So UNIX 0:39:58.129,0:40:00.819 said, look, this is silly; 0:40:00.819,0:40:02.880 a lot of this is just 0:40:02.880,0:40:04.310 bookkeeping, 0:40:04.310,0:40:07.249 and computers do bookkeeping really well. 0:40:07.249,0:40:12.179 And you'd say, yeah, but it's going to take all these cycles. It's like, 0:40:12.179,0:40:16.309 computers are supposed to be labor-saving devices, right?
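The scratch-file bookkeeping above is what UNIX replaced with processes created on demand and connected by pipes. The following is a minimal sketch of that idea, assuming a POSIX system, built on Python's wrappers for fork, pipe, read, write and waitpid. The helper names run_pipeline, _spawn and _read_all are hypothetical, and each stage is an ordinary function standing in for a pass such as the preprocessor or a compiler pass. Every stage runs in a freshly forked child that is thrown away when it finishes; the caller never creates, names, or deletes an intermediate file.

```python
import os


def _read_all(fd):
    """Read from fd until end of file."""
    chunks = []
    while True:
        chunk = os.read(fd, 4096)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)


def _spawn(stage, in_fd):
    """Fork a child that reads in_fd (if any), applies stage, and writes the
    result into a new pipe. Returns (child pid, read end of that pipe)."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: one short-lived process per stage
        status = 1
        try:
            os.close(r)
            data = _read_all(in_fd) if in_fd is not None else b""
            out = stage(data)
            while out:  # write may be partial; loop until all is sent
                out = out[os.write(w, out):]
            status = 0
        finally:
            os._exit(status)  # never return into the parent's code
    os.close(w)  # parent keeps only the read end
    if in_fd is not None:
        os.close(in_fd)  # the child now owns the previous pipe
    return pid, r


def run_pipeline(data, stages):
    """Feed data through stages, each in its own process, joined by pipes."""
    pids = []
    pid, fd = _spawn(lambda _ignored: data, None)  # source process
    pids.append(pid)
    for stage in stages:
        pid, fd = _spawn(stage, fd)
        pids.append(pid)
    result = _read_all(fd)
    os.close(fd)
    for pid in pids:  # reap every child; the processes are simply discarded
        _, status = os.waitpid(pid, 0)
        if not (os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0):
            raise RuntimeError("pipeline process %d failed" % pid)
    return result
```

For example, run_pipeline(b"hello world", [bytes.upper, lambda b: b[::-1]]) runs three short-lived processes (a source and two stages) and returns b"DLROW OLLEH". Each stage here reads its whole input before writing, which keeps the sketch simple; filters in a real shell pipeline usually stream their data instead.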
So 0:40:16.309,0:40:20.150 they came up with this notion that they would create processes on the fly as needed. 0:40:20.150,0:40:21.159 If you had 0:40:21.159,0:40:25.549 a preprocessor and two passes of the compiler and then 0:40:25.549,0:40:27.109 an optimizer and then a loader, 0:40:27.109,0:40:29.410 we just create, boom, seven processes, 0:40:29.410,0:40:31.920 and we connect them together with pipes, 0:40:31.920,0:40:35.180 and so we take the input and, you know, run it 0:40:35.180,0:40:38.270 through the pipes, and, you know, out the end you get the 0:40:38.270,0:40:39.629 executable, 0:40:39.629,0:40:40.030 and 0:40:40.030,0:40:42.880 we will simply create each of these processes. 0:40:42.880,0:40:44.650 And 0:40:44.650,0:40:46.549 so you as a user just 0:40:46.549,0:40:49.479 type, you know, the C compiler command, and it just 0:40:49.479,0:40:52.429 forks these things, pipes them together, gets the result, 0:40:52.429,0:40:53.640 and 0:40:53.640,0:40:57.509 then once it was done with these processes, it just threw them away. So any time you'd create a 0:40:57.509,0:41:00.479 new process, it came to you pristine and clean, 0:41:00.479,0:41:04.239 and you didn't need to put everything in intermediate files. 0:41:04.239,0:41:07.549 The fact of the matter is, in the early days 0:41:07.549,0:41:08.129 those computers 0:41:08.129,0:41:11.910 didn't really have enough memory to support all that stuff at once, so 0:41:11.910,0:41:15.809 behind your back those pipes were actually implemented as files, 0:41:15.809,0:41:19.319 but at least you didn't have to remember to create them and delete them 0:41:19.319,0:41:20.200 and deal with them; 0:41:20.200,0:41:24.020 as far as you were concerned, it just looked like stuff flowing through pipes. And of course today it 0:41:24.020,0:41:24.490 just 0:41:24.490,0:41:27.989 does flow through pipes in memory. 0:41:27.989,0:41:29.439 Okay, so 0:41:29.439,0:41:33.689 this notion, then, that we're just gonna create processes on the fly as needed and
0:41:33.689,0:41:35.559 connect them together as needed 0:41:35.559,0:41:38.039 was a novel concept, 0:41:38.039,0:41:43.599 and it wasn't that they had somehow mysteriously figured out how to create processes cheaply, 0:41:43.599,0:41:44.839 'cause they hadn't; 0:41:44.839,0:41:46.180 they were still 0:41:46.180,0:41:49.959 really expensive to create, 0:41:49.959,0:41:52.210 but that extra effort 0:41:52.210,0:41:53.029 was 0:41:53.029,0:41:56.089 worth it because it was saving a lot of programming time. 0:41:56.089,0:41:59.809 So my favorite example is: you run ls, 0:41:59.809,0:42:01.810 so we have to create a process, 0:42:01.810,0:42:04.259 load the ls binary into it, 0:42:04.259,0:42:06.180 it prints a line or two on your screen, 0:42:06.180,0:42:10.609 and we tear the entire thing down and return all its resources back to the system. 0:42:10.609,0:42:14.979 More than ninety percent of the cost of running ls is creating and destroying the process; 0:42:14.979,0:42:19.239 a tiny fraction of it is actually running ls, 0:42:19.239,0:42:24.259 but it goes so fast, who cares, right? 0:42:24.259,0:42:25.749 So the point is that 0:42:25.749,0:42:30.039 that concept of just creating things as needed, 0:42:30.039,0:42:31.780 again, was very powerful 0:42:31.780,0:42:35.709 and is one that is just pervasive today. 0:42:35.709,0:42:38.639 Okay, so what is a process actually made up of? 0:42:38.639,0:42:43.179 It gets some amount of CPU time, or at least we do dearly hope that it gets some 0:42:43.179,0:42:46.050 amount of CPU time; the lack of getting CPU time 0:42:46.050,0:42:46.670 is what makes 0:42:46.670,0:42:47.979 a computer so sluggish, 0:42:47.979,0:42:49.409 of course. 0:42:49.409,0:42:51.920 This really boils down to scheduling, 0:42:51.920,0:42:54.249 and we're going to talk about scheduling, 0:42:54.249,0:42:56.279 probably more than you care to, 0:42:56.279,0:42:59.219 in a couple weeks' time. 0:42:59.219,0:43:01.619 We have the asynchronous events; 0:43:01.619,0:43:04.569 these are the
external events that 0:43:04.569,0:43:05.659 are coming in, 0:43:05.659,0:43:07.679 so 0:43:07.679,0:43:10.169 they may be either things that 0:43:10.169,0:43:14.339 are coming in from the outside world, like start, stop and quit, 0:43:14.339,0:43:15.279 or 0:43:15.279,0:43:18.170 out-of-band data arrival notification, that kind of thing, 0:43:18.170,0:43:22.339 or they may in fact be things that the program is bringing down upon itself, 0:43:22.339,0:43:25.590 such as a segmentation fault, a divide by zero, 0:43:25.590,0:43:26.910 or some other 0:43:26.910,0:43:31.959 operation that would normally be viewed as incorrect, 0:43:31.959,0:43:35.849 and so we'll talk about that when we talk about signals. 0:43:35.849,0:43:37.039 Every program 0:43:37.039,0:43:38.899 gets some amount of memory. 0:43:38.899,0:43:42.659 It gets an initial amount when it starts up, and it generally allocates more as it 0:43:42.659,0:43:45.229 goes along. 0:43:45.229,0:43:49.429 This, of course, we will deal with very extensively; we'll spend an entire week on it 0:43:49.429,0:43:54.249 when we talk about how virtual memory is implemented. 0:43:54.249,0:43:54.609 And 0:43:54.609,0:43:57.429 then we get I/O descriptors. 0:43:57.429,0:44:02.259 I used to say that every program had to have at least one I/O descriptor, since 0:44:02.259,0:44:04.910 if it had absolutely no input 0:44:04.910,0:44:06.329 and absolutely no output, 0:44:06.329,0:44:09.049 then it was sort of pointless. 0:44:09.049,0:44:12.900 Of course, I had one of my students come up and point out to me that there is a 0:44:12.900,0:44:13.849 class of programs 0:44:13.849,0:44:16.469 which don't need I/O descriptors, 0:44:16.469,0:44:17.440 and that is 0:44:17.440,0:44:19.549 these things called benchmarks. 0:44:19.549,0:44:23.249 They just compute something; all we really care about is how long it takes them to compute. 0:44:23.249,0:44:24.959 We don't actually care what the answer is. 0:44:24.959,0:44:26.019 In theory we don't. 0:44:26.019,0:44:29.779 I personally
like my benchmarks to print out something so I can see that they're 0:44:29.779,0:44:31.489 computing the right thing, 0:44:31.489,0:44:33.169 but in theory 0:44:33.169,0:44:35.919 that wouldn't be necessary. 0:44:35.919,0:44:38.650 Outside of that class of programs, 0:44:38.650,0:44:42.670 everything needs some sort of descriptors, and of course we'll talk about descriptors 0:44:42.670,0:44:43.659 quite extensively 0:44:43.659,0:44:47.349 as we go through the I/O subsystem. 0:44:47.349,0:44:50.969 Okay, so the executive summary is that processes are 0:44:50.969,0:44:54.969 the fundamental service that is provided by UNIX, 0:44:54.969,0:44:58.430 and 0:44:58.430,0:45:02.849 what we're going to spend essentially the next two and a half weeks working on 0:45:02.849,0:45:04.769 is 0:45:04.769,0:45:07.079 what makes up processes. 0:45:07.079,0:45:10.180 We'll go into much more detail about each of these four points, 0:45:10.180,0:45:11.769 and 0:45:11.769,0:45:13.630 then how we actually go about 0:45:13.630,0:45:14.390 providing that 0:45:14.390,0:45:16.639 bit of service. 0:45:16.639,0:45:17.900 The next thing that I'm 0:45:17.900,0:45:22.210 going to do now is go through and lay out some of the terminology that 0:45:22.210,0:45:23.239 we use when 0:45:23.239,0:45:25.130 we're talking about processes. 0:45:25.130,0:45:29.229 So this is sort of the big picture here; we're on page eighteen, 0:45:29.229,0:45:30.669 and 0:45:30.669,0:45:33.669 you can see we have three pieces that make up 0:45:33.669,0:45:36.640 the system: 0:45:36.640,0:45:39.029 we have the currently running user process, 0:45:39.029,0:45:41.180 and then what we call the top half of the kernel 0:45:41.180,0:45:43.699 and the bottom half of the kernel. 0:45:43.699,0:45:47.049 Now, this would be a picture for a uniprocessor, 0:45:47.049,0:45:49.299 so one CPU. 0:45:49.299,0:45:51.209 If we had a multiprocessor, 0:45:51.209,0:45:54.009 then we would have 0:45:54.009,0:45:57.130 one instance of the kernel
0:45:57.130,0:45:59.529 but multiple instances of the user process, 0:45:59.529,0:46:02.879 and for any given CPU on a multiprocessor, 0:46:02.879,0:46:05.709 it is running exactly one process. 0:46:05.709,0:46:09.309 So you may think that we're running four or five processes all at once, 0:46:09.309,0:46:14.319 but the fact of the matter is that at any instant in time there's only one process which is 0:46:14.319,0:46:16.299 actually running, 0:46:16.299,0:46:18.609 and 0:46:18.609,0:46:21.429 that is the one that we have loaded in the system. 0:46:21.429,0:46:25.199 Now, we give the illusion that we're running lots of things because we switch between them 0:46:25.199,0:46:26.100 rather quickly, 0:46:26.100,0:46:29.269 so it looks like things are happening in all windows at once, 0:46:29.269,0:46:31.430 but in reality 0:46:31.430,0:46:33.619 that's not really happening. 0:46:33.619,0:46:36.440 Okay, so there is a set of properties that I want to look at 0:46:36.440,0:46:40.899 that have to do with each one of these parts here, 0:46:40.899,0:46:44.359 but just to sort of look at it from the big-picture perspective, 0:46:44.359,0:46:45.970 what you see here 0:46:45.970,0:46:47.180 is that 0:46:47.180,0:46:51.549 there is a boundary between the user process and the top half of the kernel, 0:46:51.549,0:46:54.949 which is really just like a glorified subroutine call. 0:46:54.949,0:46:59.539 It's a lot like calling into a library routine, like calling strcat, strcpy or something 0:46:59.539,0:47:00.319 like that. 0:47:00.319,0:47:03.679 When you do a system call, 0:47:03.679,0:47:05.650 we take that same set of parameters. 0:47:05.650,0:47:08.009 Now, there's sort of a 0:47:08.009,0:47:09.780 brick wall here, if you will, 0:47:09.780,0:47:11.380 that is protecting 0:47:11.380,0:47:13.680 the top half of the kernel 0:47:13.680,0:47:15.299 from the application. 0:47:15.299,0:47:18.899 I'll go into more detail about how that actually gets implemented, 0:47:18.899,0:47:22.729 but in essence you can
think of it as there being sort of this Wailing Wall with these little 0:47:22.729,0:47:24.990 chinks in it and you can sort of push a request through 0:47:24.990,0:47:28.230 and somebody on the other side sort of pulls it out, looks at it and decides whether they're going 0:47:28.230,0:47:28.690 to 0:47:28.690,0:47:30.769 deign to provide service to you 0:47:30.769,0:47:34.229 and if they do then they sort of send it back 0:47:34.229,0:47:37.649 whereas with a library you can just sort of reach in and walk around if you want to 0:47:37.649,0:47:38.290 with 0:47:38.290,0:47:40.950 good programming practices you don't do that but 0:47:40.950,0:47:43.049 you could 0:47:43.049,0:47:44.579 all right so 0:47:44.579,0:47:49.089 the top half of the kernel really looks a lot like 0:47:49.089,0:47:50.509 a big library 0:47:50.509,0:47:53.509 it just happens to be library routines 0:47:53.509,0:47:57.599 that deal with things where processes need to interact with each other 0:47:57.599,0:48:01.399 in fact many people don't understand what the difference is between the C 0:48:01.399,0:48:03.259 library and the top half of the kernel 0:48:03.259,0:48:08.020 if it's something that you're doing that no other process needs to know about 0:48:08.020,0:48:09.799 then it can be in the C library 0:48:09.799,0:48:13.829 so if you call strcat to concatenate two strings together 0:48:13.829,0:48:17.599 nobody else needs to know you're doing that, you don't need to coordinate with anybody 0:48:17.599,0:48:19.000 else about doing that 0:48:19.000,0:48:20.160 it's just happening 0:48:20.160,0:48:21.979 so that goes in the C library. 
0:48:21.979,0:48:24.489 on the other hand if you're reading or writing a file 0:48:24.489,0:48:28.029 there may be other processes that are also reading and writing that file 0:48:28.029,0:48:29.910 and therefore that 0:48:29.910,0:48:31.579 has to be done by the kernel 0:48:31.579,0:48:33.120 because it can coordinate 0:48:33.120,0:48:37.189 all the different processes that are trying to access that file. 0:48:37.189,0:48:40.529 so the top half of the kernel is pretty straightforward code 0:48:40.529,0:48:45.539 it looks a lot like any other library that you would write if you look at top half kernel 0:48:45.539,0:48:49.640 code you know you see oh a read came in, it's got these parameters, we muck around, we 0:48:49.640,0:48:53.719 get some data, we put it in the buffer and we return 0:48:53.719,0:48:57.470 and in fact writing code for the top half of the kernel is 0:48:57.470,0:48:59.729 not all that difficult to do 0:48:59.729,0:49:00.989 you 0:49:00.989,0:49:01.959 have 0:49:01.959,0:49:05.939 many of the same properties that you would when you're writing user level application 0:49:05.939,0:49:07.529 code 0:49:07.529,0:49:11.779 the bottom half of the kernel is where things start to get nasty 0:49:11.779,0:49:14.820 because the bottom half of the kernel is the part of the system 0:49:14.820,0:49:18.769 that deals with all of the asynchronous events in the system 0:49:18.769,0:49:22.179 things like device drivers, 0:49:22.179,0:49:23.779 timers, 0:49:23.779,0:49:25.010 that level of thing 0:49:25.010,0:49:28.029 that are driven by hardware events 0:49:28.029,0:49:28.659 so 0:49:28.659,0:49:31.459 for example a packet arrives on the network 0:49:31.459,0:49:33.670 that causes an interrupt to come in and 0:49:33.670,0:49:36.729 that will be handled by the bottom half of the kernel 0:49:36.729,0:49:38.829 and historically 0:49:38.829,0:49:43.079 when an interrupt came in it preempted whatever else was going on 0:49:43.079,0:49:45.400 and it ran 
until it finished and then it returned 0:49:45.400,0:49:46.539 and it could not 0:49:46.539,0:49:49.439 go to sleep to wait for resources or other things 0:49:49.439,0:49:51.339 in current systems 0:49:51.339,0:49:54.549 you can actually go to sleep in the interrupt handler and wait for 0:49:54.549,0:49:56.739 some other activity to complete 0:49:56.739,0:49:58.259 it is however 0:49:58.259,0:50:00.799 not a good idea to do that 0:50:00.799,0:50:01.909 because 0:50:01.909,0:50:06.739 the usual case for most device drivers is that they can finish whatever they're doing in an interrupt 0:50:06.739,0:50:08.579 without ever blocking 0:50:08.579,0:50:09.580 and so 0:50:09.580,0:50:13.649 when an interrupt comes in we assume that you're not going to sleep 0:50:13.649,0:50:14.710 and if you actually 0:50:14.710,0:50:17.219 then go to sleep, oh man 0:50:17.219,0:50:20.469 you didn't tell us you were going to do this, we have to go off and do a whole lot of other work 0:50:20.469,0:50:23.029 that we had not originally planned on doing 0:50:23.029,0:50:25.460 so if you go to sleep in a device driver 0:50:25.460,0:50:28.209 you are taking a very serious performance hit 0:50:28.209,0:50:31.019 so it's highly recommended that you don't do that 0:50:31.019,0:50:33.130 but if you have to you can 0:50:33.130,0:50:35.809 and it's because of this historic behavior 0:50:35.809,0:50:39.899 of not being able to sleep in the bottom half of the kernel 0:50:39.899,0:50:42.119 that you have certain properties that have 0:50:42.119,0:50:44.769 carried over in device drivers 0:50:44.769,0:50:45.940 and that is 0:50:45.940,0:50:50.369 that a device driver should be handed all the resources it needs to get its job done 0:50:50.369,0:50:54.490 you don't tell a disk device driver go read this 0:50:54.490,0:50:56.549 and put it somewhere 0:50:56.549,0:50:57.580 you have to say 0:50:57.580,0:50:59.410 go read this particular block 0:50:59.410,0:51:02.650 here is a chunk of memory that I want that data to 
put in 0:51:02.650,0:51:03.959 and 0:51:03.959,0:51:06.169 notify me when it's done 0:51:06.169,0:51:06.970 because 0:51:06.970,0:51:10.660 things like allocating memory are classic places where you end up having to go to sleep 0:51:10.660,0:51:12.939 to wait for stuff to happen 0:51:12.939,0:51:14.449 and 0:51:14.449,0:51:16.390 historically you couldn't do that 0:51:16.390,0:51:18.640 and even currently you don't want to have to do that 0:51:18.640,0:51:23.400 so device drivers generally have all resources preallocated 0:51:23.400,0:51:25.169 and then they can just go 0:51:25.169,0:51:27.279 the one place where this doesn't work 0:51:27.279,0:51:29.029 is the network 0:51:29.029,0:51:30.929 and in particular 0:51:30.929,0:51:34.630 you don't know when somebody's going to send packets to you 0:51:34.630,0:51:37.040 you say well you're looking to open connections 0:51:37.040,0:51:39.360 but if you're doing something like IP forwarding 0:51:39.360,0:51:40.969 there's no 0:51:40.969,0:51:45.039 top half state dealing with these packets, they're just coming in on one interface and being 0:51:45.039,0:51:46.719 sent out on another interface 0:51:46.719,0:51:50.630 they never pass through any part of the top half of the kernel 0:51:50.630,0:51:53.529 and so in the case of network device drivers 0:51:53.529,0:51:56.149 they need to allocate memory 0:51:56.149,0:51:56.640 and 0:51:56.640,0:51:58.829 if memory gets into short supply 0:51:58.829,0:52:01.689 and they try to allocate memory and it's not available 0:52:01.689,0:52:05.049 historically they couldn't wait for memory to be available 0:52:05.049,0:52:08.380 and even in practice today they don't wait 0:52:08.380,0:52:09.580 for memory to become available 0:52:09.580,0:52:12.469 they simply drop the packet on the floor 0:52:12.469,0:52:18.109 it's like well I didn't have any place to put it, sorry, oops 0:52:18.109,0:52:20.940 now that doesn't cause incorrect behavior 0:52:20.940,0:52:24.369 because the higher level protocols will 
retransmit 0:52:24.369,0:52:29.140 but it does cause great performance problems because retransmission means that connections 0:52:29.140,0:52:29.879 stall 0:52:29.879,0:52:31.110 they have to back up 0:52:31.110,0:52:33.010 they have to resend data 0:52:33.010,0:52:33.739 and so on 0:52:33.739,0:52:38.739 so you really want to avoid dropping packets if you can possibly help it 0:52:38.739,0:52:42.029 and consequently 0:52:42.029,0:52:43.420 we tend to 0:52:43.420,0:52:46.499 preallocate a certain amount of memory for the network drivers 0:52:46.499,0:52:48.299 and 0:52:48.299,0:52:52.169 we try very hard to make sure that we're not going to run out of memory but 0:52:52.169,0:52:54.869 if packets come fast enough and we can't deal with them 0:52:54.869,0:52:57.940 as quickly as they are arriving then over a short period of time 0:52:57.940,0:53:03.489 we get to the point where we simply have to start dropping packets 0:53:03.489,0:53:07.649 okay this is a part of the kernel that you do not wish to write code for 0:53:07.649,0:53:10.919 because it is extremely difficult to debug 0:53:10.919,0:53:12.759 you get these bugs where 0:53:12.759,0:53:18.779 the only time it happens is on the third Tuesday when there's a full moon 0:53:18.779,0:53:19.300 and 0:53:19.300,0:53:24.199 we have a disk interrupt followed by a terminal character coming in 0:53:24.199,0:53:28.289 and a network packet arriving of size fifteen twenty-two 0:53:28.289,0:53:30.109 and when all those things happen 0:53:30.109,0:53:32.719 the system panics 0:53:32.719,0:53:37.380 and of course it panics because you're following some bad pointer 0:53:37.380,0:53:40.969 to something that should have been there but was freed some time in the distant past 0:53:40.969,0:53:42.930 we are not sure when 0:53:42.930,0:53:44.049 and 0:53:44.049,0:53:47.400 trying to debug things like that is extremely difficult 0:53:47.400,0:53:48.509 and you can 0:53:48.509,0:53:52.120 think well I think I found the 
problem but it's not reproducible 0:53:52.120,0:53:55.530 you know you have to wait for the next third Tuesday with a full moon and blah blah blah 0:53:55.530,0:53:56.950 to happen 0:53:56.950,0:53:57.469 and 0:53:57.469,0:54:01.449 you know so you sort of statistically guess that you fixed it, you know I was getting 0:54:01.449,0:54:03.510 this bug once every three days 0:54:03.510,0:54:06.099 and now it's gone two weeks without happening 0:54:06.099,0:54:07.239 did you fix it? 0:54:07.239,0:54:08.969 or have you just been lucky? 0:54:08.969,0:54:10.459 and it's 0:54:10.459,0:54:14.349 that coupled with the fact that you're dealing with hardware 0:54:14.349,0:54:18.049 and hardware rarely works the way it's documented to work 0:54:18.049,0:54:21.770 and so you're doing everything that it says you're supposed to do 0:54:21.770,0:54:26.260 and it still doesn't work because you didn't set the fiddle bit over in that other place over 0:54:26.260,0:54:26.660 there 0:54:26.660,0:54:30.479 that's not documented anywhere but if it's not set it doesn't work 0:54:30.479,0:54:33.769 occasionally 0:54:33.769,0:54:36.110 so this is another reason that you really want to avoid 0:54:36.110,0:54:40.459 dealing with this part of the system if you can possibly help it 0:54:40.459,0:54:44.369 okay but let's go through and look at some of the properties here starting up at 0:54:44.369,0:54:45.789 the user process 0:54:45.789,0:54:47.980 we're running with 0:54:47.980,0:54:51.449 preemptive scheduling 0:54:51.449,0:54:53.409 now there are several caveats here 0:54:53.409,0:54:55.239 preemptive scheduling is the default 0:54:55.239,0:54:56.970 with the so-called timeshare scheduler 0:54:56.970,0:55:01.360 that is what you normally use; there are other schedulers like the real-time scheduler 0:55:01.360,0:55:02.869 where what I'm saying isn't true 0:55:02.869,0:55:05.709 we'll talk about some of the schedulers later 0:55:05.709,0:55:09.930 but the usual scheduler that you're
running on under UNIX is a timeshare scheduler 0:55:09.930,0:55:13.229 and under the timeshare scheduler user applications 0:55:13.229,0:55:15.159 run with preemptive scheduling 0:55:15.159,0:55:17.449 and preemptive scheduling means that 0:55:17.449,0:55:20.019 you run at the whim of the system 0:55:20.019,0:55:21.420 if it wants you to run 0:55:21.420,0:55:22.140 you run 0:55:22.140,0:55:25.490 once you start running you have no guarantee of how long you're going to run 0:55:25.490,0:55:29.370 it might let you run for three instructions and then decide it doesn't like you anymore 0:55:29.370,0:55:31.150 it wants to run something else 0:55:31.150,0:55:35.920 or you might get to run for several seconds in a row with no intervening 0:55:35.920,0:55:37.469 things interrupting you 0:55:37.469,0:55:39.719 you just don't know 0:55:39.719,0:55:40.969 and 0:55:40.969,0:55:42.839 really all you know is 0:55:42.839,0:55:43.569 that 0:55:43.569,0:55:48.239 they claim that they're using statistics and that the statistics are fair 0:55:48.239,0:55:55.059 and so on average you're going to get a reasonable amount of time but that's 0:55:55.059,0:55:57.129 up to the system, you don't control that 0:55:57.129,0:55:58.439 the real point here 0:55:58.439,0:56:01.940 is that you don't have any way of creating a critical section 0:56:01.940,0:56:04.950 you can't say okay I don't want to be interrupted 0:56:04.950,0:56:07.429 during this particular sequence of things 0:56:07.429,0:56:09.809 so you have to program 0:56:09.809,0:56:13.469 assuming that you may be interrupted at any point 0:56:13.469,0:56:14.979 okay 0:56:14.979,0:56:18.909 the next thing is that when you're running in a user process 0:56:18.909,0:56:20.719 you are running 0:56:20.719,0:56:24.150 with the processor in what's called unprivileged mode 0:56:24.150,0:56:28.109 one of the requirements for running any kind of a UNIX system 0:56:28.109,0:56:31.759 is that you have to have a processor that 
supports privileged and unprivileged 0:56:31.759,0:56:33.709 two different modes of operation 0:56:33.709,0:56:37.049 in privileged mode which is what the kernel runs in 0:56:37.049,0:56:38.950 the entire repertoire 0:56:38.950,0:56:40.869 of the hardware is available 0:56:40.869,0:56:45.339 by this I mean you can set all the registers, you can fiddle with the memory management 0:56:45.339,0:56:47.460 unit, you can initiate I/O 0:56:47.460,0:56:50.519 you can access any memory anywhere 0:56:50.519,0:56:51.919 etc 0:56:51.919,0:56:56.540 when you're running in unprivileged mode which is what user processes run in 0:56:56.540,0:57:00.709 there is a large subset of the instructions which you cannot execute 0:57:00.709,0:57:03.480 you cannot initiate I/O on 0:57:03.480,0:57:04.209 devices 0:57:04.209,0:57:06.770 you cannot change the memory mapping 0:57:06.770,0:57:10.209 you cannot access memory that's not part of your address space 0:57:10.209,0:57:13.299 you cannot execute certain instructions like halt 0:57:13.299,0:57:15.589 and 0:57:15.589,0:57:19.039 so in general you are prevented 0:57:19.039,0:57:21.789 from manipulating anything that's outside of your address space 0:57:21.789,0:57:23.759 this of course is desirable because 0:57:23.759,0:57:27.059 when you're running in this unprivileged mode 0:57:27.059,0:57:28.300 you're protected 0:57:28.300,0:57:31.910 from other processes manipulating you and they're protected from you manipulating 0:57:31.910,0:57:33.079 them 0:57:33.079,0:57:36.430 for those of you that have had the misfortune to have to use 0:57:36.430,0:57:39.339 early versions of Windows up to about ninety-eight 0:57:39.339,0:57:42.470 they always ran with the processor running in privileged mode 0:57:42.470,0:57:44.009 even in applications 0:57:44.009,0:57:46.459 and so either maliciously or accidentally 0:57:46.459,0:57:50.000 you could stomp on other people's address spaces or you could stomp on the kernel 0:57:50.000,0:57:53.020 and a lot of the blue 
screen of death was people just 0:57:53.020,0:57:56.319 following wild pointers and trashing different parts of the system 0:57:56.319,0:57:58.819 taking everything down 0:57:58.819,0:58:00.020 it also makes it 0:58:00.020,0:58:02.320 far easier to 0:58:02.320,0:58:05.459 implement things like viruses and worms and other things because 0:58:05.459,0:58:09.619 a user application can rewrite the boot block on the disk, they can just reach down 0:58:09.619,0:58:13.109 and manipulate the registers that allow them to do whatever they want 0:58:13.109,0:58:16.730 whereas when you're running in unprivileged mode you can't do those kinds 0:58:16.730,0:58:20.179 of things 0:58:20.179,0:58:24.119 so modern versions of Windows, anything from about 2000 on 0:58:24.119,0:58:26.630 now run with privileged and unprivileged modes 0:58:26.630,0:58:28.649 but UNIX has always required that 0:58:28.649,0:58:30.219 and so when you're running a 0:58:30.219,0:58:31.319 user process 0:58:31.319,0:58:33.389 you cannot block, I mean 0:58:33.389,0:58:37.969 you cannot execute the instructions which cause a context switch to occur 0:58:37.969,0:58:40.349 you can't pick what's going to run next 0:58:40.349,0:58:43.140 you can't make that thing run next, all you can do 0:58:43.140,0:58:45.189 is go to the operating system and say 0:58:45.189,0:58:49.269 hey I've got nothing to do. 
pick somebody else to run 0:58:49.269,0:58:53.449 and the operating system is the thing that can then execute the instructions which cause 0:58:53.449,0:58:57.609 a different process to be loaded 0:58:57.609,0:58:59.049 and run 0:58:59.049,0:59:03.400 all right, finally, while you're in a user application you're running on a user stack 0:59:03.400,0:59:06.410 that's part of the user's address space 0:59:06.410,0:59:07.889 so 0:59:07.889,0:59:10.819 part of creating a process gives you a runtime stack 0:59:10.819,0:59:14.369 as part of a virtual address space and so it can be 0:59:14.369,0:59:18.199 more or less up to the limits of the hardware as big as you want it to be 0:59:18.199,0:59:19.949 so if you are running on a thirty-two-bit processor 0:59:19.949,0:59:22.819 your stack can get to 2 gigabytes 0:59:22.819,0:59:23.319 and 0:59:23.319,0:59:26.839 what this means is that anytime you allocate local variables 0:59:26.839,0:59:28.529 you don't have to worry about oh 0:59:28.529,0:59:30.609 is that gonna overrun my stack? 0:59:30.609,0:59:31.610 so if you need 0:59:31.610,0:59:35.519 a hundred thousand double precision floating point numbers 0:59:35.519,0:59:37.189 you can just allocate, as a local variable, 0:59:37.189,0:59:40.269 an array of a hundred thousand doubles 0:59:40.269,0:59:44.029 and it just decrements your stack pointer by eight hundred thousand bytes 0:59:44.029,0:59:45.009 and away you go 0:59:45.009,0:59:47.299 it's just virtual address space 0:59:47.299,0:59:49.020 as you'll see when we get into the kernel 0:59:49.020,0:59:50.210 that ceases to be the case