diff options
Diffstat (limited to 'en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv')
-rw-r--r-- | en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv | 1645 |
1 files changed, 0 insertions, 1645 deletions
diff --git a/en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv b/en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv deleted file mode 100644 index bf785730a9..0000000000 --- a/en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv +++ /dev/null @@ -1,1645 +0,0 @@ -0:00:00.530,0:00:01.590 -So basically, - -0:00:04.590,0:00:10.029 -we are going to look, mainly in this second part, -at how to - -0:00:10.029,0:00:11.519 -handle some - -0:00:11.519,0:00:12.560 -locking problems - -0:00:12.560,0:00:17.910 -that categorize in the kernel. - -0:00:17.910,0:00:24.410 -Here, there are described two kinds of problems -you can get with locks, that are pretty much common. - -0:00:24.410,0:00:27.859 -The first one is called Lock Order Reversal (LOR). - -0:00:27.859,0:00:30.140 -When you have for example a thread A, - -0:00:30.140,0:00:32.340 -which owns - -0:00:32.340,0:00:35.870 -a lock code, for example L1 - -0:00:35.870,0:00:37.920 -and another thread B - -0:00:37.920,0:00:40.070 -which owns the lock, L2 - -0:00:40.070,0:00:43.150 -Then thread A tries to.. - -0:00:43.150,0:00:44.730 -Right.. it's wrong. - -0:00:44.730,0:00:46.220 -The slide is wrong. - -0:00:46.220,0:00:48.020 -The slide is wrong. - -0:00:48.020,0:00:51.910 -Thread A tries to acquire L2, - -0:00:51.910,0:00:55.670 -but obviously it sleeps because -it is owned by thread B - -0:00:55.670,0:00:58.530 -and then thread B tries to acquire - -0:00:58.530,0:01:00.030 -the lock L1 - -0:01:00.030,0:01:02.240 -and it sleeps because - -0:01:02.240,0:01:06.440 -it's owned by thread B. - -0:01:06.440,0:01:11.410 -This is a situation that never ends and -it's pretty much well documented in Cormen - -0:01:11.410,0:01:16.150 -and in traditional literature. - -0:01:16.150,0:01:18.650 -It's a classical deadlock. - -0:01:18.650,0:01:19.960 -This means that, - -0:01:19.960,0:01:21.950 -as everybody who - -0:01:21.950,0:01:24.940 -has ever read a book - -0:01:24.940,0:01:25.899 -about an operating system, - -0:01:25.899,0:01:30.420 -knows that - -0:01:30.420,0:01:32.910 -locks should maintain - -0:01:32.910,0:01:34.319 -an ordering in regard of each other. - -0:01:34.319,0:01:38.859 -That's not very simple when - -0:01:38.859,0:01:40.100 -you speak about a kernel. - -0:01:40.100,0:01:44.850 -From this point of view, the fact -that there are 3 kinds of classes of locks - -0:01:44.850,0:01:49.180 -is going to count because you can -never mix two different kinds of locks. - -0:01:49.180,0:01:50.680 -For example - -0:01:50.680,0:01:51.610 -a spinlock - -0:01:51.610,0:01:53.770 -and a mutex - -0:01:53.770,0:01:59.120 -can be mixed in this way. - -0:01:59.120,0:02:01.720 -You can have the mutex first and the spinlock later, -while the opposite is not actually true. - -0:02:01.720,0:02:07.060 -So, you will see that these kind -of deadlocks are possible - -0:02:07.060,0:02:09.290 -only in the same class of locks, - -0:02:09.290,0:02:13.019 -like for example 2 mutex or 2 spin mutex, - -0:02:13.019,0:02:14.569 -or such. - -0:02:14.569,0:02:16.090 - - -0:02:16.090,0:02:17.409 -Also, - -0:02:17.409,0:02:19.949 -even if it's not very well documented, - -0:02:19.949,0:02:22.880 -for example spinlocks - -0:02:22.880,0:02:26.599 -in FreeBSD, have a way to -identify such kind of deadlocks. - -0:02:26.599,0:02:27.619 -And it's pretty much implemented. - -0:02:27.619,0:02:29.709 - - -0:02:29.709,0:02:32.449 -It's a feature enabled in the code. - -0:02:32.449,0:02:34.949 -They just count how many times they are spinning - -0:02:34.949,0:02:36.010 -and - -0:02:36.010,0:02:39.169 -if it exceeds - -0:02:39.169,0:02:41.379 -an exaggerated result, - -0:02:41.379,0:02:47.870 -it means that they are probably -under a deadlock and the system panics. - -0:02:47.870,0:02:52.489 -Another common problem about locking -is when you have - -0:02:52.489,0:02:55.189 -a wait channel - -0:02:55.189,0:02:56.659 -like a condition variable (cond var), - -0:02:56.659,0:02:58.849 -which it is protected by a lock. - -0:02:58.849,0:03:03.629 -If there are some ratio chances that -this condition variable encounters, - -0:03:03.629,0:03:05.489 -for example, - -0:03:05.489,0:03:07.219 -when cond var are chased - -0:03:07.219,0:03:12.649 -with some preliminary conditions, -like a waiter's counter - -0:03:12.649,0:03:16.359 -that has to be updated - -0:03:16.359,0:03:21.249 -anytime that any thread tries to sleep -on the counter or flags to be set. - -0:03:21.249,0:03:24.909 -If these conditions are not protected -by the same lock, - -0:03:24.909,0:03:30.569 -you can end up having some threads -sleeping on this wait channel - -0:03:30.569,0:03:34.589 -and nobody is going to wake them up again. - -0:03:34.589,0:03:37.629 -This is usually called missed wakeup - -0:03:37.629,0:03:41.249 -and it's a pretty common mistake - -0:03:41.249,0:03:44.799 -that leads to a deadlock. - -0:03:44.799,0:03:46.719 -The problem is that - -0:03:46.719,0:03:52.109 -it's very difficult to differentiate -between missed wakeup and - -0:03:52.109,0:03:53.480 -for example - -0:03:53.480,0:03:56.189 -forever sleep - -0:03:56.189,0:03:58.419 -of a thread - -0:03:58.419,0:04:01.859 -that is not likely to be awaken. - -0:04:01.859,0:04:07.109 -So these kind of deadlocks are -very very difficult to be discovered - -0:04:07.109,0:04:11.669 -and will require some bit of -work that we will see right now. - -0:04:11.669,0:04:14.509 -For example, - -0:04:14.509,0:04:15.270 -using - -0:04:15.270,0:04:16.219 -some - -0:04:16.219,0:04:18.179 -kernel systems - -0:04:18.179,0:04:22.240 -and some things integrated into the debugger. - -0:04:22.240,0:04:22.979 - - -0:04:22.979,0:04:25.520 -In FreeBSD, - -0:04:25.520,0:04:29.859 -we have quite a lot of good mechanisms -we can use to cope - -0:04:29.859,0:04:32.080 -with kernel problems. - -0:04:32.080,0:04:36.539 -The first one (and the most important) -is called WITNESS. - -0:04:36.539,0:04:39.169 -It was introduced - -0:04:39.169,0:04:42.080 -in the context of SMPng - -0:04:42.080,0:04:44.979 -and has been rewritten in the recent past, - -0:04:44.979,0:04:47.919 -mainly by a contribution of - -0:04:47.919,0:04:51.360 -Isilon systems. - -0:04:51.360,0:04:52.270 -They contributed back then - -0:04:52.270,0:04:54.989 -to the writing of WITNESS. - -0:04:54.989,0:04:57.389 -This subsystem is very important - -0:04:57.389,0:05:02.730 -because it tracks down exactly every order - -0:05:02.730,0:05:03.949 -of the locks. - -0:05:03.949,0:05:07.810 -So that, if there is an ordering violation like a LOR, - -0:05:07.810,0:05:09.439 -it's going to - -0:05:09.439,0:05:12.150 -tell the system. - -0:05:12.150,0:05:18.029 -You can even set it to directly panic if -it finds some deadlocks, - -0:05:18.029,0:05:19.879 -or only - -0:05:19.879,0:05:22.729 -some possible deadlocks. - -0:05:22.729,0:05:27.260 -Another important feature with it is - -0:05:27.260,0:05:32.569 -that it can keep track of read/write locks. - -0:05:32.569,0:05:33.690 -Doing that, - -0:05:33.690,0:05:36.539 -we can identify - -0:05:36.539,0:05:38.419 -deadlocks, possibly - -0:05:38.419,0:05:39.500 -even - -0:05:39.500,0:05:40.690 -on the - -0:05:40.690,0:05:45.529 -reader's path. - -0:05:45.529,0:05:49.609 -We could say that WITNESS is pretty big, - -0:05:49.609,0:05:52.289 -so activating it - -0:05:52.289,0:05:55.039 -in your production system is never an option. - -0:05:55.039,0:05:59.929 -It's mainly used when you are going to -develop a new feature in the kernel - -0:05:59.929,0:06:02.110 -and you are going to test it heavily. - -0:06:02.110,0:06:05.479 -In particular if it has - -0:06:05.479,0:06:06.819 -some - -0:06:06.819,0:06:10.509 -relation to locking. - -0:06:10.509,0:06:13.089 - - -0:06:13.089,0:06:17.840 -We could also tell that with the new code -provided by Isilon and Nokia, - -0:06:17.840,0:06:19.150 -basically - -0:06:19.150,0:06:21.689 - - -0:06:21.689,0:06:25.479 -offered by WITNESS is greatly reduced to about - -0:06:25.479,0:06:27.699 -the 10th part of - -0:06:27.699,0:06:30.240 -what we had before. - -0:06:30.240,0:06:36.150 -WITNESS is very good at tracking LOR, - -0:06:36.150,0:06:37.849 -but - -0:06:37.849,0:06:40.009 -it's not very good at, for example, - -0:06:40.009,0:06:42.449 -trying to - -0:06:42.449,0:06:44.060 -help you - -0:06:44.060,0:06:47.479 -in the case of lost wakeups, - -0:06:47.479,0:06:49.519 -because of its nature, - -0:06:49.519,0:06:52.090 -mainly. - -0:06:52.090,0:06:55.889 -It has a very good integration with the DDB debugger - -0:06:55.889,0:06:57.740 -and - -0:06:57.740,0:06:58.879 -basically - -0:06:58.879,0:07:04.159 -it's in the 8th release, -we have new features - -0:07:04.159,0:07:05.759 -that help you - -0:07:05.759,0:07:08.389 -print out backtraces - -0:07:08.389,0:07:11.150 -of the contesting threads on the LORs - -0:07:11.150,0:07:16.039 -and their orderings - -0:07:16.039,0:07:17.549 -and - -0:07:17.549,0:07:23.550 -it shows some graphs of the relations -even from the user space. - -0:07:23.550,0:07:28.550 -You don't have to go into the kernel -debugger to look at it's output. - -0:07:28.550,0:07:35.550 - - -0:07:35.620,0:07:37.380 - - -0:07:37.380,0:07:42.250 -Well, I see that sometimes when -they are released there is a confusion - -0:07:42.250,0:07:44.250 -about the information reports - -0:07:44.250,0:07:48.440 -in regard of deadlocks conditions and what help - -0:07:48.440,0:07:50.020 -users can provide to developers - -0:07:50.020,0:07:52.039 -about that. - -0:07:52.039,0:07:54.020 -So we are going to see - -0:07:54.020,0:07:58.700 -all the relevant information -when a deadlock - -0:07:58.700,0:07:59.590 -is in the kernel. - -0:07:59.590,0:08:02.490 - - -0:08:02.490,0:08:03.389 -Usually, - -0:08:03.389,0:08:07.939 -if you want to find a deadlock -that's happening in the kernel, - -0:08:07.939,0:08:10.909 -your first line of analysis starts from the DDB - -0:08:10.909,0:08:13.919 -instead of a post-mortem analysis, - -0:08:13.919,0:08:16.839 -which is even more important. - -0:08:16.839,0:08:22.330 -But, using DDB you will get more -processes and better information. - -0:08:22.330,0:08:24.970 - - -0:08:24.970,0:08:28.499 -The most important unit in order to find the deadlock - -0:08:28.499,0:08:34.389 -are the LORs reported by WITNESS in order -to see if there is something strange - -0:08:34.389,0:08:36.690 -that can be happening. - -0:08:36.690,0:08:41.700 -You want to know the state of all the threads -that are running on the system that is deadlocking. - -0:08:41.700,0:08:42.900 - - -0:08:42.900,0:08:47.050 -You can see that you're deadlocking, if you see that - -0:08:47.050,0:08:48.070 -on the runqueue - -0:08:48.070,0:08:48.540 -there are - -0:08:48.540,0:08:51.850 -just idle threads. - -0:08:51.850,0:08:56.640 -Because it's like saying that -runqueues are complete flushed - -0:08:56.640,0:09:02.450 -and you have all the threads sleeping -in their own containers. - -0:09:02.450,0:09:07.850 -You need to know which are the exact locks -that are acquired - -0:09:07.850,0:09:11.270 -in the system - -0:09:11.270,0:09:15.570 -and that's something that WITNESS provides - -0:09:15.570,0:09:20.720 -and the very important thing is -to know why the threads are stopping. - -0:09:20.720,0:09:24.250 -So one of the most important things is -retrieving what the threads were doing - -0:09:24.250,0:09:26.320 -when - -0:09:26.320,0:09:28.960 -they were put asleep. - -0:09:28.960,0:09:30.070 - - -0:09:30.070,0:09:33.009 -The backtraces of all the threads involved - -0:09:33.009,0:09:37.130 -are printed out in order to identify deadlocks. - -0:09:37.130,0:09:38.589 -In the case that - -0:09:38.589,0:09:42.830 -buffered cache and VFS are - -0:09:42.830,0:09:45.910 -probably parts of the deadlocking, - -0:09:45.910,0:09:50.790 -you should also print out - -0:09:50.790,0:09:53.420 -the information about vnodes - -0:09:53.420,0:09:58.250 -and what we're interested in is which vnodes are called, - -0:09:58.250,0:09:59.320 -which - -0:09:59.320,0:10:01.370 -are actually referenced - -0:10:01.370,0:10:03.530 -and - -0:10:03.530,0:10:10.530 -in which way they were called. - -0:10:11.030,0:10:13.380 -So, - -0:10:13.380,0:10:15.770 -this is an example - -0:10:15.770,0:10:17.430 -of the - -0:10:17.430,0:10:18.880 -thread states - -0:10:18.880,0:10:20.760 -in the case of a deadlock. - -0:10:20.760,0:10:27.480 -This is an real example of a deadlock - -0:10:27.480,0:10:28.900 -but you can see - -0:10:28.900,0:10:31.890 -that - -0:10:31.890,0:10:35.650 -this is not totally complete. - -0:10:35.650,0:10:38.450 -But you can see that all the threads are sleeping. - -0:10:38.450,0:10:39.870 - - -0:10:39.870,0:10:43.580 -This one is the message - -0:10:43.580,0:10:44.790 -used by the wait channel - -0:10:44.790,0:10:47.550 -on which they're sleeping on - -0:10:47.550,0:10:48.710 -or used by - -0:10:48.710,0:10:54.480 -the container like the turnstile or the sleepqueue. - -0:10:54.480,0:10:59.410 -If I recall correctly, it's a forced amount -that does deadlocking at some point. - -0:10:59.410,0:11:01.290 -I'm not really sure - -0:11:01.290,0:11:04.190 -because I should have looked at it. - -0:11:04.190,0:11:08.810 -You can see that the revelant command here -is -ps - -0:11:08.810,0:11:11.220 -that DDB supports. - -0:11:11.220,0:11:14.220 - - -0:11:14.220,0:11:17.520 -Another important thing - -0:11:17.520,0:11:18.820 -is getting - -0:11:18.820,0:11:21.680 -the situation of all CPUs. - -0:11:21.680,0:11:24.100 -As you can see there, - -0:11:24.100,0:11:25.210 -usually - -0:11:25.210,0:11:31.600 -its because you can add some data structures corrupted - -0:11:31.600,0:11:34.320 -in the per-CPU datas. - -0:11:34.320,0:11:38.830 -That's a very common situation where you can get deadlocks, - -0:11:38.830,0:11:40.280 -because, for example, - -0:11:40.280,0:11:43.149 -leaving a corrupted LPD will lead - -0:11:48.750,0:11:55.290 -to you having a bigger massive breakage like -double-faults and things like that. Usually it's always a -good idea to look at all the CPUs involved in the system. - -0:11:55.290,0:11:57.310 -The command - -0:11:57.310,0:12:00.120 -is """"-show allpcpu"". - -0:12:00.120,0:12:04.960 - - -0:12:04.960,0:12:06.959 -This one - -0:12:06.959,0:12:12.009 -is a WITNESS specific command ""-show alllocks"" -and it's going to show all the locks, - -0:12:12.009,0:12:13.130 -in the system, - -0:12:13.130,0:12:15.070 -who is the owner, - -0:12:15.070,0:12:15.850 -like in this case, - -0:12:15.850,0:12:17.690 -a mount, - -0:12:17.690,0:12:21.270 -and the thread is this one, - -0:12:21.270,0:12:23.660 -what the lock is holding, - -0:12:23.660,0:12:24.970 -that's the address - -0:12:24.970,0:12:27.360 -and where it was acquired. - -0:12:27.360,0:12:31.140 -It gives you lines and file. - -0:12:31.140,0:12:32.770 - - -0:12:32.770,0:12:34.730 -Actually, - -0:12:34.730,0:12:37.620 -that's just possible - -0:12:37.620,0:12:40.859 -with WITNESS, because otherwise, - -0:12:40.859,0:12:44.410 -trying to keep the oldest information - -0:12:44.410,0:12:51.410 -in a general purpose kernel will be -very expensive for our logging subsystem. - -0:12:55.330,0:12:59.730 -Then, the most important thing is -the backtrace for any thread. - -0:12:59.730,0:13:01.150 - - -0:13:01.150,0:13:03.390 -It's going to show the backtrace - -0:13:03.390,0:13:05.700 -for all the threads. - -0:13:05.700,0:13:08.380 -the seas - -0:13:08.380,0:13:09.169 -In this case, - -0:13:09.169,0:13:13.010 -the thread with these addresses TID and PID - -0:13:13.010,0:13:15.350 -basically got sleeping - -0:13:15.350,0:13:17.140 -on a vnode. - -0:13:17.140,0:13:22.020 -You will see that the backend in this case is FFF - -0:13:22.020,0:13:24.000 -and - -0:13:24.000,0:13:25.729 -that's the context switching function, - -0:13:25.729,0:13:26.900 - - -0:13:26.900,0:13:32.220 -those are the sleepqueues of the containter -that is containing the threads, - -0:13:32.220,0:13:34.230 -and this one - -0:13:34.230,0:13:36.370 -is what it was going to do - -0:13:36.370,0:13:37.910 -before, - -0:13:37.910,0:13:41.810 -in this case mounting the filesystems. - -0:13:41.810,0:13:47.220 -You will see that on a complete feeding, - -0:13:47.220,0:13:50.310 -you will have a lot of these kinds of traces, - -0:13:50.310,0:13:53.079 -but they are very important - -0:13:53.079,0:13:59.270 -for the developers in order to understand -what is going on. - -0:13:59.270,0:14:02.590 -These ones are the locked vnodes - -0:14:02.590,0:14:05.830 -that are also very important when - -0:14:05.830,0:14:11.780 -a deadlock happens in VFS or in the buffer cache. - -0:14:11.780,0:14:13.700 -You will see - -0:14:13.700,0:14:18.580 -that these are the ref counts linked to vnodes, - -0:14:18.580,0:14:20.980 -they are specific - -0:14:20.980,0:14:23.850 -to some handling of the vnodes such as recycling, - -0:14:23.850,0:14:26.020 -and completely freeing. - -0:14:26.020,0:14:27.290 -That's the mount point - -0:14:27.290,0:14:28.770 -where the vnodes - -0:14:28.770,0:14:31.740 -belong - -0:14:31.740,0:14:33.930 -and - -0:14:33.930,0:14:35.910 -that is the backtrace - -0:14:35.910,0:14:39.760 -of what happened when the vnode - -0:14:39.760,0:14:41.060 -was acquired. - -0:14:41.060,0:14:46.600 -You can see that this comment also gives you information - -0:14:46.600,0:14:49.000 -about the lock linked to the vnode. - -0:14:49.000,0:14:51.640 -For example, it tells you that - -0:14:51.640,0:14:52.830 -the lock - -0:14:52.830,0:14:55.040 -is in exclusive mode - -0:14:55.040,0:14:56.280 -and - -0:14:56.280,0:14:59.320 -it does some shared waits - -0:14:59.320,0:15:03.260 -on its queues. - -0:15:03.260,0:15:04.090 -That's also - -0:15:04.090,0:15:06.370 -the node number. - -0:15:06.370,0:15:09.140 - - -0:15:09.140,0:15:13.880 -There is also other information you could receive -from the DDB linked to, for example, - -0:15:13.880,0:15:16.980 -the bugging deadlocks, - -0:15:16.980,0:15:18.100 -like sleep chains, - -0:15:18.100,0:15:19.310 -for any - -0:15:19.310,0:15:24.250 -wait channel, if you have the address - -0:15:24.250,0:15:27.150 -and for example, - -0:15:27.150,0:15:32.650 -you can also print the wall table of -the lock relations from WITNESS - -0:15:32.650,0:15:38.010 -but it's mostly never useful -because you should already know that. - -0:15:38.010,0:15:41.100 -So you will just need to know which is the one - -0:15:41.100,0:15:41.980 -that - -0:15:41.980,0:15:43.019 - - -0:15:43.019,0:15:47.750 -can give the trouble. - -0:15:47.750,0:15:51.640 - -0:15:51.640,0:15:53.980 -So if you are going to submit some problems - -0:15:53.980,0:15:57.180 -usually called NGAP that are probably -deadlocked in the kernel space, - -0:15:57.180,0:16:04.130 -that ones - -0:16:04.130,0:16:11.130 -are the information that we actually need. - -0:16:11.650,0:16:14.970 -Now, - -0:16:14.970,0:16:18.950 -it's very difficult to see very good reports -about deadlocks, - -0:16:18.950,0:16:20.020 -so - -0:16:20.020,0:16:22.569 -I think that - -0:16:22.569,0:16:25.670 -it is a very good thing to talk about it. - -0:16:25.670,0:16:31.420 -Along with the WITNESS, we have another -important mechanism that could help us with deadlocks - -0:16:31.420,0:16:34.620 -and it's called KTR. - -0:16:34.620,0:16:36.100 -KTR is - -0:16:36.100,0:16:40.630 -basically a logger, a kernel logger, of events. - -0:16:40.630,0:16:42.550 -It's - -0:16:42.550,0:16:45.090 -highly configurable, - -0:16:45.090,0:16:48.280 -as you can, for example, -handle different classes of events. - -0:16:48.280,0:16:53.940 -In FreeBSD we have - -0:16:53.940,0:16:55.130 -classes linked to the scheduler, - -0:16:55.130,0:16:56.290 -to the locking, - -0:16:56.290,0:16:58.520 -to the VFS, to callouts, - -0:16:58.520,0:17:05.030 -and they are all packed in the same class. - -0:17:05.030,0:17:08.880 -So they can be selectively enabled or masked. - -0:17:08.880,0:17:10.030 -For example - -0:17:10.030,0:17:12.190 -the difference is that you can -enable several classes, - -0:17:12.190,0:17:13.610 -like - -0:17:13.610,0:17:16.470 -the ten classes of the KTR - -0:17:16.470,0:17:21.000 -and then you are just interested in three of them -even if all ten of them are - -0:17:21.000,0:17:23.030 -actually tracked. - -0:17:23.030,0:17:24.240 -You will are just going to - -0:17:24.240,0:17:26.839 -see three of them. - -0:17:26.839,0:17:28.439 -Um, - -0:17:28.439,0:17:31.940 -an important thing is that KTR - -0:17:31.940,0:17:34.520 -doesn't handle, - -0:17:34.520,0:17:37.770 -for example - -0:17:37.770,0:17:38.300 -pointers, - -0:17:38.300,0:17:40.340 -doesn't store the information - -0:17:40.340,0:17:45.450 -passed to init, it just stores the pointer - -0:17:45.450,0:17:46.880 -and not the information, - -0:17:46.880,0:17:48.390 -for example, - -0:17:48.390,0:17:50.160 -the strings, - -0:17:50.160,0:17:55.000 -it doesn't make copies, you need to just pass -the pointers - -0:17:55.000,0:17:57.610 -which are persistent in the memory. - -0:17:57.610,0:18:00.340 -Otherwise, - -0:18:00.340,0:18:05.500 -KTR won't be able to access them. - -0:18:05.500,0:18:09.760 -The good thing about KTR is that - -0:18:09.760,0:18:11.600 -you can also look at it from the user space - -0:18:11.600,0:18:13.430 -through the ktrdump interface. - -0:18:13.430,0:18:15.820 - - -0:18:15.820,0:18:17.030 - - -0:18:17.030,0:18:19.669 -Why is that important for locking? - -0:18:19.669,0:18:20.279 -Because - -0:18:20.279,0:18:21.090 -after, - -0:18:21.090,0:18:24.350 -it can tell you what happed, - -0:18:24.350,0:18:27.260 -on which CPU branches, - -0:18:27.260,0:18:30.020 -and the order it happened in. - -0:18:30.020,0:18:34.580 -This is very important when you're -going to track down for example traces, - -0:18:34.580,0:18:37.720 -when you are not sure about the order of operations and - -0:18:37.720,0:18:44.710 -how they happened. It's going to tell you. - -0:18:44.710,0:18:46.290 -For example - -0:18:46.290,0:18:48.650 -that is - -0:18:48.650,0:18:51.090 -a typical trace of KTR, - -0:18:51.090,0:18:52.410 -where you have - -0:18:52.410,0:18:56.890 -the CPU where the event happened, thats the index, - -0:18:56.890,0:18:58.620 -that's a timestamp, - -0:18:58.620,0:19:03.400 -I think it's retrieved directly from the TSC, -but i'm actually not sure. - -0:19:03.400,0:19:04.889 -In this case, - -0:19:04.889,0:19:10.210 -i was tracking down the scheduler class, - -0:19:10.210,0:19:16.100 -so I was interested mainly in scheduler -workloads and I could see - -0:19:16.100,0:19:19.210 -for example - -0:19:19.210,0:19:21.100 -that - -0:19:21.100,0:19:24.870 -a contact switch happened - -0:19:24.870,0:19:26.919 -scheduling - -0:19:26.919,0:19:28.010 -the idle CPU - -0:19:28.010,0:19:30.270 -and then other information, - -0:19:30.270,0:19:34.370 -like for example the load of the CPU 1 - -0:19:34.370,0:19:37.190 -and scan priority boost, - -0:19:37.190,0:19:38.870 -like - -0:19:38.870,0:19:40.310 -this one - -0:19:40.310,0:19:46.420 -and other things. - -0:19:46.420,0:19:48.770 - - -0:19:48.770,0:19:50.820 -You can enable - -0:19:50.820,0:19:55.280 -the option KTR, but you must handle it carefully. - -0:19:55.280,0:19:57.130 -For example - -0:19:57.130,0:20:01.990 -use an option i didn't include here, - -0:20:01.990,0:20:07.410 -which is the length of the buffer it uses -to store the pointers in, it's called KTR_ENTRIES, - -0:20:07.410,0:20:08.360 -and you should specify - -0:20:08.360,0:20:09.590 -enough entries - -0:20:09.590,0:20:11.500 -to have a reliable tracking. - -0:20:11.500,0:20:13.580 - - -0:20:13.580,0:20:16.780 -For example, if you are going to track a lot of events, - -0:20:16.780,0:20:19.100 -a short queue is not an option, - -0:20:19.100,0:20:22.120 -because you are going to lose some information. - -0:20:22.120,0:20:26.780 -A typical queue is of length 2K (2 kilobytes) - -0:20:26.780,0:20:29.710 -of entries. - -0:20:29.710,0:20:32.520 -These other options - -0:20:32.520,0:20:36.190 -let you compile some classes, - -0:20:36.190,0:20:38.370 -or mask them, - -0:20:38.370,0:20:43.770 -or even mask the CPU. - -0:20:43.770,0:20:46.289 -If you have a big SMP environment, - -0:20:46.289,0:20:50.160 -so that you can selectively enable some of them. - -0:20:50.160,0:20:54.700 -For example, this is very good for -tracking down traces in the sleeping queue. - -0:20:54.700,0:21:01.700 -You can find referrals here. - -0:21:02.770,0:21:04.820 - - -0:21:04.820,0:21:06.220 -So, - -0:21:06.220,0:21:09.020 -I will spend the last time of the speech -speaking about possible improvements - -0:21:09.020,0:21:10.500 -to - -0:21:10.500,0:21:15.670 -our locking system, which is not very bad. - -0:21:15.670,0:21:16.500 -Well, - -0:21:16.500,0:21:21.750 -I think that actually our locking system -is pretty complete, - -0:21:21.750,0:21:26.919 -but it's also pretty confusing for newcomers, -it's not widely documented. - -0:21:26.919,0:21:32.280 -so maybe we will spend a good amount of time -on documentation. - -0:21:32.280,0:21:32.799 -As you can see, - -0:21:32.799,0:21:38.120 -even in this presentation, which is not very huge, - -0:21:38.120,0:21:42.540 -there are many things to say - -0:21:42.540,0:21:46.700 -and that are not very simple to understand in particular - -0:21:46.700,0:21:48.240 -for people - -0:21:48.240,0:21:50.280 -who just need to do simple tasks. - -0:21:50.280,0:21:56.660 -For example, I saw a lot of guys coming from Linux World - -0:21:56.660,0:22:00.620 -who wanted to actually use spinlocks for time. - -0:22:00.620,0:22:05.720 -It's obvious they are missing something from our -architecture. - -0:22:05.720,0:22:07.250 -From - -0:22:07.250,0:22:11.010 -just a technical point of view, - -0:22:11.010,0:22:14.530 -it would be very good if we could remove - -0:22:14.530,0:22:20.440 -legacy support and overriding support. For example, - -0:22:20.440,0:22:22.900 -we have lockmgr and sxlog - -0:22:22.900,0:22:27.990 -which are both read/write locks and -are both servered by sleep queues. - -0:22:27.990,0:22:31.800 -They have some differences, obviously, - -0:22:31.800,0:22:32.660 -but, mainly, - -0:22:32.660,0:22:38.920 -we could manage the missing bits and -just use one of the two interfaces. - -0:22:38.920,0:22:42.059 - - -0:22:42.059,0:22:43.920 -In the same way, as i told you before, - -0:22:43.920,0:22:47.340 -the sleeping point, true-end sleep, -read/write sleep and sxsleep - -0:22:47.340,0:22:50.350 -should probably be managed with cond vars - -0:22:50.350,0:22:52.930 -and superdoff our kernel - -0:22:52.930,0:22:55.070 -and we should probably drop sema, - -0:22:55.070,0:23:00.290 -because it is obsolete, and can be -replaced by condvars and mutex. - -0:23:00.290,0:23:02.620 - - -0:23:02.620,0:23:03.830 -From - -0:23:03.830,0:23:05.639 -a strong technical point of view, - -0:23:05.639,0:23:07.350 -as you can see, - -0:23:07.350,0:23:09.680 -we spent a lot of time - -0:23:09.680,0:23:12.109 -on optimization on our blocking primitives, - -0:23:12.109,0:23:16.770 -but very few on our spinning primitives. - -0:23:16.770,0:23:21.810 -That's because obviously blocking -primitives are our first choice, - -0:23:21.810,0:23:22.700 -but - -0:23:22.700,0:23:25.210 -spinlocks could be improved too, - -0:23:25.210,0:23:30.130 -using technics such as the so-called -back-off algorithms and - -0:23:30.130,0:23:31.429 -queued spinlocks. - -0:23:31.429,0:23:33.560 -I have a patch for that but - -0:23:33.560,0:23:35.499 -I would really need - -0:23:35.499,0:23:39.999 -to test it and tune it on a big SMP environment. - -0:23:39.999,0:23:42.270 -I don't think - -0:23:42.270,0:23:43.540 -that, for now, - -0:23:43.540,0:23:45.730 -they can handle such an environment. - -0:23:45.730,0:23:47.820 -In addition, - -0:23:47.820,0:23:51.500 -I'm not sure you are familiar with -queued spinlocks algorithms. - -0:23:51.500,0:23:55.580 -Basically a back-off algorithms try - -0:23:55.580,0:24:02.250 -to reduce the cache pressure on the - -0:24:02.250,0:24:06.090 -threads contesting on the lock - -0:24:06.090,0:24:08.270 -by giving a time. - -0:24:08.270,0:24:10.290 -Instead, the other one - -0:24:10.290,0:24:16.180 -uses spinning on a local variable -which is not shared by the threads. - -0:24:16.180,0:24:18.030 -and the time spent - -0:24:18.030,0:24:20.140 -on that - -0:24:20.140,0:24:22.780 -local variable increases - -0:24:22.780,0:24:25.440 - - -0:24:25.440,0:24:28.320 - - -0:24:28.320,0:24:31.780 -with the passing of time. - -0:24:31.780,0:24:35.950 -Another interesting thing would be benchmarking - -0:24:35.950,0:24:37.930 -different wake-up algorithms for blocking primitives. - -0:24:37.930,0:24:42.390 -We have an algorithm that has proven to be - -0:24:42.390,0:24:42.910 - - -0:24:42.910,0:24:45.200 -quite good - -0:24:45.200,0:24:47.440 -but - -0:24:47.440,0:24:51.330 -we are not confronted with other kind of -wake-ups that could have - -0:24:51.330,0:24:56.450 -a higher overhead but could give time improvements - -0:24:56.450,0:24:59.760 -on a big SMP environment. - -0:24:59.760,0:25:02.500 - - -0:25:02.500,0:25:07.000 -Another thing that would be very interesting -to fix is the priority inversion problem - -0:25:07.000,0:25:08.670 -in the case of read locks. - -0:25:08.670,0:25:09.790 -There is an approach - -0:25:09.790,0:25:13.950 -called dominant record implemented in FreeBSD, - -0:25:13.950,0:25:16.360 -but i saw that it's pretty - -0:25:16.360,0:25:22.160 -slow for our fastpack. In FreeBSD all our locking -primitives are broken and add to path - -0:25:22.160,0:25:23.290 -where - -0:25:23.290,0:25:25.820 -the fastpack is - -0:25:25.820,0:25:28.620 -is often just a single atomic operation, - -0:25:28.620,0:25:30.010 -and - -0:25:30.010,0:25:33.770 -if it fails, - -0:25:33.770,0:25:36.900 -it falls down and the art pattern tries to do -all the end-work of the sleep queues. - -0:25:36.900,0:25:40.210 -in this case the owner of record technic -was going to make the fastpack too simple - -0:25:40.210,0:25:42.640 -Basically, - -0:25:42.640,0:25:46.310 -it just considers - -0:25:46.310,0:25:50.940 -the readets as one - -0:25:50.940,0:25:55.380 -and can switch in and out. - -0:25:55.380,0:26:02.210 -And it practically lands the priority to this -owner of record which does it's right log. - -0:26:02.210,0:26:06.900 - - -0:26:06.900,0:26:11.900 -Another important thing obviously is improving locking - -0:26:11.900,0:26:13.420 -where the - -0:26:13.420,0:26:15.649 -optimum approach is not chosen. - -0:26:15.649,0:26:21.039 -I see a lot of parts in which the -primitives chosen by the developers - -0:26:21.039,0:26:23.320 -are not the most suitable ones - -0:26:23.320,0:26:27.690 -and we should switch to the right one. - -0:26:27.690,0:26:33.120 -Like, for example, in discriminated usage -of the spinlocks - -0:26:33.120,0:26:34.990 -or the blocking primitives, - -0:26:34.990,0:26:40.430 -just to handle cases like that, -like the one we saw before with the malloc command, - -0:26:40.430,0:26:44.070 -that needs to sleep. - -0:26:44.070,0:26:44.320 -Any questions? |