From 8b8b0168e448460cef83bd8e7632467c30b2cec2 Mon Sep 17 00:00:00 2001
From: Murray Stokely
Date: Mon, 8 Feb 2010 21:53:03 +0000
Subject: Add improved human edited transcript from Amazon Mechanical Turk
 and YouTube Automatic Captions.

Sponsored by:	FreeBSD Foundation
---
 .../2009/asiabsdcon/rao-kernellocking-2.sbv        | 1650 ++++++++++++++++++++
 1 file changed, 1650 insertions(+)
 create mode 100644 en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv

diff --git a/en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv b/en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv
new file mode 100644
index 0000000000..16f08a914a
--- /dev/null
+++ b/en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv
@@ -0,0 +1,1650 @@
+0:00:00.530,0:00:01.590
+So, basically,

0:00:04.590,0:00:10.029
+we are going to look mainly in this second part
+at how to

0:00:10.029,0:00:11.519
+handle some

0:00:11.519,0:00:12.560
+locking problems

0:00:12.560,0:00:17.910
+that can arise in the kernel.

0:00:17.910,0:00:24.410
+Here are described two kinds of problems
+you can get with locks, which are pretty common.

0:00:24.410,0:00:27.859
+The first one is called Lock Order Reversal,

0:00:27.859,0:00:30.140
+when you have, for example, a thread A

0:00:30.140,0:00:32.340
+which owns

0:00:32.340,0:00:35.870
+a lock called, for example, L1

0:00:35.870,0:00:37.920
+and another thread B

0:00:37.920,0:00:40.070
+which owns another lock, L2.

0:00:40.070,0:00:43.150
+Then thread A tries to

0:00:43.150,0:00:44.730
+Right, it's wrong.

0:00:44.730,0:00:46.220
+The slide is wrong.

0:00:46.220,0:00:48.020
+The slide is wrong.
+ +0:00:48.020,0:00:51.910 +Thread A tries to acquire L2, + +0:00:51.910,0:00:55.670 +but obviously it sleeps because +it is owned by thread B + +0:00:55.670,0:00:58.530 +and then thread B tries to acquire + +0:00:58.530,0:01:00.030 +the lock L1 + +0:01:00.030,0:01:02.240 +and it sleeps because + +0:01:02.240,0:01:06.440 +it's owned by thread B. + +0:01:06.440,0:01:11.410 +This is a situation that never ends and +it's pretty much well documented in Cormen + +0:01:11.410,0:01:16.150 +and in traditional literature. + +0:01:16.150,0:01:18.650 +It's a classical deadlock. + +0:01:18.650,0:01:19.960 +This means that, + +0:01:19.960,0:01:21.950 +as everybody who + +0:01:21.950,0:01:24.940 +has ever read a book + +0:01:24.940,0:01:25.899 +about an operating system, + +0:01:25.899,0:01:30.420 +knows that + +0:01:30.420,0:01:32.910 +locks should maintain + +0:01:32.910,0:01:34.319 +an ordering regard of each other. + +0:01:34.319,0:01:38.859 +That's not very simple when + +0:01:38.859,0:01:40.100 +you speak about a kernel. + +0:01:40.100,0:01:44.850 +From this point of view, the fact +that there are 3 kinds of classes of locks + +0:01:44.850,0:01:49.180 +is going to count because you can +never mix two different kinds of locks + +0:01:49.180,0:01:50.680 +for example + +0:01:50.680,0:01:51.610 +a spin lock + +0:01:51.610,0:01:53.770 +and a mutex. + +0:01:53.770,0:01:59.120 +You can mix in this way. + +0:01:59.120,0:02:01.720 + + +0:02:01.720,0:02:07.060 +So, you will see that this kind +of deadlocks are possible + +0:02:07.060,0:02:09.290 +only in the same class of locks, + +0:02:09.290,0:02:13.019 +like for example 2 mutex or 2 spin mutex, + +0:02:13.019,0:02:14.569 +or such. + +0:02:14.569,0:02:16.090 +Um... + +0:02:16.090,0:02:17.409 +Also + +0:02:17.409,0:02:19.949 +Even if it's not very well documented, + +0:02:19.949,0:02:22.880 +for example, spin locks, + +0:02:22.880,0:02:26.599 +in previous deep, as a way to +identify such kind of deadlocks. 
+ +0:02:26.599,0:02:27.619 +And it's pretty much implemented... + +0:02:27.619,0:02:29.709 +a very much in it would + +0:02:29.709,0:02:32.449 +It's a feature enabled in the code. + +0:02:32.449,0:02:34.949 +They just count how many times they are spinning + +0:02:34.949,0:02:36.010 +and + +0:02:36.010,0:02:39.169 +if it exceeds + +0:02:39.169,0:02:41.379 +an exaggerated result, + +0:02:41.379,0:02:47.870 +it means that they are probable +under a deadlock and the system panics. + +0:02:47.870,0:02:52.489 +Another common problem about locking +is when you have + +0:02:52.489,0:02:55.189 +a wait channel + +0:02:55.189,0:02:56.659 +like a condition variable (cond var), + +0:02:56.659,0:02:58.849 +which it is protected by a lock. + +0:02:58.849,0:03:03.629 +If there are some ratio chances that +this condition variable encounters, + +0:03:03.629,0:03:05.489 +for example, + +0:03:05.489,0:03:07.219 +when cond var are chased + +0:03:07.219,0:03:12.649 +with some preliminary conditions, +like a waiter's counter + +0:03:12.649,0:03:16.359 +that has to be updated + +0:03:16.359,0:03:21.249 +anytime that any thread tries to sleep +on the counter or flags to be set. + +0:03:21.249,0:03:24.909 +If these conditions are not protected +by the same lock, + +0:03:24.909,0:03:30.569 +you can end up having some threads +sleeping on this wait channel + +0:03:30.569,0:03:34.589 +and nobody is going to wake up them again. + +0:03:34.589,0:03:37.629 +This is usually called missed wakeup + +0:03:37.629,0:03:41.249 +and it's a pretty common mistake + +0:03:41.249,0:03:44.799 +that leads to a deadlock. + +0:03:44.799,0:03:46.719 +The problem is that + +0:03:46.719,0:03:52.109 +it's very difficult to differentiate +between missed wakeup and, + +0:03:52.109,0:03:53.480 +for example, + +0:03:53.480,0:03:56.189 +forever sleep + +0:03:56.189,0:03:58.419 +of a thread + +0:03:58.419,0:04:01.859 +that is not likely to be awaken. 
+ +0:04:01.859,0:04:07.109 +So these kind of deadlocks are +very difficult to be discovered + +0:04:07.109,0:04:11.669 +and will require some bit of +work that we will see right now. + +0:04:11.669,0:04:14.509 +For example, + +0:04:14.509,0:04:15.270 +using + +0:04:15.270,0:04:16.219 +some + +0:04:16.219,0:04:18.179 +kernel systems + +0:04:18.179,0:04:22.240 +and some things integrated into the debugger. + +0:04:22.240,0:04:22.979 +Um, + +0:04:22.979,0:04:25.520 +In FreeBSD, + +0:04:25.520,0:04:29.859 +we have quite a lot of good mechanisms +we can use to cope + +0:04:29.859,0:04:32.080 +with kernel problems. + +0:04:32.080,0:04:36.539 +The first one (and the most important) +is called witness. + +0:04:36.539,0:04:39.169 +It was introduced + +0:04:39.169,0:04:42.080 +in the context of SMPng + +0:04:42.080,0:04:44.979 +and has been written in the recent past, + +0:04:44.979,0:04:47.919 +mainly by a contribution of + +0:04:47.919,0:04:51.360 +Isilon systems. + +0:04:51.360,0:04:52.270 +They contributed back then + +0:04:52.270,0:04:54.989 +to the writing of witness. + +0:04:54.989,0:04:57.389 +This subsystem is very important + +0:04:57.389,0:05:02.730 +because it tracks down exactly every order + +0:05:02.730,0:05:03.949 +of the locks + +0:05:03.949,0:05:07.810 +So that, if there is an ordering violation like a LOR, + +0:05:07.810,0:05:09.439 +it's going to + +0:05:09.439,0:05:12.150 +tell the system. + +0:05:12.150,0:05:18.029 +You can even set it to directly panic if +it finds some deadlocks, + +0:05:18.029,0:05:19.879 +or only + +0:05:19.879,0:05:22.729 +some possible deadlocks. + +0:05:22.729,0:05:27.260 +Another important feature with it is + +0:05:27.260,0:05:32.569 +that it can keep track of read/write locks. + +0:05:32.569,0:05:33.690 +Doing that, + +0:05:33.690,0:05:36.539 +we can identify + +0:05:36.539,0:05:38.419 +deadlocks possible + +0:05:38.419,0:05:39.500 +even + +0:05:39.500,0:05:40.690 +on the + +0:05:40.690,0:05:45.529 +reader's path. 
+ +0:05:45.529,0:05:49.609 +We could say that witness is pretty big, + +0:05:49.609,0:05:52.289 +so activating it + +0:05:52.289,0:05:55.039 +in your production system is never an option. + +0:05:55.039,0:05:59.929 +It's mainly used when you are going to +develop a new feature in the kernel + +0:05:59.929,0:06:02.110 +and you are going to test it heavily, + +0:06:02.110,0:06:05.479 +in particular if it has + +0:06:05.479,0:06:06.819 +some + +0:06:06.819,0:06:10.509 +relation to locking. + +0:06:10.509,0:06:13.089 +Uh, + +0:06:13.089,0:06:17.840 +We could also tell that with the new code +provided by Isilon and Nokia, + +0:06:17.840,0:06:19.150 +basically + +0:06:19.150,0:06:21.689 +the orad + +0:06:21.689,0:06:25.479 +offered by witness is greatly reduced to about + +0:06:25.479,0:06:27.699 +the 10th part of + +0:06:27.699,0:06:30.240 +what we had before. + +0:06:30.240,0:06:36.150 +Witness is very good at tracking LOR, + +0:06:36.150,0:06:37.849 +but + +0:06:37.849,0:06:40.009 +it's not very good at, for example, + +0:06:40.009,0:06:42.449 +trying to + +0:06:42.449,0:06:44.060 +help you + +0:06:44.060,0:06:47.479 +in the case of lost wakeups, + +0:06:47.479,0:06:49.519 +because of its nature, + +0:06:49.519,0:06:52.090 +mainly. + +0:06:52.090,0:06:55.889 +It has a very good integration with the DDB debugger + +0:06:55.889,0:06:57.740 +and + +0:06:57.740,0:06:58.879 +basically + +0:06:58.879,0:07:04.159 +and basically, it's in the 8th release, +we have new features + +0:07:04.159,0:07:05.759 +that help you + +0:07:05.759,0:07:08.389 +print out backtraces + +0:07:08.389,0:07:11.150 +of the contesting threads on the LORs + +0:07:11.150,0:07:16.039 +and their orderings + +0:07:16.039,0:07:17.549 +and + +0:07:17.549,0:07:23.550 +and shows some graphs of the relations. +Even from the user space, + +0:07:23.550,0:07:28.550 +you don't have to go into the kernel +degubber to look at it's output. + +0:07:28.550,0:07:35.550 +Well + +0:07:35.620,0:07:37.380 +... 
+ +0:07:37.380,0:07:42.250 +Well, I see that sometimes when +they are released there is a confusion + +0:07:42.250,0:07:44.250 +about the information reports + +0:07:44.250,0:07:48.440 +in regard of deadlocks conditions and what help + +0:07:48.440,0:07:50.020 +users can provide to developers + +0:07:50.020,0:07:52.039 +about that. + +0:07:52.039,0:07:54.020 +So we are going to see + +0:07:54.020,0:07:58.700 +all the relevant information +when a deadlock + +0:07:58.700,0:07:59.590 +is in the kernel. + +0:07:59.590,0:08:02.490 +... + +0:08:02.490,0:08:03.389 +Usually, + +0:08:03.389,0:08:07.939 +if you want to find a deadlock +that's happening in the kernel + +0:08:07.939,0:08:10.909 +your first line of analysis start from the DDB + +0:08:10.909,0:08:13.919 +instead of a post-mortem analysis, + +0:08:13.919,0:08:16.839 +which is even more important, + +0:08:16.839,0:08:22.330 +but using DDB you will get more +processes and better information. + +0:08:22.330,0:08:24.970 +Uh, + +0:08:24.970,0:08:28.499 +The most important unit in order to find the deadlock + +0:08:28.499,0:08:34.389 +are the LORs reported by witness in order +to see if there is something strange + +0:08:34.389,0:08:36.690 +that can be happening. + +0:08:36.690,0:08:41.700 +You want to know the state of all the threads +that are running on the system that is deadlocking. + +0:08:41.700,0:08:42.900 +... + +0:08:42.900,0:08:47.050 +You can see that you're deadlocking if you see that + +0:08:47.050,0:08:48.070 +on the runqueue + +0:08:48.070,0:08:48.540 +there are + +0:08:48.540,0:08:51.850 +just idle threads. + +0:08:51.850,0:08:56.640 +Because it's like saying that +runqueues are complete flushed + +0:08:56.640,0:09:02.450 +and you have all the threads sleeping +in their own containers. 
+ +0:09:02.450,0:09:07.850 +You need to know which are exactly locks +that are acquired + +0:09:07.850,0:09:11.270 +in the system + +0:09:11.270,0:09:15.570 +and that's something that witness provides, + +0:09:15.570,0:09:20.720 +and the very important things is +to know why the threads are stopping. + +0:09:20.720,0:09:24.250 +So one on the most important things is +retrieving what the threads were doing + +0:09:24.250,0:09:26.320 +when + +0:09:26.320,0:09:28.960 +they were put asleep. + +0:09:28.960,0:09:30.070 +... + +0:09:30.070,0:09:33.009 +The backtraces of all the threads involved + +0:09:33.009,0:09:37.130 +are so printed out in order to identify deadlocks. + +0:09:37.130,0:09:38.589 +In the case that + +0:09:38.589,0:09:42.830 +buffered cache and VFS are + +0:09:42.830,0:09:45.910 +probably parts of the deadlocking + +0:09:45.910,0:09:50.790 +you should also print out + +0:09:50.790,0:09:53.420 +the info about vnodes + +0:09:53.420,0:09:58.250 +and what we're interested in is which vnodes are called, + +0:09:58.250,0:09:59.320 +which + +0:09:59.320,0:10:01.370 +are actually referenced + +0:10:01.370,0:10:03.530 +and + +0:10:03.530,0:10:10.530 +which way they were called. + +0:10:11.030,0:10:13.380 +So + +0:10:13.380,0:10:15.770 +this is an example + +0:10:15.770,0:10:17.430 +of the + +0:10:17.430,0:10:18.880 +thread states + +0:10:18.880,0:10:20.760 +in the case of a deadlock. + +0:10:20.760,0:10:27.480 +This is an example of a real deadlock + +0:10:27.480,0:10:28.900 +but you can see + +0:10:28.900,0:10:31.890 +that + +0:10:31.890,0:10:35.650 +this is not totally complete. + +0:10:35.650,0:10:38.450 +But you can see that all the threads are sleeping. + +0:10:38.450,0:10:39.870 +Uh... 
+ +0:10:39.870,0:10:43.580 +And this one is the message + +0:10:43.580,0:10:44.790 +used by the wait channel + +0:10:44.790,0:10:47.550 +on which they're sleeping on + +0:10:47.550,0:10:48.710 +or used by + +0:10:48.710,0:10:54.480 +the container like the turnsile or the sleep queue. + +0:10:54.480,0:10:59.410 +If I recall correctly, it's a forced amount +that does deadlocking at some point. + +0:10:59.410,0:11:01.290 +I'm not really sure + +0:11:01.290,0:11:04.190 +because I have to take a look at it. + +0:11:04.190,0:11:08.810 +You can see that the revelant command here +is -ps + +0:11:08.810,0:11:11.220 +that DDB supports. + +0:11:11.220,0:11:14.220 +Um, + +0:11:14.220,0:11:17.520 +Another important thing + +0:11:17.520,0:11:18.820 +is getting + +0:11:18.820,0:11:21.680 +the situation of all CPUs. + +0:11:21.680,0:11:24.100 +As you can see there, + +0:11:24.100,0:11:25.210 +usually + +0:11:25.210,0:11:31.600 +is because you can add some data structures corrupted + +0:11:31.600,0:11:34.320 +in the per-CPU datas. + +0:11:34.320,0:11:38.830 +That's a very common situation when you get deadlocks, + +0:11:38.830,0:11:40.280 +because, for example, + +0:11:40.280,0:11:43.149 +leaving a corrupted LPD will lead + +0:11:43.149,0:11:48.750 +I loved you too much review shellacking double +falls and things like that about that + +0:11:48.750,0:11:55.290 +to you having a bigger massive breakage like +double-faults. In general. it's a good idea to +look at all the CPUs involved in the system. 
+ +0:11:55.290,0:11:57.310 +The command + +0:11:57.310,0:12:00.120 +is ""show allpcpu"" + +0:12:00.120,0:12:04.960 +Um, + +0:12:04.960,0:12:06.959 +This one + +0:12:06.959,0:12:12.009 +this one is a witness specific command -show alllocks +and it's going to show all the locks, + +0:12:12.009,0:12:13.130 +in the system, + +0:12:13.130,0:12:15.070 +who is the owner, + +0:12:15.070,0:12:15.850 +like in this case, + +0:12:15.850,0:12:17.690 +a mount, + +0:12:17.690,0:12:21.270 +and the thread is this one, + +0:12:21.270,0:12:23.660 +what lock is holding, + +0:12:23.660,0:12:24.970 +that's the address + +0:12:24.970,0:12:27.360 +and where it was acquired. + +0:12:27.360,0:12:31.140 +It gives you lines and file. + +0:12:31.140,0:12:32.770 +... + +0:12:32.770,0:12:34.730 +Actually, + +0:12:34.730,0:12:37.620 +that's just possible + +0:12:37.620,0:12:40.859 +with witness, because otherwise, + +0:12:40.859,0:12:44.410 +trying to keep the oldest information + +0:12:44.410,0:12:51.410 +in a general purpose kernel will be +very expensive for our logging subsystem. + +0:12:55.330,0:12:59.730 +Then, the most important thing is +the backtrace for any thread. + +0:12:59.730,0:13:01.150 +... + +0:13:01.150,0:13:03.390 +It's going to show the backtrace + +0:13:03.390,0:13:05.700 +for old threads. + +0:13:05.700,0:13:08.380 +the seas + +0:13:08.380,0:13:09.169 +In this case, + +0:13:09.169,0:13:13.010 +the thread with these addresses TID and PID + +0:13:13.010,0:13:15.350 +basically got sleeping + +0:13:15.350,0:13:17.140 +on a pnode. 
+ +0:13:17.140,0:13:22.020 +You will see that the backend in this case is FFF + +0:13:22.020,0:13:24.000 +and + +0:13:24.000,0:13:25.729 +that's the context switching function, + +0:13:25.729,0:13:26.900 +next + +0:13:26.900,0:13:32.220 +those are the sleepqueues of the containter +that is containing the threads, + +0:13:32.220,0:13:34.230 +and this one + +0:13:34.230,0:13:36.370 +is what it was going to do + +0:13:36.370,0:13:37.910 +before, + +0:13:37.910,0:13:41.810 +in this case mounting the filesystems. + +0:13:41.810,0:13:47.220 +You will see that on a complete feeding, + +0:13:47.220,0:13:50.310 +you will have a lot of these kinds of traces, + +0:13:50.310,0:13:53.079 +but they are very important + +0:13:53.079,0:13:59.270 +so as developers to understand what is going on. + +0:13:59.270,0:14:02.590 +This ones are the locked vnodes + +0:14:02.590,0:14:05.830 +that are also very important when + +0:14:05.830,0:14:11.780 +a deadlock happens in VFS or in the buffer cache. + +0:14:11.780,0:14:13.700 +You will see + +0:14:13.700,0:14:18.580 +that these are the ref counts linked to vnodes, + +0:14:18.580,0:14:20.980 +they are specific + +0:14:20.980,0:14:23.850 +to some handling of the vnodes such as recycling, + +0:14:23.850,0:14:26.020 +and completely freeing + +0:14:26.020,0:14:27.290 +that's the mount point + +0:14:27.290,0:14:28.770 +where the vnodes + +0:14:28.770,0:14:31.740 +belong + +0:14:31.740,0:14:33.930 +and + +0:14:33.930,0:14:35.910 +that is the backtrace + +0:14:35.910,0:14:39.760 +of what happened when the vnode + +0:14:39.760,0:14:41.060 +was acquired. + +0:14:41.060,0:14:46.600 +You can see that this comment also gives you information + +0:14:46.600,0:14:49.000 +about the lock linked to the vnode. 
+ +0:14:49.000,0:14:51.640 +For example, it tells you that + +0:14:51.640,0:14:52.830 +the lock + +0:14:52.830,0:14:55.040 +the lock is in exclusive mode + +0:14:55.040,0:14:56.280 +and + +0:14:56.280,0:14:59.320 +it does some shared waits + +0:14:59.320,0:15:03.260 +on its queues. + +0:15:03.260,0:15:04.090 +That's also + +0:15:04.090,0:15:06.370 +the node number + +0:15:06.370,0:15:09.140 +... + +0:15:09.140,0:15:13.880 +there are also under information you could receive +from the DDB linked to, for example, + +0:15:13.880,0:15:16.980 +the bugging deadlocks, + +0:15:16.980,0:15:18.100 +like sleep chains, + +0:15:18.100,0:15:19.310 +for any + +0:15:19.310,0:15:24.250 +wait channel if you have the address + +0:15:24.250,0:15:27.150 +and, for example, + +0:15:27.150,0:15:32.650 +you can also print the wall table of +the lock relations from witness + +0:15:32.650,0:15:38.010 +but it's mostly never useful +because you already know that. + +0:15:38.010,0:15:41.100 +So you will just need to know which is the one + +0:15:41.100,0:15:41.980 +that + +0:15:41.980,0:15:43.019 +um, + +0:15:43.019,0:15:47.750 +can give trouble. + +0:15:47.750,0:15:51.640 +if you're going to say I mean some problem +is a problem + +0:15:51.640,0:15:53.980 +So if you are going to submit some problems + +0:15:53.980,0:15:57.180 +usually called NGAP that are probably +deadlocked in the kernel space, + +0:15:57.180,0:16:04.130 +that ones + +0:16:04.130,0:16:11.130 +are the information that we actually need. + +0:16:11.650,0:16:14.970 +Now, + +0:16:14.970,0:16:18.950 +it's very difficult to see very good reports +about deadlocks, + +0:16:18.950,0:16:20.020 +so + +0:16:20.020,0:16:22.569 +I think that + +0:16:22.569,0:16:25.670 +it is a very good thing to talk about it. + +0:16:25.670,0:16:31.420 +Along with the witness, we have another +mechanism that could help us, + +0:16:31.420,0:16:34.620 +and it's called KTR. 
+ +0:16:34.620,0:16:36.100 +KTR is + +0:16:36.100,0:16:40.630 +basically a logger of events, + +0:16:40.630,0:16:42.550 +it's + +0:16:42.550,0:16:45.090 +highly configurable, + +0:16:45.090,0:16:48.280 +as you can, for example, +handle different classes of events, + +0:16:48.280,0:16:53.940 +In FreeBSD we have + +0:16:53.940,0:16:55.130 +classes linked to the scheduler, + +0:16:55.130,0:16:56.290 +to the locking, + +0:16:56.290,0:16:58.520 +to the VFS, to callouts, + +0:16:58.520,0:17:05.030 +and they are all packed in the same class. + +0:17:05.030,0:17:08.880 +So they can be selectively enabled or masked. + +0:17:08.880,0:17:10.030 +For example + +0:17:10.030,0:17:12.190 +the difference is that you can +enable several classes, + +0:17:12.190,0:17:13.610 +like + +0:17:13.610,0:17:16.470 +the time class of the KTR + +0:17:16.470,0:17:21.000 +and then you are just interested in three of them +even if all ten of them are + +0:17:21.000,0:17:23.030 +actually tracked. + +0:17:23.030,0:17:24.240 +You will are just going to + +0:17:24.240,0:17:26.839 +see three of them. + +0:17:26.839,0:17:28.439 +Um, + +0:17:28.439,0:17:31.940 +an important thing is that KTR + +0:17:31.940,0:17:34.520 +doesn't handle, + +0:17:34.520,0:17:37.770 +for example + +0:17:37.770,0:17:38.300 +pointers, + +0:17:38.300,0:17:40.340 +doesn't store the information + +0:17:40.340,0:17:45.450 +passed to init, it just stores the pointer + +0:17:45.450,0:17:46.880 +and not the information, + +0:17:46.880,0:17:48.390 +for example, + +0:17:48.390,0:17:50.160 +a string, + +0:17:50.160,0:17:55.000 +it doesn't make copies, you need to just pass +the pointers + +0:17:55.000,0:17:57.610 +which are persistent in the memory. + +0:17:57.610,0:18:00.340 +Otherwise, + +0:18:00.340,0:18:05.500 +KTR won't be able to access them. + +0:18:05.500,0:18:09.760 +The good thing about KTR is that + +0:18:09.760,0:18:11.600 +you can also look at it from the user space + +0:18:11.600,0:18:13.430 +through the ktrdump interface. 
+ +0:18:13.430,0:18:15.820 +... + +0:18:15.820,0:18:17.030 +And now... + +0:18:17.030,0:18:19.669 +Why is that important for locking? + +0:18:19.669,0:18:20.279 +Because + +0:18:20.279,0:18:21.090 +after, + +0:18:21.090,0:18:24.350 +it can tell you what happed, + +0:18:24.350,0:18:27.260 +on which CPU branches, + +0:18:27.260,0:18:30.020 +and the order it happened in. + +0:18:30.020,0:18:34.580 +and this is very important when you're +going to track down for example races, + +0:18:34.580,0:18:37.720 +when you are not sure about the order of operations and + +0:18:37.720,0:18:44.710 +how they happened. It's going to tell you. + +0:18:44.710,0:18:46.290 +For example + +0:18:46.290,0:18:48.650 +that is + +0:18:48.650,0:18:51.090 +a typical trace of KTR, + +0:18:51.090,0:18:52.410 +where you have + +0:18:52.410,0:18:56.890 +the CPU where the event happened, the index, + +0:18:56.890,0:18:58.620 +that's a timestamp, + +0:18:58.620,0:19:03.400 +I think it's retrieved directly from the TSC, +but i'm not actually sure. + +0:19:03.400,0:19:04.889 +In this case, + +0:19:04.889,0:19:10.210 +i was tracking down the scheduler class, + +0:19:10.210,0:19:16.100 +so I was interested mainly in scheduler +workloads and i could see + +0:19:16.100,0:19:19.210 +for example + +0:19:19.210,0:19:21.100 +that + +0:19:21.100,0:19:24.870 +a contact switch happened + +0:19:24.870,0:19:26.919 +scheduling + +0:19:26.919,0:19:28.010 +the idle CPU + +0:19:28.010,0:19:30.270 +and then other information, + +0:19:30.270,0:19:34.370 +like for example the load of the CPU 1 + +0:19:34.370,0:19:37.190 +and scan priority boost, + +0:19:37.190,0:19:38.870 +like + +0:19:38.870,0:19:40.310 +this one + +0:19:40.310,0:19:46.420 +and other things. + +0:19:46.420,0:19:48.770 +Well + +0:19:48.770,0:19:50.820 +you can enable it + +0:19:50.820,0:19:55.280 +the option KTR, but you must handle it carefully. 
+ +0:19:55.280,0:19:57.130 +For example + +0:19:57.130,0:20:01.990 +use an option i didn't include here, + +0:20:01.990,0:20:07.410 +which is the length of the buffer it uses +to store the pointers in, it's called KTR_ENTRIES, + +0:20:07.410,0:20:08.360 +and you should specify + +0:20:08.360,0:20:09.590 +enough entries + +0:20:09.590,0:20:11.500 +to have a reliable tracking. + +0:20:11.500,0:20:13.580 +... + +0:20:13.580,0:20:16.780 +For example, if you are going to track a lot of events, + +0:20:16.780,0:20:19.100 +a short queue is not an option, + +0:20:19.100,0:20:22.120 +because you are going to lose some information. + +0:20:22.120,0:20:26.780 +A typical queue is of length 2K (2 kilobytes) + +0:20:26.780,0:20:29.710 +of entries. + +0:20:29.710,0:20:32.520 +These other options + +0:20:32.520,0:20:36.190 +let you compile some classes, + +0:20:36.190,0:20:38.370 +or mask them, + +0:20:38.370,0:20:43.770 +or mask a CPU, + +0:20:43.770,0:20:46.289 +if you have a big SMP environment, + +0:20:46.289,0:20:50.160 +so that you can selectively enable some of them. + +0:20:50.160,0:20:54.700 +For example, this is very good for +tracking down races in the sleeping queue. + +0:20:54.700,0:21:01.700 +You can find referrals here. + +0:21:02.770,0:21:04.820 +... + +0:21:04.820,0:21:06.220 +So, + +0:21:06.220,0:21:09.020 +I will spend the last time of the speech +speaking about possible improvements + +0:21:09.020,0:21:10.500 +to + +0:21:10.500,0:21:15.670 +our locking system, which is not very bad. + +0:21:15.670,0:21:16.500 +Well, + +0:21:16.500,0:21:21.750 +I think that actually our locking system +is pretty complete, + +0:21:21.750,0:21:26.919 +but it's also confusing for newcomers, +it's not widely documented. + +0:21:26.919,0:21:32.280 +so maybe we will spend a good amount of time +on documentation. 
+ +0:21:32.280,0:21:32.799 +As you can see, + +0:21:32.799,0:21:38.120 +even in this presentation, which is not very huge, + +0:21:38.120,0:21:42.540 +there are many things to say + +0:21:42.540,0:21:46.700 +and that are not very simple to understand in particular + +0:21:46.700,0:21:48.240 +for people + +0:21:48.240,0:21:50.280 +who just need to do simple tasks. + +0:21:50.280,0:21:56.660 +For example, I saw a lot of guys coming from Linux World + +0:21:56.660,0:22:00.620 +who wanted to actually use spinlocks for time + +0:22:00.620,0:22:05.720 +it's obvious they are missing something from our +architecture + +0:22:05.720,0:22:07.250 +From, uh, ... + +0:22:07.250,0:22:11.010 +just a technical point of view, + +0:22:11.010,0:22:14.530 +it would be very good if we could remove + +0:22:14.530,0:22:20.440 +legacy support and overriding support. For example, + +0:22:20.440,0:22:22.900 +we have lockmgr and sxlog + +0:22:22.900,0:22:27.990 +which are both read/write locks and +are both serverd by sleep queues. + +0:22:27.990,0:22:31.800 +They have some differences, obviously, + +0:22:31.800,0:22:32.660 +but, mainly, + +0:22:32.660,0:22:38.920 +we could manage the missing bits and +just one of the 2 interfaces. + +0:22:38.920,0:22:42.059 +on the scene where he hasn't told you before +visiting all + +0:22:42.059,0:22:43.920 +In the same way, as i told you before, + +0:22:43.920,0:22:47.340 +the sleeping point, true end sleep, +read/write sleep and sxsleep + +0:22:47.340,0:22:50.350 +should probably be merged with cond vars + +0:22:50.350,0:22:52.930 +and superdoff our kernel + +0:22:52.930,0:22:55.070 +and we should probably drop sema, + +0:22:55.070,0:23:00.290 +because it is obsolete, and can be +replaced by condova and mutex. + +0:23:00.290,0:23:02.620 +... 
+ +0:23:02.620,0:23:03.830 +From + +0:23:03.830,0:23:05.639 +a strong technical point of view, + +0:23:05.639,0:23:07.350 +as you can see, + +0:23:07.350,0:23:09.680 +we spent a lot of time + +0:23:09.680,0:23:12.109 +on optimization on our blocking primitives, + +0:23:12.109,0:23:16.770 +but very few on our spinning primitives. + +0:23:16.770,0:23:21.810 +That's because obviously blocking +primitives are our first choice, + +0:23:21.810,0:23:22.700 +but + +0:23:22.700,0:23:25.210 +spinlocks could be improved too, + +0:23:25.210,0:23:30.130 +using technics such as the so-called +back-off algorithms and + +0:23:30.130,0:23:31.429 +queued spinlocks. + +0:23:31.429,0:23:33.560 +I have a patch for that but + +0:23:33.560,0:23:35.499 +I would really need + +0:23:35.499,0:23:39.999 +to test it and tune it on a big SMP environment. + +0:23:39.999,0:23:42.270 +I don't think + +0:23:42.270,0:23:43.540 +that, for now, + +0:23:43.540,0:23:45.730 +they can handle such an environment. + +0:23:45.730,0:23:47.820 +In addition, + +0:23:47.820,0:23:51.500 +I'm not sure you are familiar with +queued spinlocks algorithms. + +0:23:51.500,0:23:55.580 +Basically a back-off algorithms try + +0:23:55.580,0:24:02.250 +to reduce the cache pressure on the + +0:24:02.250,0:24:06.090 +threads contesting on the lock + +0:24:06.090,0:24:08.270 +by giving a time. + +0:24:08.270,0:24:10.290 +Instead, the other one + +0:24:10.290,0:24:16.180 +uses spinning on a local variable +which is not shared by the threads + +0:24:16.180,0:24:18.030 +and the time spent + +0:24:18.030,0:24:20.140 +on that + +0:24:20.140,0:24:22.780 +local variable increases + +0:24:22.780,0:24:25.440 +um... + +0:24:25.440,0:24:28.320 +... + +0:24:28.320,0:24:31.780 +with the passing of time. + +0:24:31.780,0:24:35.950 +Another interesting thing would be benchmarking + +0:24:35.950,0:24:37.930 +different wake-up algorithms for blocking primitives. 
+ +0:24:37.930,0:24:42.390 +We have an algorithms that has proven to be + +0:24:42.390,0:24:42.910 +... + +0:24:42.910,0:24:45.200 +quite good + +0:24:45.200,0:24:47.440 +but + +0:24:47.440,0:24:51.330 +we are not confronted with other kind of +wake-ups that could have + +0:24:51.330,0:24:56.450 +a higher overhead but could give time improvements + +0:24:56.450,0:24:59.760 +on a big SMP environment. + +0:24:59.760,0:25:02.500 +Um, ... + +0:25:02.500,0:25:07.000 +another thing that would be very interesting +to fix is the priority inversion problem + +0:25:07.000,0:25:08.670 +in the case of read locks. + +0:25:08.670,0:25:09.790 +There is an approach + +0:25:09.790,0:25:13.950 +called dominant record implemented in FreeBSD, + +0:25:13.950,0:25:16.360 +but i saw that it's pretty + +0:25:16.360,0:25:22.160 +slow for our fastpack. In FreeBSD all our locking +primitives are broken and add to path + +0:25:22.160,0:25:23.290 +where + +0:25:23.290,0:25:25.820 +the fastpack is + +0:25:25.820,0:25:28.620 +is often just a single atomic operation, + +0:25:28.620,0:25:30.010 +and + +0:25:30.010,0:25:33.770 +if it fails + +0:25:33.770,0:25:36.900 +it falls down and the art pattern tries to do +all the end-work of the sleep queues. + +0:25:36.900,0:25:40.210 +in this case the owner of record technic +was going to make the fastpack too simple + +0:25:40.210,0:25:42.640 +Basically + +0:25:42.640,0:25:46.310 +it just considers + +0:25:46.310,0:25:50.940 +the readets as one + +0:25:50.940,0:25:55.380 +and can switch in and out. + +0:25:55.380,0:26:02.210 +And it practically lands the priority to this +owner of record which does it's right log. + +0:26:02.210,0:26:06.900 +... + +0:26:06.900,0:26:11.900 +Another important thing obviously is improving locking + +0:26:11.900,0:26:13.420 +where the + +0:26:13.420,0:26:15.649 +optimum approach is not chosen. 
+ +0:26:15.649,0:26:21.039 +I see a lot of parts in which the +primitives chosen by the developers + +0:26:21.039,0:26:23.320 +are not the most suitable ones + +0:26:23.320,0:26:27.690 +and we should switch to the right one. + +0:26:27.690,0:26:33.120 +Like, for example, in discriminated usage +of the spinlocks + +0:26:33.120,0:26:34.990 +or the blocking primitives, + +0:26:34.990,0:26:40.430 +just to handle cases like that, +like the one we saw before with the malloc command, + +0:26:40.430,0:26:44.070 +that needs to sleep. + +0:26:44.070,0:26:44.320 +Questions? -- cgit v1.2.3