Diffstat (limited to 'en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv')
-rw-r--r--  en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv  1645
1 files changed, 0 insertions, 1645 deletions
diff --git a/en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv b/en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv
deleted file mode 100644
index bf785730a9..0000000000
--- a/en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv
+++ /dev/null
@@ -1,1645 +0,0 @@
-0:00:00.530,0:00:01.590
-So basically,
-
-0:00:04.590,0:00:10.029
-we are going to look, mainly in this second part,
-at how to
-
-0:00:10.029,0:00:11.519
-handle some
-
-0:00:11.519,0:00:12.560
-locking problems
-
-0:00:12.560,0:00:17.910
-that we can categorize in the kernel.
-
-0:00:17.910,0:00:24.410
-Here are described two kinds of problems
-you can get with locks, which are pretty common.
-
-0:00:24.410,0:00:27.859
-The first one is called Lock Order Reversal (LOR).
-
-0:00:27.859,0:00:30.140
-When you have for example a thread A,
-
-0:00:30.140,0:00:32.340
-which owns
-
-0:00:32.340,0:00:35.870
-a lock called, for example, L1
-
-0:00:35.870,0:00:37.920
-and another thread B
-
-0:00:37.920,0:00:40.070
-which owns the lock, L2
-
-0:00:40.070,0:00:43.150
-Then thread A tries to..
-
-0:00:43.150,0:00:44.730
-Right.. it's wrong.
-
-0:00:44.730,0:00:46.220
-The slide is wrong.
-
-0:00:46.220,0:00:48.020
-The slide is wrong.
-
-0:00:48.020,0:00:51.910
-Thread A tries to acquire L2,
-
-0:00:51.910,0:00:55.670
-but obviously it sleeps because
-it is owned by thread B
-
-0:00:55.670,0:00:58.530
-and then thread B tries to acquire
-
-0:00:58.530,0:01:00.030
-the lock L1
-
-0:01:00.030,0:01:02.240
-and it sleeps because
-
-0:01:02.240,0:01:06.440
-it's owned by thread A.
-
-0:01:06.440,0:01:11.410
-This is a situation that never ends and
-it's pretty much well documented in Cormen
-
-0:01:11.410,0:01:16.150
-and in traditional literature.
-
-0:01:16.150,0:01:18.650
-It's a classical deadlock.
-
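-As a minimal illustration (not from the talk; the lock and function names are
-made up), the reversed ordering just described looks like this with FreeBSD's
-mutex(9) API:
-
-#include <sys/param.h>
-#include <sys/lock.h>
-#include <sys/mutex.h>
-
-/* Both initialized elsewhere with mtx_init(&Lx, "Lx", NULL, MTX_DEF). */
-static struct mtx L1, L2;
-
-static void
-thread_a_path(void)
-{
-        mtx_lock(&L1);
-        mtx_lock(&L2);          /* blocks here: thread B already owns L2 */
-        /* ... */
-        mtx_unlock(&L2);
-        mtx_unlock(&L1);
-}
-
-static void
-thread_b_path(void)
-{
-        mtx_lock(&L2);
-        mtx_lock(&L1);          /* reversed order: the classic LOR/deadlock */
-        /* ... */
-        mtx_unlock(&L1);
-        mtx_unlock(&L2);
-}
-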
-0:01:18.650,0:01:19.960
-This means that,
-
-0:01:19.960,0:01:21.950
-as everybody who
-
-0:01:21.950,0:01:24.940
-has ever read a book
-
-0:01:24.940,0:01:25.899
-about an operating system
-
-0:01:25.899,0:01:30.420
-knows,
-
-0:01:30.420,0:01:32.910
-locks should maintain
-
-0:01:32.910,0:01:34.319
-an ordering with regard to each other.
-
-0:01:34.319,0:01:38.859
-That's not very simple when
-
-0:01:38.859,0:01:40.100
-you speak about a kernel.
-
-0:01:40.100,0:01:44.850
-From this point of view, the fact
-that there are three classes of locks
-
-0:01:44.850,0:01:49.180
-is going to matter, because you cannot
-freely mix two different classes of locks.
-
-0:01:49.180,0:01:50.680
-For example
-
-0:01:50.680,0:01:51.610
-a spinlock
-
-0:01:51.610,0:01:53.770
-and a mutex
-
-0:01:53.770,0:01:59.120
-can only be mixed in one way.
-
-0:01:59.120,0:02:01.720
-You can have the mutex first and the spinlock later,
-while the opposite is not allowed.
-
-0:02:01.720,0:02:07.060
-So, you will see that these kinds
-of deadlocks are possible
-
-0:02:07.060,0:02:09.290
-only in the same class of locks,
-
-0:02:09.290,0:02:13.019
-like, for example, two mutexes or two spin mutexes,
-
-0:02:13.019,0:02:14.569
-or such.
-
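-A rough sketch of that class-ordering rule (the lock names here are
-hypothetical): a spin mutex may be taken while a blocking mutex is held, but
-not the other way around, because a blocking mutex may sleep.
-
-#include <sys/param.h>
-#include <sys/lock.h>
-#include <sys/mutex.h>
-
-static struct mtx sleep_m;      /* initialized with MTX_DEF              */
-static struct mtx spin_m;       /* initialized with MTX_SPIN             */
-
-static void
-allowed_nesting(void)
-{
-        mtx_lock(&sleep_m);             /* blocking mutex first...       */
-        mtx_lock_spin(&spin_m);         /* ...spin mutex inside it: OK   */
-        mtx_unlock_spin(&spin_m);
-        mtx_unlock(&sleep_m);
-}
-
-/*
- * The opposite nesting, mtx_lock(&sleep_m) while spin_m is held, is
- * forbidden: a blocking mutex may sleep, and sleeping with a spinlock
- * held is not allowed.
- */
-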
-0:02:14.569,0:02:16.090
-
-
-0:02:16.090,0:02:17.409
-Also,
-
-0:02:17.409,0:02:19.949
-even if it's not very well documented,
-
-0:02:19.949,0:02:22.880
-for example spinlocks
-
-0:02:22.880,0:02:26.599
-in FreeBSD have a way to
-identify such kinds of deadlocks.
-
-0:02:26.599,0:02:27.619
-And it's pretty much implemented.
-
-0:02:27.619,0:02:29.709
-
-
-0:02:29.709,0:02:32.449
-It's a feature enabled in the code.
-
-0:02:32.449,0:02:34.949
-They just count how many times they are spinning
-
-0:02:34.949,0:02:36.010
-and
-
-0:02:36.010,0:02:39.169
-if it exceeds
-
-0:02:39.169,0:02:41.379
-an exaggerated threshold,
-
-0:02:41.379,0:02:47.870
-it means that they are probably
-under a deadlock and the system panics.
-
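-The idea, in a simplified and purely illustrative sketch (this is not the
-actual FreeBSD code; the names and the threshold are made up):
-
-#include <sys/param.h>
-#include <sys/systm.h>
-#include <machine/atomic.h>
-#include <machine/cpu.h>
-
-#define SPIN_DEADLOCK_LIMIT     10000000        /* arbitrary huge threshold */
-
-static void
-spin_acquire_sketch(volatile u_int *lockword)
-{
-        u_int spins = 0;
-
-        /* Try to move the lock word from 0 (free) to 1 (held). */
-        while (atomic_cmpset_acq_int(lockword, 0, 1) == 0) {
-                if (++spins > SPIN_DEADLOCK_LIMIT)
-                        panic("spinlock: spun too long, probable deadlock");
-                cpu_spinwait();         /* pause hint while busy-waiting */
-        }
-}
-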
-0:02:47.870,0:02:52.489
-Another common problem about locking
-is when you have
-
-0:02:52.489,0:02:55.189
-a wait channel
-
-0:02:55.189,0:02:56.659
-like a condition variable (cond var),
-
-0:02:56.659,0:02:58.849
-which is protected by a lock.
-
-0:02:58.849,0:03:03.629
-There is a chance of races that
-this condition variable encounters,
-
-0:03:03.629,0:03:05.489
-for example,
-
-0:03:05.489,0:03:07.219
-when cond vars are checked
-
-0:03:07.219,0:03:12.649
-with some preliminary conditions,
-like a waiter's counter
-
-0:03:12.649,0:03:16.359
-that has to be updated
-
-0:03:16.359,0:03:21.249
-any time a thread tries to sleep
-on it, or flags that have to be set.
-
-0:03:21.249,0:03:24.909
-If these conditions are not protected
-by the same lock,
-
-0:03:24.909,0:03:30.569
-you can end up having some threads
-sleeping on this wait channel
-
-0:03:30.569,0:03:34.589
-and nobody is going to wake them up again.
-
-0:03:34.589,0:03:37.629
-This is usually called missed wakeup
-
-0:03:37.629,0:03:41.249
-and it's a pretty common mistake
-
-0:03:41.249,0:03:44.799
-that leads to a deadlock.
-
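-A sketch of the pattern being described, with made-up names and the FreeBSD
-condvar(9)/mutex(9) API: the flag and the sleep must be covered by the same
-lock, otherwise the wakeup can slip in between the check and the sleep and be
-missed.
-
-#include <sys/param.h>
-#include <sys/lock.h>
-#include <sys/mutex.h>
-#include <sys/condvar.h>
-
-/* Initialized elsewhere: mtx_init(&qlock, ...); cv_init(&qcv, "qcv"); */
-static struct mtx qlock;        /* protects work_ready and the cv */
-static struct cv  qcv;
-static int        work_ready;
-
-static void
-waiter(void)
-{
-        mtx_lock(&qlock);
-        while (work_ready == 0)
-                cv_wait(&qcv, &qlock);  /* drops and retakes qlock atomically */
-        work_ready = 0;
-        mtx_unlock(&qlock);
-}
-
-static void
-waker(void)
-{
-        mtx_lock(&qlock);       /* without qlock here, the wakeup can be missed */
-        work_ready = 1;
-        cv_signal(&qcv);
-        mtx_unlock(&qlock);
-}
-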
-0:03:44.799,0:03:46.719
-The problem is that
-
-0:03:46.719,0:03:52.109
-it's very difficult to differentiate
-between missed wakeup and
-
-0:03:52.109,0:03:53.480
-for example
-
-0:03:53.480,0:03:56.189
-forever sleep
-
-0:03:56.189,0:03:58.419
-of a thread
-
-0:03:58.419,0:04:01.859
-that is not likely to be awakened.
-
-0:04:01.859,0:04:07.109
-So these kinds of deadlocks are
-very, very difficult to discover
-
-0:04:07.109,0:04:11.669
-and require a bit of the
-work that we will see right now.
-
-0:04:11.669,0:04:14.509
-For example,
-
-0:04:14.509,0:04:15.270
-using
-
-0:04:15.270,0:04:16.219
-some
-
-0:04:16.219,0:04:18.179
-kernel subsystems
-
-0:04:18.179,0:04:22.240
-and some facilities integrated into the debugger.
-
-0:04:22.240,0:04:22.979
-
-
-0:04:22.979,0:04:25.520
-In FreeBSD,
-
-0:04:25.520,0:04:29.859
-we have quite a lot of good mechanisms
-we can use to cope
-
-0:04:29.859,0:04:32.080
-with kernel problems.
-
-0:04:32.080,0:04:36.539
-The first one (and the most important)
-is called WITNESS.
-
-0:04:36.539,0:04:39.169
-It was introduced
-
-0:04:39.169,0:04:42.080
-in the context of SMPng
-
-0:04:42.080,0:04:44.979
-and has been rewritten in the recent past,
-
-0:04:44.979,0:04:47.919
-mainly by a contribution of
-
-0:04:47.919,0:04:51.360
-Isilon Systems.
-
-0:04:51.360,0:04:52.270
-They contributed back then
-
-0:04:52.270,0:04:54.989
-to the writing of WITNESS.
-
-0:04:54.989,0:04:57.389
-This subsystem is very important
-
-0:04:57.389,0:05:02.730
-because it tracks exactly every ordering
-
-0:05:02.730,0:05:03.949
-of the locks.
-
-0:05:03.949,0:05:07.810
-So that, if there is an ordering violation like a LOR,
-
-0:05:07.810,0:05:09.439
-it's going to
-
-0:05:09.439,0:05:12.150
-report it.
-
-0:05:12.150,0:05:18.029
-You can even set it to directly panic if
-it finds some deadlocks,
-
-0:05:18.029,0:05:19.879
-or only
-
-0:05:19.879,0:05:22.729
-some possible deadlocks.
-
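-For reference, an example of how WITNESS is typically enabled in a debugging
-kernel configuration; the option names are as in the witness(4) manual page,
-and which of them you want is a judgment call:
-
-options 	WITNESS			# track lock ordering, report LORs
-options 	WITNESS_KDB		# drop into the debugger on a violation
-options 	WITNESS_SKIPSPIN	# optionally skip spin locks to cut overhead
-options 	INVARIANTS		# extra run-time consistency checks
-options 	INVARIANT_SUPPORT
-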
-0:05:22.729,0:05:27.260
-Another important feature of it is
-
-0:05:27.260,0:05:32.569
-that it can keep track of read/write locks.
-
-0:05:32.569,0:05:33.690
-Doing that,
-
-0:05:33.690,0:05:36.539
-we can identify
-
-0:05:36.539,0:05:38.419
-deadlocks, possibly
-
-0:05:38.419,0:05:39.500
-even
-
-0:05:39.500,0:05:40.690
-on the
-
-0:05:40.690,0:05:45.529
-reader's path.
-
-0:05:45.529,0:05:49.609
-We could say that WITNESS is pretty big,
-
-0:05:49.609,0:05:52.289
-so activating it
-
-0:05:52.289,0:05:55.039
-in your production system is never an option.
-
-0:05:55.039,0:05:59.929
-It's mainly used when you are going to
-develop a new feature in the kernel
-
-0:05:59.929,0:06:02.110
-and you are going to test it heavily.
-
-0:06:02.110,0:06:05.479
-In particular if it has
-
-0:06:05.479,0:06:06.819
-some
-
-0:06:06.819,0:06:10.509
-relation to locking.
-
-0:06:10.509,0:06:13.089
-
-
-0:06:13.089,0:06:17.840
-We could also tell that with the new code
-provided by Isilon and Nokia,
-
-0:06:17.840,0:06:19.150
-basically
-
-0:06:19.150,0:06:21.689
-the overhead
-
-0:06:21.689,0:06:25.479
-introduced by WITNESS is greatly reduced to about
-
-0:06:25.479,0:06:27.699
-a tenth of
-
-0:06:27.699,0:06:30.240
-what we had before.
-
-0:06:30.240,0:06:36.150
-WITNESS is very good at tracking LOR,
-
-0:06:36.150,0:06:37.849
-but
-
-0:06:37.849,0:06:40.009
-it's not very good at, for example,
-
-0:06:40.009,0:06:42.449
-trying to
-
-0:06:42.449,0:06:44.060
-help you
-
-0:06:44.060,0:06:47.479
-in the case of lost wakeups,
-
-0:06:47.479,0:06:49.519
-because of its nature,
-
-0:06:49.519,0:06:52.090
-mainly.
-
-0:06:52.090,0:06:55.889
-It has a very good integration with the DDB debugger
-
-0:06:55.889,0:06:57.740
-and
-
-0:06:57.740,0:06:58.879
-basically
-
-0:06:58.879,0:07:04.159
-in the FreeBSD 8 release,
-we have new features
-
-0:07:04.159,0:07:05.759
-that help you
-
-0:07:05.759,0:07:08.389
-print out backtraces
-
-0:07:08.389,0:07:11.150
-of the contending threads in the LORs
-
-0:07:11.150,0:07:16.039
-and their orderings
-
-0:07:16.039,0:07:17.549
-and
-
-0:07:17.549,0:07:23.550
-it shows some graphs of the relations
-even from the user space.
-
-0:07:23.550,0:07:28.550
-You don't have to go into the kernel
-debugger to look at its output.
-
-0:07:28.550,0:07:35.550
-
-
-0:07:35.620,0:07:37.380
-
-
-0:07:37.380,0:07:42.250
-Well, I see that sometimes, when
-they are released, there is confusion
-
-0:07:42.250,0:07:44.250
-about the information to report
-
-0:07:44.250,0:07:48.440
-in regard to deadlock conditions and what help
-
-0:07:48.440,0:07:50.020
-users can provide to developers
-
-0:07:50.020,0:07:52.039
-about that.
-
-0:07:52.039,0:07:54.020
-So we are going to see
-
-0:07:54.020,0:07:58.700
-all the relevant information to collect
-when a deadlock
-
-0:07:58.700,0:07:59.590
-happens in the kernel.
-
-0:07:59.590,0:08:02.490
-
-
-0:08:02.490,0:08:03.389
-Usually,
-
-0:08:03.389,0:08:07.939
-if you want to find a deadlock
-that's happening in the kernel,
-
-0:08:07.939,0:08:10.909
-your first line of analysis starts from the DDB
-
-0:08:10.909,0:08:13.919
-instead of a post-mortem analysis,
-
-0:08:13.919,0:08:16.839
-which is even more important.
-
-0:08:16.839,0:08:22.330
-But, using DDB, you will get more of the
-processes' state and better information.
-
-0:08:22.330,0:08:24.970
-
-
-0:08:24.970,0:08:28.499
-The most important hints in order to find the deadlock
-
-0:08:28.499,0:08:34.389
-are the LORs reported by WITNESS in order
-to see if there is something strange
-
-0:08:34.389,0:08:36.690
-that can be happening.
-
-0:08:36.690,0:08:41.700
-You want to know the state of all the threads
-that are running on the system that is deadlocking.
-
-0:08:41.700,0:08:42.900
-
-
-0:08:42.900,0:08:47.050
-You can see that you're deadlocking, if you see that
-
-0:08:47.050,0:08:48.070
-on the runqueue
-
-0:08:48.070,0:08:48.540
-there are
-
-0:08:48.540,0:08:51.850
-just idle threads.
-
-0:08:51.850,0:08:56.640
-Because it's like saying that the
-runqueues are completely empty
-
-0:08:56.640,0:09:02.450
-and you have all the threads sleeping
-in their own containers.
-
-0:09:02.450,0:09:07.850
-You need to know which are the exact locks
-that are acquired
-
-0:09:07.850,0:09:11.270
-in the system
-
-0:09:11.270,0:09:15.570
-and that's something that WITNESS provides
-
-0:09:15.570,0:09:20.720
-and the very important thing is
-to know why the threads are stopping.
-
-0:09:20.720,0:09:24.250
-So one of the most important things is
-retrieving what the threads were doing
-
-0:09:24.250,0:09:26.320
-when
-
-0:09:26.320,0:09:28.960
-they were put asleep.
-
-0:09:28.960,0:09:30.070
-
-
-0:09:30.070,0:09:33.009
-The backtraces of all the threads involved
-
-0:09:33.009,0:09:37.130
-are printed out in order to identify deadlocks.
-
-0:09:37.130,0:09:38.589
-In the case that
-
-0:09:38.589,0:09:42.830
-the buffer cache and VFS are
-
-0:09:42.830,0:09:45.910
-probably involved in the deadlock,
-
-0:09:45.910,0:09:50.790
-you should also print out
-
-0:09:50.790,0:09:53.420
-the information about vnodes
-
-0:09:53.420,0:09:58.250
-and what we're interested in is which vnodes are locked,
-
-0:09:58.250,0:09:59.320
-which
-
-0:09:59.320,0:10:01.370
-are actually referenced
-
-0:10:01.370,0:10:03.530
-and
-
-0:10:03.530,0:10:10.530
-in which way they were locked.
-
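-Collected in one place, a rough sketch of the DDB commands that gather the
-information just listed (the following slides walk through their output; the
-annotations after "#" are mine, not DDB syntax):
-
-db> ps                  # state of every thread: who sleeps, and on what
-db> show allpcpu        # what each CPU is currently doing
-db> show alllocks       # every held lock, its owner, file and line (WITNESS)
-db> alltrace            # backtrace of every thread in the system
-db> show lockedvnods    # locked vnodes, for VFS/buffer cache deadlocks
-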
-0:10:11.030,0:10:13.380
-So,
-
-0:10:13.380,0:10:15.770
-this is an example
-
-0:10:15.770,0:10:17.430
-of the
-
-0:10:17.430,0:10:18.880
-thread states
-
-0:10:18.880,0:10:20.760
-in the case of a deadlock.
-
-0:10:20.760,0:10:27.480
-This is a real example of a deadlock
-
-0:10:27.480,0:10:28.900
-but you can see
-
-0:10:28.900,0:10:31.890
-that
-
-0:10:31.890,0:10:35.650
-this is not totally complete.
-
-0:10:35.650,0:10:38.450
-But you can see that all the threads are sleeping.
-
-0:10:38.450,0:10:39.870
-
-
-0:10:39.870,0:10:43.580
-This one is the message
-
-0:10:43.580,0:10:44.790
-used by the wait channel
-
-0:10:44.790,0:10:47.550
-on which they're sleeping
-
-0:10:47.550,0:10:48.710
-or used by
-
-0:10:48.710,0:10:54.480
-the container like the turnstile or the sleepqueue.
-
-0:10:54.480,0:10:59.410
-If I recall correctly, it's a forced unmount
-that deadlocks at some point.
-
-0:10:59.410,0:11:01.290
-I'm not really sure
-
-0:11:01.290,0:11:04.190
-because I should have looked at it.
-
-0:11:04.190,0:11:08.810
-You can see that the relevant command here
-is "ps"
-
-0:11:08.810,0:11:11.220
-that DDB supports.
-
-0:11:11.220,0:11:14.220
-
-
-0:11:14.220,0:11:17.520
-Another important thing
-
-0:11:17.520,0:11:18.820
-is getting
-
-0:11:18.820,0:11:21.680
-the situation of all CPUs.
-
-0:11:21.680,0:11:24.100
-As you can see there,
-
-0:11:24.100,0:11:25.210
-usually
-
-0:11:25.210,0:11:31.600
-it's because you can have some corrupted data structures
-
-0:11:31.600,0:11:34.320
-in the per-CPU data.
-
-0:11:34.320,0:11:38.830
-That's a very common situation where you can get deadlocks,
-
-0:11:38.830,0:11:40.280
-because, for example,
-
-0:11:40.280,0:11:43.149
-leaving a corrupted LPD will lead
-
-0:11:48.750,0:11:55.290
-to a bigger, massive breakage like
-double faults and things like that. Usually it's a
-good idea to look at all the CPUs involved in the system.
-
-0:11:55.290,0:11:57.310
-The command
-
-0:11:57.310,0:12:00.120
-is """"-show allpcpu"".
-
-0:12:00.120,0:12:04.960
-
-
-0:12:04.960,0:12:06.959
-This one
-
-0:12:06.959,0:12:12.009
-is a WITNESS specific command ""-show alllocks""
-and it's going to show all the locks,
-
-0:12:12.009,0:12:13.130
-in the system,
-
-0:12:13.130,0:12:15.070
-who is the owner,
-
-0:12:15.070,0:12:15.850
-like in this case,
-
-0:12:15.850,0:12:17.690
-a mount,
-
-0:12:17.690,0:12:21.270
-and the thread is this one,
-
-0:12:21.270,0:12:23.660
-which lock it is holding,
-
-0:12:23.660,0:12:24.970
-that's the address
-
-0:12:24.970,0:12:27.360
-and where it was acquired.
-
-0:12:27.360,0:12:31.140
-It gives you file and line.
-
-0:12:31.140,0:12:32.770
-
-
-0:12:32.770,0:12:34.730
-Actually,
-
-0:12:34.730,0:12:37.620
-that's just possible
-
-0:12:37.620,0:12:40.859
-with WITNESS, because otherwise,
-
-0:12:40.859,0:12:44.410
-trying to keep all this information
-
-0:12:44.410,0:12:51.410
-in a general-purpose kernel would be
-very expensive for our locking subsystem.
-
-0:12:55.330,0:12:59.730
-Then, the most important thing is
-the backtrace for any thread.
-
-0:12:59.730,0:13:01.150
-
-
-0:13:01.150,0:13:03.390
-It's going to show the backtrace
-
-0:13:03.390,0:13:05.700
-for all the threads.
-
-0:13:05.700,0:13:08.380
-You see,
-
-0:13:08.380,0:13:09.169
-In this case,
-
-0:13:09.169,0:13:13.010
-the thread with this TID and PID
-
-0:13:13.010,0:13:15.350
-basically went to sleep
-
-0:13:15.350,0:13:17.140
-on a vnode.
-
-0:13:17.140,0:13:22.020
-You will see that the backend in this case is FFF
-
-0:13:22.020,0:13:24.000
-and
-
-0:13:24.000,0:13:25.729
-that's the context switching function,
-
-0:13:25.729,0:13:26.900
-
-
-0:13:26.900,0:13:32.220
-those are the sleepqueues, the containers
-that are holding the threads,
-
-0:13:32.220,0:13:34.230
-and this one
-
-0:13:34.230,0:13:36.370
-is what it was going to do
-
-0:13:36.370,0:13:37.910
-before,
-
-0:13:37.910,0:13:41.810
-in this case mounting the filesystems.
-
-0:13:41.810,0:13:47.220
-You will see that in a complete report,
-
-0:13:47.220,0:13:50.310
-you will have a lot of these kinds of traces,
-
-0:13:50.310,0:13:53.079
-but they are very important
-
-0:13:53.079,0:13:59.270
-for the developers in order to understand
-what is going on.
-
-0:13:59.270,0:14:02.590
-These ones are the locked vnodes
-
-0:14:02.590,0:14:05.830
-that are also very important when
-
-0:14:05.830,0:14:11.780
-a deadlock happens in VFS or in the buffer cache.
-
-0:14:11.780,0:14:13.700
-You will see
-
-0:14:13.700,0:14:18.580
-that these are the ref counts linked to vnodes,
-
-0:14:18.580,0:14:20.980
-they are specific
-
-0:14:20.980,0:14:23.850
-to some handling of the vnodes such as recycling,
-
-0:14:23.850,0:14:26.020
-and completely freeing.
-
-0:14:26.020,0:14:27.290
-That's the mount point
-
-0:14:27.290,0:14:28.770
-where the vnodes
-
-0:14:28.770,0:14:31.740
-belong
-
-0:14:31.740,0:14:33.930
-and
-
-0:14:33.930,0:14:35.910
-that is the backtrace
-
-0:14:35.910,0:14:39.760
-of what happened when the vnode
-
-0:14:39.760,0:14:41.060
-was acquired.
-
-0:14:41.060,0:14:46.600
-You can see that this command also gives you information
-
-0:14:46.600,0:14:49.000
-about the lock linked to the vnode.
-
-0:14:49.000,0:14:51.640
-For example, it tells you that
-
-0:14:51.640,0:14:52.830
-the lock
-
-0:14:52.830,0:14:55.040
-is in exclusive mode
-
-0:14:55.040,0:14:56.280
-and
-
-0:14:56.280,0:14:59.320
-it does some shared waits
-
-0:14:59.320,0:15:03.260
-on its queues.
-
-0:15:03.260,0:15:04.090
-That's also
-
-0:15:04.090,0:15:06.370
-the inode number.
-
-0:15:06.370,0:15:09.140
-
-
-0:15:09.140,0:15:13.880
-There is also other information you could receive
-from the DDB linked to, for example,
-
-0:15:13.880,0:15:16.980
-debugging deadlocks,
-
-0:15:16.980,0:15:18.100
-like sleep chains,
-
-0:15:18.100,0:15:19.310
-for any
-
-0:15:19.310,0:15:24.250
-wait channel, if you have the address
-
-0:15:24.250,0:15:27.150
-and for example,
-
-0:15:27.150,0:15:32.650
-you can also print the whole table of
-the lock relations from WITNESS
-
-0:15:32.650,0:15:38.010
-but it's mostly never useful
-because you should already know that.
-
-0:15:38.010,0:15:41.100
-So you will just need to know which is the one
-
-0:15:41.100,0:15:41.980
-that
-
-0:15:41.980,0:15:43.019
-
-
-0:15:43.019,0:15:47.750
-can give the trouble.
-
-0:15:47.750,0:15:51.640
-
-0:15:51.640,0:15:53.980
-So if you are going to submit some problems
-
-0:15:53.980,0:15:57.180
-usually called PRs, that are probably
-deadlocks in the kernel space,
-
-0:15:57.180,0:16:04.130
-those
-
-0:16:04.130,0:16:11.130
-are the pieces of information that we actually need.
-
-0:16:11.650,0:16:14.970
-Now,
-
-0:16:14.970,0:16:18.950
-it's very difficult to see very good reports
-about deadlocks,
-
-0:16:18.950,0:16:20.020
-so
-
-0:16:20.020,0:16:22.569
-I think that
-
-0:16:22.569,0:16:25.670
-it is a very good thing to talk about it.
-
-0:16:25.670,0:16:31.420
-Along with WITNESS, we have another
-important mechanism that could help us with deadlocks
-
-0:16:31.420,0:16:34.620
-and it's called KTR.
-
-0:16:34.620,0:16:36.100
-KTR is
-
-0:16:36.100,0:16:40.630
-basically a logger, a kernel logger, of events.
-
-0:16:40.630,0:16:42.550
-It's
-
-0:16:42.550,0:16:45.090
-highly configurable,
-
-0:16:45.090,0:16:48.280
-as you can, for example,
-handle different classes of events.
-
-0:16:48.280,0:16:53.940
-In FreeBSD we have
-
-0:16:53.940,0:16:55.130
-classes linked to the scheduler,
-
-0:16:55.130,0:16:56.290
-to the locking,
-
-0:16:56.290,0:16:58.520
-to the VFS, to callouts,
-
-0:16:58.520,0:17:05.030
-and each of them is packed in its own class.
-
-0:17:05.030,0:17:08.880
-So they can be selectively enabled or masked.
-
-0:17:08.880,0:17:10.030
-For example
-
-0:17:10.030,0:17:12.190
-the difference is that you can
-enable several classes,
-
-0:17:12.190,0:17:13.610
-like
-
-0:17:13.610,0:17:16.470
-the ten classes of the KTR
-
-0:17:16.470,0:17:21.000
-and then you are just interested in three of them
-even if all ten of them are
-
-0:17:21.000,0:17:23.030
-actually tracked.
-
-0:17:23.030,0:17:24.240
-You are just going to
-
-0:17:24.240,0:17:26.839
-see three of them.
-
-0:17:26.839,0:17:28.439
-Um,
-
-0:17:28.439,0:17:31.940
-an important thing is that KTR
-
-0:17:31.940,0:17:34.520
-doesn't handle,
-
-0:17:34.520,0:17:37.770
-for example
-
-0:17:37.770,0:17:38.300
-pointers,
-
-0:17:38.300,0:17:40.340
-doesn't store the information
-
-0:17:40.340,0:17:45.450
-passed to it, it just stores the pointer
-
-0:17:45.450,0:17:46.880
-and not the information,
-
-0:17:46.880,0:17:48.390
-for example,
-
-0:17:48.390,0:17:50.160
-the strings,
-
-0:17:50.160,0:17:55.000
-it doesn't make copies, you need to just pass
-the pointers
-
-0:17:55.000,0:17:57.610
-which are persistent in the memory.
-
-0:17:57.610,0:18:00.340
-Otherwise,
-
-0:18:00.340,0:18:05.500
-KTR won't be able to access them.
-
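-A sketch of what that means in practice (the subsystem name is hypothetical):
-the format string and any string arguments handed to the CTR macros must
-point to storage that stays valid, because only the pointers are recorded.
-
-#include <sys/param.h>
-#include <sys/ktr.h>
-
-static const char mysubsys_name[] = "mysubsys";  /* static storage: fine */
-
-static void
-mysubsys_event(int value)
-{
-        /* KTR_GEN is the generic class; the pointers are stored, not copies. */
-        CTR2(KTR_GEN, "mysubsys(%s): event, value %d", mysubsys_name, value);
-        /* Passing a pointer to an on-stack buffer here would be a bug. */
-}
-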
-0:18:05.500,0:18:09.760
-The good thing about KTR is that
-
-0:18:09.760,0:18:11.600
-you can also look at it from the user space
-
-0:18:11.600,0:18:13.430
-through the ktrdump interface.
-
-0:18:13.430,0:18:15.820
-
-
-0:18:15.820,0:18:17.030
-
-
-0:18:17.030,0:18:19.669
-Why is that important for locking?
-
-0:18:19.669,0:18:20.279
-Because
-
-0:18:20.279,0:18:21.090
-after,
-
-0:18:21.090,0:18:24.350
-it can tell you what happened,
-
-0:18:24.350,0:18:27.260
-on which CPU,
-
-0:18:27.260,0:18:30.020
-and the order it happened in.
-
-0:18:30.020,0:18:34.580
-This is very important when you're
-going to track down, for example, races,
-
-0:18:34.580,0:18:37.720
-when you are not sure about the order of operations and
-
-0:18:37.720,0:18:44.710
-how they happened. It's going to tell you.
-
-0:18:44.710,0:18:46.290
-For example
-
-0:18:46.290,0:18:48.650
-that is
-
-0:18:48.650,0:18:51.090
-a typical trace of KTR,
-
-0:18:51.090,0:18:52.410
-where you have
-
-0:18:52.410,0:18:56.890
-the CPU where the event happened, that's the index,
-
-0:18:56.890,0:18:58.620
-that's a timestamp,
-
-0:18:58.620,0:19:03.400
-I think it's retrieved directly from the TSC,
-but I'm actually not sure.
-
-0:19:03.400,0:19:04.889
-In this case,
-
-0:19:04.889,0:19:10.210
-I was tracking down the scheduler class,
-
-0:19:10.210,0:19:16.100
-so I was interested mainly in scheduler
-workloads and I could see
-
-0:19:16.100,0:19:19.210
-for example
-
-0:19:19.210,0:19:21.100
-that
-
-0:19:21.100,0:19:24.870
-a context switch happened
-
-0:19:24.870,0:19:26.919
-scheduling
-
-0:19:26.919,0:19:28.010
-the idle thread
-
-0:19:28.010,0:19:30.270
-and then other information,
-
-0:19:30.270,0:19:34.370
-like, for example, the load of CPU 1
-
-0:19:34.370,0:19:37.190
-and scan priority boost,
-
-0:19:37.190,0:19:38.870
-like
-
-0:19:38.870,0:19:40.310
-this one
-
-0:19:40.310,0:19:46.420
-and other things.
-
-0:19:46.420,0:19:48.770
-
-
-0:19:48.770,0:19:50.820
-You can enable
-
-0:19:50.820,0:19:55.280
-the option KTR, but you must handle it carefully.
-
-0:19:55.280,0:19:57.130
-For example
-
-0:19:57.130,0:20:01.990
-there is an option I didn't include here,
-
-0:20:01.990,0:20:07.410
-which is the length of the buffer it uses
-to store the pointers in, it's called KTR_ENTRIES,
-
-0:20:07.410,0:20:08.360
-and you should specify
-
-0:20:08.360,0:20:09.590
-enough entries
-
-0:20:09.590,0:20:11.500
-to have a reliable tracking.
-
-0:20:11.500,0:20:13.580
-
-
-0:20:13.580,0:20:16.780
-For example, if you are going to track a lot of events,
-
-0:20:16.780,0:20:19.100
-a short queue is not an option,
-
-0:20:19.100,0:20:22.120
-because you are going to lose some information.
-
-0:20:22.120,0:20:26.780
-A typical queue is of length 2K (2048)
-
-0:20:26.780,0:20:29.710
-entries.
-
-0:20:29.710,0:20:32.520
-These other options
-
-0:20:32.520,0:20:36.190
-let you compile some classes,
-
-0:20:36.190,0:20:38.370
-or mask them,
-
-0:20:38.370,0:20:43.770
-or even mask CPUs,
-
-0:20:43.770,0:20:46.289
-If you have a big SMP environment,
-
-0:20:46.289,0:20:50.160
-so that you can selectively enable some of them.
-
-0:20:50.160,0:20:54.700
-For example, this is very good for
-tracking down traces in the sleeping queue.
-
-0:20:54.700,0:21:01.700
-You can find references here.
-
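-An example configuration fragment, with option names as in the ktr(4) manual
-page; the values here are only illustrative, not recommendations. The buffer
-can then be read from userland with ktrdump(8), e.g. "ktrdump -ct".
-
-options 	KTR
-options 	KTR_ENTRIES=2048			# ring buffer slots, a power of two
-options 	KTR_COMPILE=(KTR_SCHED|KTR_LOCK)	# classes compiled into the kernel
-options 	KTR_MASK=KTR_SCHED			# classes actually enabled at boot
-options 	KTR_CPUMASK=0x3				# only trace events on CPUs 0 and 1
-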
-0:21:02.770,0:21:04.820
-
-
-0:21:04.820,0:21:06.220
-So,
-
-0:21:06.220,0:21:09.020
-I will spend the last part of the talk
-speaking about possible improvements
-
-0:21:09.020,0:21:10.500
-to
-
-0:21:10.500,0:21:15.670
-our locking system, which is not very bad.
-
-0:21:15.670,0:21:16.500
-Well,
-
-0:21:16.500,0:21:21.750
-I think that actually our locking system
-is pretty complete,
-
-0:21:21.750,0:21:26.919
-but it's also pretty confusing for newcomers,
-and it's not widely documented.
-
-0:21:26.919,0:21:32.280
-so maybe we should spend a good amount of time
-on documentation.
-
-0:21:32.280,0:21:32.799
-As you can see,
-
-0:21:32.799,0:21:38.120
-even in this presentation, which is not very huge,
-
-0:21:38.120,0:21:42.540
-there are many things to say
-
-0:21:42.540,0:21:46.700
-and that are not very simple to understand in particular
-
-0:21:46.700,0:21:48.240
-for people
-
-0:21:48.240,0:21:50.280
-who just need to do simple tasks.
-
-0:21:50.280,0:21:56.660
-For example, I saw a lot of guys coming from the Linux world
-
-0:21:56.660,0:22:00.620
-who wanted to actually use spinlocks all the time.
-
-0:22:00.620,0:22:05.720
-It's obvious they are missing something from our
-architecture.
-
-0:22:05.720,0:22:07.250
-From
-
-0:22:07.250,0:22:11.010
-just a technical point of view,
-
-0:22:11.010,0:22:14.530
-it would be very good if we could remove
-
-0:22:14.530,0:22:20.440
-legacy support and overlapping support. For example,
-
-0:22:20.440,0:22:22.900
-we have lockmgr and sx locks,
-
-0:22:22.900,0:22:27.990
-which are both read/write locks and
-are both served by sleepqueues.
-
-0:22:27.990,0:22:31.800
-They have some differences, obviously,
-
-0:22:31.800,0:22:32.660
-but, mainly,
-
-0:22:32.660,0:22:38.920
-we could manage the missing bits and
-just use one of the two interfaces.
-
-0:22:38.920,0:22:42.059
-
-
-0:22:42.059,0:22:43.920
-In the same way, as I told you before,
-
-0:22:43.920,0:22:47.340
-the sleeping points, tsleep,
-rw_sleep and sx_sleep,
-
-0:22:47.340,0:22:50.350
-should probably be managed with cond vars
-
-0:22:50.350,0:22:52.930
-and dropped from our kernel
-
-0:22:52.930,0:22:55.070
-and we should probably drop sema,
-
-0:22:55.070,0:23:00.290
-because it is obsolete, and can be
-replaced by condvars and mutexes.
-
-0:23:00.290,0:23:02.620
-
-
-0:23:02.620,0:23:03.830
-From
-
-0:23:03.830,0:23:05.639
-a strong technical point of view,
-
-0:23:05.639,0:23:07.350
-as you can see,
-
-0:23:07.350,0:23:09.680
-we spent a lot of time
-
-0:23:09.680,0:23:12.109
-on optimizing our blocking primitives,
-
-0:23:12.109,0:23:16.770
-but very few on our spinning primitives.
-
-0:23:16.770,0:23:21.810
-That's because obviously blocking
-primitives are our first choice,
-
-0:23:21.810,0:23:22.700
-but
-
-0:23:22.700,0:23:25.210
-spinlocks could be improved too,
-
-0:23:25.210,0:23:30.130
-using techniques such as the so-called
-back-off algorithms and
-
-0:23:30.130,0:23:31.429
-queued spinlocks.
-
-0:23:31.429,0:23:33.560
-I have a patch for that but
-
-0:23:33.560,0:23:35.499
-I would really need
-
-0:23:35.499,0:23:39.999
-to test it and tune it on a big SMP environment.
-
-0:23:39.999,0:23:42.270
-I don't think
-
-0:23:42.270,0:23:43.540
-that, for now,
-
-0:23:43.540,0:23:45.730
-they can handle such an environment.
-
-0:23:45.730,0:23:47.820
-In addition,
-
-0:23:47.820,0:23:51.500
-I'm not sure you are familiar with
-queued spinlock algorithms.
-
-0:23:51.500,0:23:55.580
-Basically, back-off algorithms try
-
-0:23:55.580,0:24:02.250
-to reduce the cache pressure on the
-
-0:24:02.250,0:24:06.090
-threads contending on the lock
-
-0:24:06.090,0:24:08.270
-by making them wait for some time.
-
-0:24:08.270,0:24:10.290
-Instead, the other one
-
-0:24:10.290,0:24:16.180
-uses spinning on a local variable
-which is not shared by the threads.
-
-0:24:16.180,0:24:18.030
-and the time spent
-
-0:24:18.030,0:24:20.140
-on that
-
-0:24:20.140,0:24:22.780
-local variable increases
-
-0:24:22.780,0:24:25.440
-
-
-0:24:25.440,0:24:28.320
-
-
-0:24:28.320,0:24:31.780
-with the passing of time.
-
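-A simplified sketch of the back-off idea (this is not the FreeBSD
-implementation; names and the cap are made up): after each failed attempt,
-spin privately for a growing number of iterations before touching the shared
-lock word again.
-
-#include <sys/param.h>
-#include <machine/atomic.h>
-#include <machine/cpu.h>
-
-static void
-backoff_spin_lock(volatile u_int *lockword)
-{
-        u_int i, delay = 1;
-
-        while (atomic_cmpset_acq_int(lockword, 0, 1) == 0) {
-                for (i = 0; i < delay; i++)
-                        cpu_spinwait();  /* wait locally, off the shared line */
-                if (delay < 1024)
-                        delay <<= 1;     /* exponential back-off, capped */
-        }
-}
-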
-0:24:31.780,0:24:35.950
-Another interesting thing would be benchmarking
-
-0:24:35.950,0:24:37.930
-different wake-up algorithms for blocking primitives.
-
-0:24:37.930,0:24:42.390
-We have an algorithm that has proven to be
-
-0:24:42.390,0:24:42.910
-
-
-0:24:42.910,0:24:45.200
-quite good
-
-0:24:45.200,0:24:47.440
-but
-
-0:24:47.440,0:24:51.330
-we have not compared it with other kinds of
-wake-ups that could have
-
-0:24:51.330,0:24:56.450
-a higher overhead but could give time improvements
-
-0:24:56.450,0:24:59.760
-on a big SMP environment.
-
-0:24:59.760,0:25:02.500
-
-
-0:25:02.500,0:25:07.000
-Another thing that would be very interesting
-to fix is the priority inversion problem
-
-0:25:07.000,0:25:08.670
-in the case of read locks.
-
-0:25:08.670,0:25:09.790
-There is an approach
-
-0:25:09.790,0:25:13.950
-called owner of record, implemented in FreeBSD,
-
-0:25:13.950,0:25:16.360
-but I saw that it's pretty
-
-0:25:16.360,0:25:22.160
-slow for our fast path. In FreeBSD all our locking
-primitives are broken into a fast path and a slow path,
-
-0:25:22.160,0:25:23.290
-where
-
-0:25:23.290,0:25:25.820
-the fast path
-
-0:25:25.820,0:25:28.620
-is often just a single atomic operation,
-
-0:25:28.620,0:25:30.010
-and
-
-0:25:30.010,0:25:33.770
-if it fails,
-
-0:25:33.770,0:25:36.900
-it falls back to the slow path, which does
-all the hard work with the sleepqueues.
-
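-A sketch of that split (the struct and helper names are hypothetical): the
-fast path is one atomic compare-and-set on the lock word, and only when that
-fails does the code enter the hard path that deals with turnstiles or
-sleepqueues.
-
-#include <sys/param.h>
-#include <sys/proc.h>
-#include <machine/atomic.h>
-
-struct some_lock {
-        volatile uintptr_t owner;       /* 0 when the lock is free */
-};
-
-void some_lock_acquire_hard(struct some_lock *lk);  /* hypothetical slow path */
-
-static void
-some_lock_acquire(struct some_lock *lk)
-{
-        /* Fast path: a single atomic operation in the uncontended case. */
-        if (atomic_cmpset_acq_ptr(&lk->owner, 0, (uintptr_t)curthread))
-                return;
-        /* Slow path: queue on the turnstile/sleepqueue and block. */
-        some_lock_acquire_hard(lk);
-}
-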
-0:25:36.900,0:25:40.210
-In this case the owner of record technique
-was going to make the fast path not so simple.
-
-0:25:40.210,0:25:42.640
-Basically,
-
-0:25:42.640,0:25:46.310
-it just considers
-
-0:25:46.310,0:25:50.940
-the readers as one
-
-0:25:50.940,0:25:55.380
-and can switch in and out.
-
-0:25:55.380,0:26:02.210
-And it practically lends the priority to this
-owner of record, which holds its read lock.
-
-0:26:02.210,0:26:06.900
-
-
-0:26:06.900,0:26:11.900
-Another important thing obviously is improving locking
-
-0:26:11.900,0:26:13.420
-where the
-
-0:26:13.420,0:26:15.649
-optimum approach is not chosen.
-
-0:26:15.649,0:26:21.039
-I see a lot of parts in which the
-primitives chosen by the developers
-
-0:26:21.039,0:26:23.320
-are not the most suitable ones
-
-0:26:23.320,0:26:27.690
-and we should switch to the right one.
-
-0:26:27.690,0:26:33.120
-Like, for example, the indiscriminate usage
-of spinlocks
-
-0:26:33.120,0:26:34.990
-or the blocking primitives,
-
-0:26:34.990,0:26:40.430
-just to handle cases like that,
-like the one we saw before with the malloc call,
-
-0:26:40.430,0:26:44.070
-that needs to sleep.
-
-0:26:44.070,0:26:44.320
-Any questions?