aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorMurray Stokely <murray@FreeBSD.org>2010-02-08 21:53:03 +0000
committerMurray Stokely <murray@FreeBSD.org>2010-02-08 21:53:03 +0000
commit8b8b0168e448460cef83bd8e7632467c30b2cec2 (patch)
tree237f6a9e9641293e82d8dc535f1e259c99b652f8
parent44e3c10d1ae747221f5752f96733c583b57979f6 (diff)
downloaddoc-8b8b0168e448460cef83bd8e7632467c30b2cec2.tar.gz
doc-8b8b0168e448460cef83bd8e7632467c30b2cec2.zip
Add improved human edited transcript from Amazon Mechanical Turk and
YouTube Automatic Captions. Sponsored by: FreeBSD Foundation
Notes
Notes: svn path=/head/; revision=35344
-rw-r--r--en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv1650
1 files changed, 1650 insertions, 0 deletions
diff --git a/en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv b/en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv
new file mode 100644
index 0000000000..16f08a914a
--- /dev/null
+++ b/en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv
@@ -0,0 +1,1650 @@
+0:00:00.530,0:00:01.590
+So, basically,
+
+0:00:04.590,0:00:10.029
+we are going to look mainly in this second part
+at how to
+
+0:00:10.029,0:00:11.519
+handle some
+
+0:00:11.519,0:00:12.560
+locking problems
+
+0:00:12.560,0:00:17.910
+that categorize in the kernel.
+
+0:00:17.910,0:00:24.410
+Here there are described two kinds of problems
+you can get with locks, that are pretty much common
+
+0:00:24.410,0:00:27.859
+The first one is called Lock Order Reversal
+
+0:00:27.859,0:00:30.140
+when you have for example a thread A
+
+0:00:30.140,0:00:32.340
+which owns
+
+0:00:32.340,0:00:35.870
+a lock code, for example, L1
+
+0:00:35.870,0:00:37.920
+and another thread B
+
+0:00:37.920,0:00:40.070
+which owns another lock, L2
+
+0:00:40.070,0:00:43.150
+Then thread A tries to
+
+0:00:43.150,0:00:44.730
+Right, it's wrong.
+
+0:00:44.730,0:00:46.220
+The slide is wrong.
+
+0:00:46.220,0:00:48.020
+The slide is wrong.
+
+0:00:48.020,0:00:51.910
+Thread A tries to acquire L2,
+
+0:00:51.910,0:00:55.670
+but obviously it sleeps because
+it is owned by thread B
+
+0:00:55.670,0:00:58.530
+and then thread B tries to acquire
+
+0:00:58.530,0:01:00.030
+the lock L1
+
+0:01:00.030,0:01:02.240
+and it sleeps because
+
+0:01:02.240,0:01:06.440
+it's owned by thread B.
+
+0:01:06.440,0:01:11.410
+This is a situation that never ends and
+it's pretty much well documented in Cormen
+
+0:01:11.410,0:01:16.150
+and in traditional literature.
+
+0:01:16.150,0:01:18.650
+It's a classical deadlock.
+
+0:01:18.650,0:01:19.960
+This means that,
+
+0:01:19.960,0:01:21.950
+as everybody who
+
+0:01:21.950,0:01:24.940
+has ever read a book
+
+0:01:24.940,0:01:25.899
+about an operating system,
+
+0:01:25.899,0:01:30.420
+knows that
+
+0:01:30.420,0:01:32.910
+locks should maintain
+
+0:01:32.910,0:01:34.319
+an ordering regard of each other.
+
+0:01:34.319,0:01:38.859
+That's not very simple when
+
+0:01:38.859,0:01:40.100
+you speak about a kernel.
+
+0:01:40.100,0:01:44.850
+From this point of view, the fact
+that there are 3 kinds of classes of locks
+
+0:01:44.850,0:01:49.180
+is going to count because you can
+never mix two different kinds of locks
+
+0:01:49.180,0:01:50.680
+for example
+
+0:01:50.680,0:01:51.610
+a spin lock
+
+0:01:51.610,0:01:53.770
+and a mutex.
+
+0:01:53.770,0:01:59.120
+You can mix in this way.
+
+0:01:59.120,0:02:01.720
+
+
+0:02:01.720,0:02:07.060
+So, you will see that this kind
+of deadlocks are possible
+
+0:02:07.060,0:02:09.290
+only in the same class of locks,
+
+0:02:09.290,0:02:13.019
+like for example 2 mutex or 2 spin mutex,
+
+0:02:13.019,0:02:14.569
+or such.
+
+0:02:14.569,0:02:16.090
+Um...
+
+0:02:16.090,0:02:17.409
+Also
+
+0:02:17.409,0:02:19.949
+Even if it's not very well documented,
+
+0:02:19.949,0:02:22.880
+for example, spin locks,
+
+0:02:22.880,0:02:26.599
+in previous deep, as a way to
+identify such kind of deadlocks.
+
+0:02:26.599,0:02:27.619
+And it's pretty much implemented...
+
+0:02:27.619,0:02:29.709
+a very much in it would
+
+0:02:29.709,0:02:32.449
+It's a feature enabled in the code.
+
+0:02:32.449,0:02:34.949
+They just count how many times they are spinning
+
+0:02:34.949,0:02:36.010
+and
+
+0:02:36.010,0:02:39.169
+if it exceeds
+
+0:02:39.169,0:02:41.379
+an exaggerated result,
+
+0:02:41.379,0:02:47.870
+it means that they are probable
+under a deadlock and the system panics.
+
+0:02:47.870,0:02:52.489
+Another common problem about locking
+is when you have
+
+0:02:52.489,0:02:55.189
+a wait channel
+
+0:02:55.189,0:02:56.659
+like a condition variable (cond var),
+
+0:02:56.659,0:02:58.849
+which it is protected by a lock.
+
+0:02:58.849,0:03:03.629
+If there are some ratio chances that
+this condition variable encounters,
+
+0:03:03.629,0:03:05.489
+for example,
+
+0:03:05.489,0:03:07.219
+when cond var are chased
+
+0:03:07.219,0:03:12.649
+with some preliminary conditions,
+like a waiter's counter
+
+0:03:12.649,0:03:16.359
+that has to be updated
+
+0:03:16.359,0:03:21.249
+anytime that any thread tries to sleep
+on the counter or flags to be set.
+
+0:03:21.249,0:03:24.909
+If these conditions are not protected
+by the same lock,
+
+0:03:24.909,0:03:30.569
+you can end up having some threads
+sleeping on this wait channel
+
+0:03:30.569,0:03:34.589
+and nobody is going to wake up them again.
+
+0:03:34.589,0:03:37.629
+This is usually called missed wakeup
+
+0:03:37.629,0:03:41.249
+and it's a pretty common mistake
+
+0:03:41.249,0:03:44.799
+that leads to a deadlock.
+
+0:03:44.799,0:03:46.719
+The problem is that
+
+0:03:46.719,0:03:52.109
+it's very difficult to differentiate
+between missed wakeup and,
+
+0:03:52.109,0:03:53.480
+for example,
+
+0:03:53.480,0:03:56.189
+forever sleep
+
+0:03:56.189,0:03:58.419
+of a thread
+
+0:03:58.419,0:04:01.859
+that is not likely to be awaken.
+
+0:04:01.859,0:04:07.109
+So these kind of deadlocks are
+very difficult to be discovered
+
+0:04:07.109,0:04:11.669
+and will require some bit of
+work that we will see right now.
+
+0:04:11.669,0:04:14.509
+For example,
+
+0:04:14.509,0:04:15.270
+using
+
+0:04:15.270,0:04:16.219
+some
+
+0:04:16.219,0:04:18.179
+kernel systems
+
+0:04:18.179,0:04:22.240
+and some things integrated into the debugger.
+
+0:04:22.240,0:04:22.979
+Um,
+
+0:04:22.979,0:04:25.520
+In FreeBSD,
+
+0:04:25.520,0:04:29.859
+we have quite a lot of good mechanisms
+we can use to cope
+
+0:04:29.859,0:04:32.080
+with kernel problems.
+
+0:04:32.080,0:04:36.539
+The first one (and the most important)
+is called witness.
+
+0:04:36.539,0:04:39.169
+It was introduced
+
+0:04:39.169,0:04:42.080
+in the context of SMPng
+
+0:04:42.080,0:04:44.979
+and has been written in the recent past,
+
+0:04:44.979,0:04:47.919
+mainly by a contribution of
+
+0:04:47.919,0:04:51.360
+Isilon systems.
+
+0:04:51.360,0:04:52.270
+They contributed back then
+
+0:04:52.270,0:04:54.989
+to the writing of witness.
+
+0:04:54.989,0:04:57.389
+This subsystem is very important
+
+0:04:57.389,0:05:02.730
+because it tracks down exactly every order
+
+0:05:02.730,0:05:03.949
+of the locks
+
+0:05:03.949,0:05:07.810
+So that, if there is an ordering violation like a LOR,
+
+0:05:07.810,0:05:09.439
+it's going to
+
+0:05:09.439,0:05:12.150
+tell the system.
+
+0:05:12.150,0:05:18.029
+You can even set it to directly panic if
+it finds some deadlocks,
+
+0:05:18.029,0:05:19.879
+or only
+
+0:05:19.879,0:05:22.729
+some possible deadlocks.
+
+0:05:22.729,0:05:27.260
+Another important feature with it is
+
+0:05:27.260,0:05:32.569
+that it can keep track of read/write locks.
+
+0:05:32.569,0:05:33.690
+Doing that,
+
+0:05:33.690,0:05:36.539
+we can identify
+
+0:05:36.539,0:05:38.419
+deadlocks possible
+
+0:05:38.419,0:05:39.500
+even
+
+0:05:39.500,0:05:40.690
+on the
+
+0:05:40.690,0:05:45.529
+reader's path.
+
+0:05:45.529,0:05:49.609
+We could say that witness is pretty big,
+
+0:05:49.609,0:05:52.289
+so activating it
+
+0:05:52.289,0:05:55.039
+in your production system is never an option.
+
+0:05:55.039,0:05:59.929
+It's mainly used when you are going to
+develop a new feature in the kernel
+
+0:05:59.929,0:06:02.110
+and you are going to test it heavily,
+
+0:06:02.110,0:06:05.479
+in particular if it has
+
+0:06:05.479,0:06:06.819
+some
+
+0:06:06.819,0:06:10.509
+relation to locking.
+
+0:06:10.509,0:06:13.089
+Uh,
+
+0:06:13.089,0:06:17.840
+We could also tell that with the new code
+provided by Isilon and Nokia,
+
+0:06:17.840,0:06:19.150
+basically
+
+0:06:19.150,0:06:21.689
+the orad
+
+0:06:21.689,0:06:25.479
+offered by witness is greatly reduced to about
+
+0:06:25.479,0:06:27.699
+the 10th part of
+
+0:06:27.699,0:06:30.240
+what we had before.
+
+0:06:30.240,0:06:36.150
+Witness is very good at tracking LOR,
+
+0:06:36.150,0:06:37.849
+but
+
+0:06:37.849,0:06:40.009
+it's not very good at, for example,
+
+0:06:40.009,0:06:42.449
+trying to
+
+0:06:42.449,0:06:44.060
+help you
+
+0:06:44.060,0:06:47.479
+in the case of lost wakeups,
+
+0:06:47.479,0:06:49.519
+because of its nature,
+
+0:06:49.519,0:06:52.090
+mainly.
+
+0:06:52.090,0:06:55.889
+It has a very good integration with the DDB debugger
+
+0:06:55.889,0:06:57.740
+and
+
+0:06:57.740,0:06:58.879
+basically
+
+0:06:58.879,0:07:04.159
+and basically, it's in the 8th release,
+we have new features
+
+0:07:04.159,0:07:05.759
+that help you
+
+0:07:05.759,0:07:08.389
+print out backtraces
+
+0:07:08.389,0:07:11.150
+of the contesting threads on the LORs
+
+0:07:11.150,0:07:16.039
+and their orderings
+
+0:07:16.039,0:07:17.549
+and
+
+0:07:17.549,0:07:23.550
+and shows some graphs of the relations.
+Even from the user space,
+
+0:07:23.550,0:07:28.550
+you don't have to go into the kernel
+degubber to look at it's output.
+
+0:07:28.550,0:07:35.550
+Well
+
+0:07:35.620,0:07:37.380
+...
+
+0:07:37.380,0:07:42.250
+Well, I see that sometimes when
+they are released there is a confusion
+
+0:07:42.250,0:07:44.250
+about the information reports
+
+0:07:44.250,0:07:48.440
+in regard of deadlocks conditions and what help
+
+0:07:48.440,0:07:50.020
+users can provide to developers
+
+0:07:50.020,0:07:52.039
+about that.
+
+0:07:52.039,0:07:54.020
+So we are going to see
+
+0:07:54.020,0:07:58.700
+all the relevant information
+when a deadlock
+
+0:07:58.700,0:07:59.590
+is in the kernel.
+
+0:07:59.590,0:08:02.490
+...
+
+0:08:02.490,0:08:03.389
+Usually,
+
+0:08:03.389,0:08:07.939
+if you want to find a deadlock
+that's happening in the kernel
+
+0:08:07.939,0:08:10.909
+your first line of analysis start from the DDB
+
+0:08:10.909,0:08:13.919
+instead of a post-mortem analysis,
+
+0:08:13.919,0:08:16.839
+which is even more important,
+
+0:08:16.839,0:08:22.330
+but using DDB you will get more
+processes and better information.
+
+0:08:22.330,0:08:24.970
+Uh,
+
+0:08:24.970,0:08:28.499
+The most important unit in order to find the deadlock
+
+0:08:28.499,0:08:34.389
+are the LORs reported by witness in order
+to see if there is something strange
+
+0:08:34.389,0:08:36.690
+that can be happening.
+
+0:08:36.690,0:08:41.700
+You want to know the state of all the threads
+that are running on the system that is deadlocking.
+
+0:08:41.700,0:08:42.900
+...
+
+0:08:42.900,0:08:47.050
+You can see that you're deadlocking if you see that
+
+0:08:47.050,0:08:48.070
+on the runqueue
+
+0:08:48.070,0:08:48.540
+there are
+
+0:08:48.540,0:08:51.850
+just idle threads.
+
+0:08:51.850,0:08:56.640
+Because it's like saying that
+runqueues are complete flushed
+
+0:08:56.640,0:09:02.450
+and you have all the threads sleeping
+in their own containers.
+
+0:09:02.450,0:09:07.850
+You need to know which are exactly locks
+that are acquired
+
+0:09:07.850,0:09:11.270
+in the system
+
+0:09:11.270,0:09:15.570
+and that's something that witness provides,
+
+0:09:15.570,0:09:20.720
+and the very important things is
+to know why the threads are stopping.
+
+0:09:20.720,0:09:24.250
+So one on the most important things is
+retrieving what the threads were doing
+
+0:09:24.250,0:09:26.320
+when
+
+0:09:26.320,0:09:28.960
+they were put asleep.
+
+0:09:28.960,0:09:30.070
+...
+
+0:09:30.070,0:09:33.009
+The backtraces of all the threads involved
+
+0:09:33.009,0:09:37.130
+are so printed out in order to identify deadlocks.
+
+0:09:37.130,0:09:38.589
+In the case that
+
+0:09:38.589,0:09:42.830
+buffered cache and VFS are
+
+0:09:42.830,0:09:45.910
+probably parts of the deadlocking
+
+0:09:45.910,0:09:50.790
+you should also print out
+
+0:09:50.790,0:09:53.420
+the info about vnodes
+
+0:09:53.420,0:09:58.250
+and what we're interested in is which vnodes are called,
+
+0:09:58.250,0:09:59.320
+which
+
+0:09:59.320,0:10:01.370
+are actually referenced
+
+0:10:01.370,0:10:03.530
+and
+
+0:10:03.530,0:10:10.530
+which way they were called.
+
+0:10:11.030,0:10:13.380
+So
+
+0:10:13.380,0:10:15.770
+this is an example
+
+0:10:15.770,0:10:17.430
+of the
+
+0:10:17.430,0:10:18.880
+thread states
+
+0:10:18.880,0:10:20.760
+in the case of a deadlock.
+
+0:10:20.760,0:10:27.480
+This is an example of a real deadlock
+
+0:10:27.480,0:10:28.900
+but you can see
+
+0:10:28.900,0:10:31.890
+that
+
+0:10:31.890,0:10:35.650
+this is not totally complete.
+
+0:10:35.650,0:10:38.450
+But you can see that all the threads are sleeping.
+
+0:10:38.450,0:10:39.870
+Uh...
+
+0:10:39.870,0:10:43.580
+And this one is the message
+
+0:10:43.580,0:10:44.790
+used by the wait channel
+
+0:10:44.790,0:10:47.550
+on which they're sleeping on
+
+0:10:47.550,0:10:48.710
+or used by
+
+0:10:48.710,0:10:54.480
+the container like the turnsile or the sleep queue.
+
+0:10:54.480,0:10:59.410
+If I recall correctly, it's a forced amount
+that does deadlocking at some point.
+
+0:10:59.410,0:11:01.290
+I'm not really sure
+
+0:11:01.290,0:11:04.190
+because I have to take a look at it.
+
+0:11:04.190,0:11:08.810
+You can see that the revelant command here
+is -ps
+
+0:11:08.810,0:11:11.220
+that DDB supports.
+
+0:11:11.220,0:11:14.220
+Um,
+
+0:11:14.220,0:11:17.520
+Another important thing
+
+0:11:17.520,0:11:18.820
+is getting
+
+0:11:18.820,0:11:21.680
+the situation of all CPUs.
+
+0:11:21.680,0:11:24.100
+As you can see there,
+
+0:11:24.100,0:11:25.210
+usually
+
+0:11:25.210,0:11:31.600
+is because you can add some data structures corrupted
+
+0:11:31.600,0:11:34.320
+in the per-CPU datas.
+
+0:11:34.320,0:11:38.830
+That's a very common situation when you get deadlocks,
+
+0:11:38.830,0:11:40.280
+because, for example,
+
+0:11:40.280,0:11:43.149
+leaving a corrupted LPD will lead
+
+0:11:43.149,0:11:48.750
+I loved you too much review shellacking double
+falls and things like that about that
+
+0:11:48.750,0:11:55.290
+to you having a bigger massive breakage like
+double-faults. In general. it's a good idea to
+look at all the CPUs involved in the system.
+
+0:11:55.290,0:11:57.310
+The command
+
+0:11:57.310,0:12:00.120
+is ""show allpcpu""
+
+0:12:00.120,0:12:04.960
+Um,
+
+0:12:04.960,0:12:06.959
+This one
+
+0:12:06.959,0:12:12.009
+this one is a witness specific command -show alllocks
+and it's going to show all the locks,
+
+0:12:12.009,0:12:13.130
+in the system,
+
+0:12:13.130,0:12:15.070
+who is the owner,
+
+0:12:15.070,0:12:15.850
+like in this case,
+
+0:12:15.850,0:12:17.690
+a mount,
+
+0:12:17.690,0:12:21.270
+and the thread is this one,
+
+0:12:21.270,0:12:23.660
+what lock is holding,
+
+0:12:23.660,0:12:24.970
+that's the address
+
+0:12:24.970,0:12:27.360
+and where it was acquired.
+
+0:12:27.360,0:12:31.140
+It gives you lines and file.
+
+0:12:31.140,0:12:32.770
+...
+
+0:12:32.770,0:12:34.730
+Actually,
+
+0:12:34.730,0:12:37.620
+that's just possible
+
+0:12:37.620,0:12:40.859
+with witness, because otherwise,
+
+0:12:40.859,0:12:44.410
+trying to keep the oldest information
+
+0:12:44.410,0:12:51.410
+in a general purpose kernel will be
+very expensive for our logging subsystem.
+
+0:12:55.330,0:12:59.730
+Then, the most important thing is
+the backtrace for any thread.
+
+0:12:59.730,0:13:01.150
+...
+
+0:13:01.150,0:13:03.390
+It's going to show the backtrace
+
+0:13:03.390,0:13:05.700
+for old threads.
+
+0:13:05.700,0:13:08.380
+the seas
+
+0:13:08.380,0:13:09.169
+In this case,
+
+0:13:09.169,0:13:13.010
+the thread with these addresses TID and PID
+
+0:13:13.010,0:13:15.350
+basically got sleeping
+
+0:13:15.350,0:13:17.140
+on a pnode.
+
+0:13:17.140,0:13:22.020
+You will see that the backend in this case is FFF
+
+0:13:22.020,0:13:24.000
+and
+
+0:13:24.000,0:13:25.729
+that's the context switching function,
+
+0:13:25.729,0:13:26.900
+next
+
+0:13:26.900,0:13:32.220
+those are the sleepqueues of the containter
+that is containing the threads,
+
+0:13:32.220,0:13:34.230
+and this one
+
+0:13:34.230,0:13:36.370
+is what it was going to do
+
+0:13:36.370,0:13:37.910
+before,
+
+0:13:37.910,0:13:41.810
+in this case mounting the filesystems.
+
+0:13:41.810,0:13:47.220
+You will see that on a complete feeding,
+
+0:13:47.220,0:13:50.310
+you will have a lot of these kinds of traces,
+
+0:13:50.310,0:13:53.079
+but they are very important
+
+0:13:53.079,0:13:59.270
+so as developers to understand what is going on.
+
+0:13:59.270,0:14:02.590
+This ones are the locked vnodes
+
+0:14:02.590,0:14:05.830
+that are also very important when
+
+0:14:05.830,0:14:11.780
+a deadlock happens in VFS or in the buffer cache.
+
+0:14:11.780,0:14:13.700
+You will see
+
+0:14:13.700,0:14:18.580
+that these are the ref counts linked to vnodes,
+
+0:14:18.580,0:14:20.980
+they are specific
+
+0:14:20.980,0:14:23.850
+to some handling of the vnodes such as recycling,
+
+0:14:23.850,0:14:26.020
+and completely freeing
+
+0:14:26.020,0:14:27.290
+that's the mount point
+
+0:14:27.290,0:14:28.770
+where the vnodes
+
+0:14:28.770,0:14:31.740
+belong
+
+0:14:31.740,0:14:33.930
+and
+
+0:14:33.930,0:14:35.910
+that is the backtrace
+
+0:14:35.910,0:14:39.760
+of what happened when the vnode
+
+0:14:39.760,0:14:41.060
+was acquired.
+
+0:14:41.060,0:14:46.600
+You can see that this comment also gives you information
+
+0:14:46.600,0:14:49.000
+about the lock linked to the vnode.
+
+0:14:49.000,0:14:51.640
+For example, it tells you that
+
+0:14:51.640,0:14:52.830
+the lock
+
+0:14:52.830,0:14:55.040
+the lock is in exclusive mode
+
+0:14:55.040,0:14:56.280
+and
+
+0:14:56.280,0:14:59.320
+it does some shared waits
+
+0:14:59.320,0:15:03.260
+on its queues.
+
+0:15:03.260,0:15:04.090
+That's also
+
+0:15:04.090,0:15:06.370
+the node number
+
+0:15:06.370,0:15:09.140
+...
+
+0:15:09.140,0:15:13.880
+there are also under information you could receive
+from the DDB linked to, for example,
+
+0:15:13.880,0:15:16.980
+the bugging deadlocks,
+
+0:15:16.980,0:15:18.100
+like sleep chains,
+
+0:15:18.100,0:15:19.310
+for any
+
+0:15:19.310,0:15:24.250
+wait channel if you have the address
+
+0:15:24.250,0:15:27.150
+and, for example,
+
+0:15:27.150,0:15:32.650
+you can also print the wall table of
+the lock relations from witness
+
+0:15:32.650,0:15:38.010
+but it's mostly never useful
+because you already know that.
+
+0:15:38.010,0:15:41.100
+So you will just need to know which is the one
+
+0:15:41.100,0:15:41.980
+that
+
+0:15:41.980,0:15:43.019
+um,
+
+0:15:43.019,0:15:47.750
+can give trouble.
+
+0:15:47.750,0:15:51.640
+if you're going to say I mean some problem
+is a problem
+
+0:15:51.640,0:15:53.980
+So if you are going to submit some problems
+
+0:15:53.980,0:15:57.180
+usually called NGAP that are probably
+deadlocked in the kernel space,
+
+0:15:57.180,0:16:04.130
+that ones
+
+0:16:04.130,0:16:11.130
+are the information that we actually need.
+
+0:16:11.650,0:16:14.970
+Now,
+
+0:16:14.970,0:16:18.950
+it's very difficult to see very good reports
+about deadlocks,
+
+0:16:18.950,0:16:20.020
+so
+
+0:16:20.020,0:16:22.569
+I think that
+
+0:16:22.569,0:16:25.670
+it is a very good thing to talk about it.
+
+0:16:25.670,0:16:31.420
+Along with the witness, we have another
+mechanism that could help us,
+
+0:16:31.420,0:16:34.620
+and it's called KTR.
+
+0:16:34.620,0:16:36.100
+KTR is
+
+0:16:36.100,0:16:40.630
+basically a logger of events,
+
+0:16:40.630,0:16:42.550
+it's
+
+0:16:42.550,0:16:45.090
+highly configurable,
+
+0:16:45.090,0:16:48.280
+as you can, for example,
+handle different classes of events,
+
+0:16:48.280,0:16:53.940
+In FreeBSD we have
+
+0:16:53.940,0:16:55.130
+classes linked to the scheduler,
+
+0:16:55.130,0:16:56.290
+to the locking,
+
+0:16:56.290,0:16:58.520
+to the VFS, to callouts,
+
+0:16:58.520,0:17:05.030
+and they are all packed in the same class.
+
+0:17:05.030,0:17:08.880
+So they can be selectively enabled or masked.
+
+0:17:08.880,0:17:10.030
+For example
+
+0:17:10.030,0:17:12.190
+the difference is that you can
+enable several classes,
+
+0:17:12.190,0:17:13.610
+like
+
+0:17:13.610,0:17:16.470
+the time class of the KTR
+
+0:17:16.470,0:17:21.000
+and then you are just interested in three of them
+even if all ten of them are
+
+0:17:21.000,0:17:23.030
+actually tracked.
+
+0:17:23.030,0:17:24.240
+You will are just going to
+
+0:17:24.240,0:17:26.839
+see three of them.
+
+0:17:26.839,0:17:28.439
+Um,
+
+0:17:28.439,0:17:31.940
+an important thing is that KTR
+
+0:17:31.940,0:17:34.520
+doesn't handle,
+
+0:17:34.520,0:17:37.770
+for example
+
+0:17:37.770,0:17:38.300
+pointers,
+
+0:17:38.300,0:17:40.340
+doesn't store the information
+
+0:17:40.340,0:17:45.450
+passed to init, it just stores the pointer
+
+0:17:45.450,0:17:46.880
+and not the information,
+
+0:17:46.880,0:17:48.390
+for example,
+
+0:17:48.390,0:17:50.160
+a string,
+
+0:17:50.160,0:17:55.000
+it doesn't make copies, you need to just pass
+the pointers
+
+0:17:55.000,0:17:57.610
+which are persistent in the memory.
+
+0:17:57.610,0:18:00.340
+Otherwise,
+
+0:18:00.340,0:18:05.500
+KTR won't be able to access them.
+
+0:18:05.500,0:18:09.760
+The good thing about KTR is that
+
+0:18:09.760,0:18:11.600
+you can also look at it from the user space
+
+0:18:11.600,0:18:13.430
+through the ktrdump interface.
+
+0:18:13.430,0:18:15.820
+...
+
+0:18:15.820,0:18:17.030
+And now...
+
+0:18:17.030,0:18:19.669
+Why is that important for locking?
+
+0:18:19.669,0:18:20.279
+Because
+
+0:18:20.279,0:18:21.090
+after,
+
+0:18:21.090,0:18:24.350
+it can tell you what happed,
+
+0:18:24.350,0:18:27.260
+on which CPU branches,
+
+0:18:27.260,0:18:30.020
+and the order it happened in.
+
+0:18:30.020,0:18:34.580
+and this is very important when you're
+going to track down for example races,
+
+0:18:34.580,0:18:37.720
+when you are not sure about the order of operations and
+
+0:18:37.720,0:18:44.710
+how they happened. It's going to tell you.
+
+0:18:44.710,0:18:46.290
+For example
+
+0:18:46.290,0:18:48.650
+that is
+
+0:18:48.650,0:18:51.090
+a typical trace of KTR,
+
+0:18:51.090,0:18:52.410
+where you have
+
+0:18:52.410,0:18:56.890
+the CPU where the event happened, the index,
+
+0:18:56.890,0:18:58.620
+that's a timestamp,
+
+0:18:58.620,0:19:03.400
+I think it's retrieved directly from the TSC,
+but i'm not actually sure.
+
+0:19:03.400,0:19:04.889
+In this case,
+
+0:19:04.889,0:19:10.210
+i was tracking down the scheduler class,
+
+0:19:10.210,0:19:16.100
+so I was interested mainly in scheduler
+workloads and i could see
+
+0:19:16.100,0:19:19.210
+for example
+
+0:19:19.210,0:19:21.100
+that
+
+0:19:21.100,0:19:24.870
+a contact switch happened
+
+0:19:24.870,0:19:26.919
+scheduling
+
+0:19:26.919,0:19:28.010
+the idle CPU
+
+0:19:28.010,0:19:30.270
+and then other information,
+
+0:19:30.270,0:19:34.370
+like for example the load of the CPU 1
+
+0:19:34.370,0:19:37.190
+and scan priority boost,
+
+0:19:37.190,0:19:38.870
+like
+
+0:19:38.870,0:19:40.310
+this one
+
+0:19:40.310,0:19:46.420
+and other things.
+
+0:19:46.420,0:19:48.770
+Well
+
+0:19:48.770,0:19:50.820
+you can enable it
+
+0:19:50.820,0:19:55.280
+the option KTR, but you must handle it carefully.
+
+0:19:55.280,0:19:57.130
+For example
+
+0:19:57.130,0:20:01.990
+use an option i didn't include here,
+
+0:20:01.990,0:20:07.410
+which is the length of the buffer it uses
+to store the pointers in, it's called KTR_ENTRIES,
+
+0:20:07.410,0:20:08.360
+and you should specify
+
+0:20:08.360,0:20:09.590
+enough entries
+
+0:20:09.590,0:20:11.500
+to have a reliable tracking.
+
+0:20:11.500,0:20:13.580
+...
+
+0:20:13.580,0:20:16.780
+For example, if you are going to track a lot of events,
+
+0:20:16.780,0:20:19.100
+a short queue is not an option,
+
+0:20:19.100,0:20:22.120
+because you are going to lose some information.
+
+0:20:22.120,0:20:26.780
+A typical queue is of length 2K (2 kilobytes)
+
+0:20:26.780,0:20:29.710
+of entries.
+
+0:20:29.710,0:20:32.520
+These other options
+
+0:20:32.520,0:20:36.190
+let you compile some classes,
+
+0:20:36.190,0:20:38.370
+or mask them,
+
+0:20:38.370,0:20:43.770
+or mask a CPU,
+
+0:20:43.770,0:20:46.289
+if you have a big SMP environment,
+
+0:20:46.289,0:20:50.160
+so that you can selectively enable some of them.
+
+0:20:50.160,0:20:54.700
+For example, this is very good for
+tracking down races in the sleeping queue.
+
+0:20:54.700,0:21:01.700
+You can find referrals here.
+
+0:21:02.770,0:21:04.820
+...
+
+0:21:04.820,0:21:06.220
+So,
+
+0:21:06.220,0:21:09.020
+I will spend the last time of the speech
+speaking about possible improvements
+
+0:21:09.020,0:21:10.500
+to
+
+0:21:10.500,0:21:15.670
+our locking system, which is not very bad.
+
+0:21:15.670,0:21:16.500
+Well,
+
+0:21:16.500,0:21:21.750
+I think that actually our locking system
+is pretty complete,
+
+0:21:21.750,0:21:26.919
+but it's also confusing for newcomers,
+it's not widely documented.
+
+0:21:26.919,0:21:32.280
+so maybe we will spend a good amount of time
+on documentation.
+
+0:21:32.280,0:21:32.799
+As you can see,
+
+0:21:32.799,0:21:38.120
+even in this presentation, which is not very huge,
+
+0:21:38.120,0:21:42.540
+there are many things to say
+
+0:21:42.540,0:21:46.700
+and that are not very simple to understand in particular
+
+0:21:46.700,0:21:48.240
+for people
+
+0:21:48.240,0:21:50.280
+who just need to do simple tasks.
+
+0:21:50.280,0:21:56.660
+For example, I saw a lot of guys coming from Linux World
+
+0:21:56.660,0:22:00.620
+who wanted to actually use spinlocks for time
+
+0:22:00.620,0:22:05.720
+it's obvious they are missing something from our
+architecture
+
+0:22:05.720,0:22:07.250
+From, uh, ...
+
+0:22:07.250,0:22:11.010
+just a technical point of view,
+
+0:22:11.010,0:22:14.530
+it would be very good if we could remove
+
+0:22:14.530,0:22:20.440
+legacy support and overriding support. For example,
+
+0:22:20.440,0:22:22.900
+we have lockmgr and sxlog
+
+0:22:22.900,0:22:27.990
+which are both read/write locks and
+are both serverd by sleep queues.
+
+0:22:27.990,0:22:31.800
+They have some differences, obviously,
+
+0:22:31.800,0:22:32.660
+but, mainly,
+
+0:22:32.660,0:22:38.920
+we could manage the missing bits and
+just one of the 2 interfaces.
+
+0:22:38.920,0:22:42.059
+on the scene where he hasn't told you before
+visiting all
+
+0:22:42.059,0:22:43.920
+In the same way, as i told you before,
+
+0:22:43.920,0:22:47.340
+the sleeping point, true end sleep,
+read/write sleep and sxsleep
+
+0:22:47.340,0:22:50.350
+should probably be merged with cond vars
+
+0:22:50.350,0:22:52.930
+and superdoff our kernel
+
+0:22:52.930,0:22:55.070
+and we should probably drop sema,
+
+0:22:55.070,0:23:00.290
+because it is obsolete, and can be
+replaced by condova and mutex.
+
+0:23:00.290,0:23:02.620
+...
+
+0:23:02.620,0:23:03.830
+From
+
+0:23:03.830,0:23:05.639
+a strong technical point of view,
+
+0:23:05.639,0:23:07.350
+as you can see,
+
+0:23:07.350,0:23:09.680
+we spent a lot of time
+
+0:23:09.680,0:23:12.109
+on optimization on our blocking primitives,
+
+0:23:12.109,0:23:16.770
+but very few on our spinning primitives.
+
+0:23:16.770,0:23:21.810
+That's because obviously blocking
+primitives are our first choice,
+
+0:23:21.810,0:23:22.700
+but
+
+0:23:22.700,0:23:25.210
+spinlocks could be improved too,
+
+0:23:25.210,0:23:30.130
+using technics such as the so-called
+back-off algorithms and
+
+0:23:30.130,0:23:31.429
+queued spinlocks.
+
+0:23:31.429,0:23:33.560
+I have a patch for that but
+
+0:23:33.560,0:23:35.499
+I would really need
+
+0:23:35.499,0:23:39.999
+to test it and tune it on a big SMP environment.
+
+0:23:39.999,0:23:42.270
+I don't think
+
+0:23:42.270,0:23:43.540
+that, for now,
+
+0:23:43.540,0:23:45.730
+they can handle such an environment.
+
+0:23:45.730,0:23:47.820
+In addition,
+
+0:23:47.820,0:23:51.500
+I'm not sure you are familiar with
+queued spinlocks algorithms.
+
+0:23:51.500,0:23:55.580
+Basically a back-off algorithms try
+
+0:23:55.580,0:24:02.250
+to reduce the cache pressure on the
+
+0:24:02.250,0:24:06.090
+threads contesting on the lock
+
+0:24:06.090,0:24:08.270
+by giving a time.
+
+0:24:08.270,0:24:10.290
+Instead, the other one
+
+0:24:10.290,0:24:16.180
+uses spinning on a local variable
+which is not shared by the threads
+
+0:24:16.180,0:24:18.030
+and the time spent
+
+0:24:18.030,0:24:20.140
+on that
+
+0:24:20.140,0:24:22.780
+local variable increases
+
+0:24:22.780,0:24:25.440
+um...
+
+0:24:25.440,0:24:28.320
+...
+
+0:24:28.320,0:24:31.780
+with the passing of time.
+
+0:24:31.780,0:24:35.950
+Another interesting thing would be benchmarking
+
+0:24:35.950,0:24:37.930
+different wake-up algorithms for blocking primitives.
+
+0:24:37.930,0:24:42.390
+We have an algorithms that has proven to be
+
+0:24:42.390,0:24:42.910
+...
+
+0:24:42.910,0:24:45.200
+quite good
+
+0:24:45.200,0:24:47.440
+but
+
+0:24:47.440,0:24:51.330
+we are not confronted with other kind of
+wake-ups that could have
+
+0:24:51.330,0:24:56.450
+a higher overhead but could give time improvements
+
+0:24:56.450,0:24:59.760
+on a big SMP environment.
+
+0:24:59.760,0:25:02.500
+Um, ...
+
+0:25:02.500,0:25:07.000
+another thing that would be very interesting
+to fix is the priority inversion problem
+
+0:25:07.000,0:25:08.670
+in the case of read locks.
+
+0:25:08.670,0:25:09.790
+There is an approach
+
+0:25:09.790,0:25:13.950
+called dominant record implemented in FreeBSD,
+
+0:25:13.950,0:25:16.360
+but i saw that it's pretty
+
+0:25:16.360,0:25:22.160
+slow for our fastpack. In FreeBSD all our locking
+primitives are broken and add to path
+
+0:25:22.160,0:25:23.290
+where
+
+0:25:23.290,0:25:25.820
+the fastpack is
+
+0:25:25.820,0:25:28.620
+is often just a single atomic operation,
+
+0:25:28.620,0:25:30.010
+and
+
+0:25:30.010,0:25:33.770
+if it fails
+
+0:25:33.770,0:25:36.900
+it falls down and the art pattern tries to do
+all the end-work of the sleep queues.
+
+0:25:36.900,0:25:40.210
+in this case the owner of record technic
+was going to make the fastpack too simple
+
+0:25:40.210,0:25:42.640
+Basically
+
+0:25:42.640,0:25:46.310
+it just considers
+
+0:25:46.310,0:25:50.940
+the readets as one
+
+0:25:50.940,0:25:55.380
+and can switch in and out.
+
+0:25:55.380,0:26:02.210
+And it practically lands the priority to this
+owner of record which does it's right log.
+
+0:26:02.210,0:26:06.900
+...
+
+0:26:06.900,0:26:11.900
+Another important thing obviously is improving locking
+
+0:26:11.900,0:26:13.420
+where the
+
+0:26:13.420,0:26:15.649
+optimum approach is not chosen.
+
+0:26:15.649,0:26:21.039
+I see a lot of parts in which the
+primitives chosen by the developers
+
+0:26:21.039,0:26:23.320
+are not the most suitable ones
+
+0:26:23.320,0:26:27.690
+and we should switch to the right one.
+
+0:26:27.690,0:26:33.120
+Like, for example, in discriminated usage
+of the spinlocks
+
+0:26:33.120,0:26:34.990
+or the blocking primitives,
+
+0:26:34.990,0:26:40.430
+just to handle cases like that,
+like the one we saw before with the malloc command,
+
+0:26:40.430,0:26:44.070
+that needs to sleep.
+
+0:26:44.070,0:26:44.320
+Questions?