Discussion: dbwrap_record_watch_send/recv
Volker Lendecke
2012-02-15 15:59:41 UTC
Hi!

Under

http://git.samba.org/?p=vl/samba.git/.git;a=shortlog;h=refs/heads/dbwrap_record_watch

find a patchset that I've been working on for a while now.

It implements the following API:

struct tevent_req *dbwrap_record_watch_send(TALLOC_CTX *mem_ctx,
                                            struct tevent_context *ev,
                                            struct db_record *rec,
                                            struct messaging_context *msg);
NTSTATUS dbwrap_record_watch_recv(struct tevent_req *req,
                                  TALLOC_CTX *mem_ctx,
                                  struct db_record **prec);

The central idea is that you can asynchronously wait for a
dbwrap based tdb record to change. The top commit in the git
branch shows the first user: it simplifies the g_lock
implementation. The new implementation tries to acquire a
lock. If that fails due to a lock conflict, it waits for the
g_lock record to change. Upon change, it just tries again.
The old logic had to cope
with pending records and an ugly hack into ctdb itself. As
a bonus, we now get a really clean async
g_lock_lock_send/recv that can asynchronously wait for a
global lock. This would have been almost impossible to do
without the dbwrap_record_watch infrastructure.
Just for the g_lock implementation it would not have been
worth the trouble to implement that API and the
infrastructure around it, but if you look at our share mode
and oplock implementation, a lot of the custom smbd messages
can be replaced by the new API. For example, we have a
special message to inform a second opener about an oplock
being released. This can be simplified by sending a message
to the oplock holder and then watching the record for a
change. After a change, just retry. The same holds true for
timed byte range locks and a few others.
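
To make the "watch and retry" pattern concrete, here is a
minimal sketch built only on the prototypes quoted above. It
is not code from the branch: try_to_acquire() and struct
retry_state are made-up placeholders, headers and most error
handling are omitted, and when exactly the watched record
gets unlocked again is left to the patchset.

struct retry_state {
        struct tevent_context *ev;
        struct messaging_context *msg;
        struct db_context *db;
        TDB_DATA key;
};

static void retry_watch_done(struct tevent_req *subreq);

static void retry_try_lock(struct retry_state *state)
{
        struct db_record *rec;
        struct tevent_req *subreq;

        rec = dbwrap_fetch_locked(state->db, state, state->key);
        if (rec == NULL) {
                return;
        }
        if (try_to_acquire(rec)) {      /* hypothetical: no conflict */
                TALLOC_FREE(rec);
                return;
        }
        /*
         * Conflict: register as a watcher while the record is
         * still locked, then wait for it to change.
         */
        subreq = dbwrap_record_watch_send(state, state->ev, rec,
                                          state->msg);
        if (subreq == NULL) {
                TALLOC_FREE(rec);
                return;
        }
        tevent_req_set_callback(subreq, retry_watch_done, state);
}

static void retry_watch_done(struct tevent_req *subreq)
{
        struct retry_state *state = tevent_req_callback_data(
                subreq, struct retry_state);
        struct db_record *rec = NULL;
        NTSTATUS status;

        status = dbwrap_record_watch_recv(subreq, state, &rec);
        TALLOC_FREE(subreq);
        if (!NT_STATUS_IS_OK(status)) {
                return;
        }
        TALLOC_FREE(rec);
        retry_try_lock(state);  /* the record changed, try again */
}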

What fell out of this work is the start of a reworked
messaging API: we now have msg_read_send/recv, a tevent_req
based version of messaging_register.
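
For illustration only, a read from such a stream could look
roughly like the snippet below. The prototypes in the comment
are guesses made for this example (the real ones are in the
branch), as is the name wait_for_message().

/*
 * Assumed prototypes, for illustration only:
 *
 *   struct tevent_req *msg_read_send(TALLOC_CTX *mem_ctx,
 *                                    struct tevent_context *ev,
 *                                    struct msg_stream *stream);
 *   int msg_read_recv(struct tevent_req *req, TALLOC_CTX *mem_ctx,
 *                     struct messaging_rec **prec);
 */

static void got_message(struct tevent_req *subreq);

static void wait_for_message(TALLOC_CTX *mem_ctx,
                             struct tevent_context *ev,
                             struct msg_stream *stream)
{
        struct tevent_req *subreq;

        subreq = msg_read_send(mem_ctx, ev, stream);
        if (subreq == NULL) {
                return;
        }
        tevent_req_set_callback(subreq, got_message, NULL);
}

static void got_message(struct tevent_req *subreq)
{
        struct messaging_rec *rec = NULL;
        int ret;

        ret = msg_read_recv(subreq, talloc_tos(), &rec);
        TALLOC_FREE(subreq);
        if (ret != 0) {
                return;
        }
        /*
         * Handle rec->buf here, then issue the next msg_read_send()
         * to get messaging_register-like "one callback per message"
         * behaviour.
         */
        TALLOC_FREE(rec);
}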

One consequence of using this API throughout smbd would be a
vastly improved cleanup behaviour after a crashed smbd.
Right now we have custom code to periodically walk the
brlock database. We do not have code to walk locking.tdb,
for good reason. You just don't want to traverse a database
of 100,000 open files when maybe 100 of those are waiting
for an oplock break. By using the dbwrap_watchers.tdb for
everyone waiting for a change, it becomes much more feasible
to walk this whole db and wake up all waiters whenever an
smbd dies or a node goes down.
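
As a rough sketch of what such a wakeup walk over
dbwrap_watchers.tdb could look like (not code from the
branch: the record layout, parse_watchers() and the message
type are assumptions made up for illustration):

struct wakeup_state {
        struct messaging_context *msg;
};

static int wakeup_fn(struct db_record *rec, void *private_data)
{
        struct wakeup_state *state = private_data;
        TDB_DATA value = dbwrap_record_get_value(rec);
        struct server_id *watchers = NULL;
        size_t i, num_watchers = 0;

        /*
         * Hypothetical helper: the value is assumed to hold the
         * server_ids of all processes watching this record.
         */
        if (!parse_watchers(value, talloc_tos(), &watchers,
                            &num_watchers)) {
                return 0;
        }

        for (i = 0; i < num_watchers; i++) {
                if (!serverid_exists(&watchers[i])) {
                        continue;       /* stale entry, watcher died */
                }
                /* hypothetical message type: tell the waiter to re-check */
                messaging_send(state->msg, watchers[i],
                               MSG_DBWRAP_WATCH_HYPOTHETICAL,
                               &data_blob_null);
        }
        TALLOC_FREE(watchers);
        return 0;
}

static void wakeup_all_watchers(struct db_context *watchers_db,
                                struct messaging_context *msg)
{
        struct wakeup_state state = { .msg = msg };

        dbwrap_traverse(watchers_db, wakeup_fn, &state, NULL);
}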

This patchset is not perfect yet. One missing piece is
proper cleanup: the code does not yet remove stale entries
when a waiter dies hard.

Comments?

Volker
--
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt at sernet.de
Christian Ambach
2012-02-16 12:35:15 UTC
Post by Volker Lendecke
One consequence of using this API throughout smbd would be a
vastly improved cleanup behaviour after a crashed smbd.
Right now we have custom code to periodically walk the
brlock database. We do not have code to walk locking.tdb,
for good reason. You just don't want to traverse a database
of 100,000 open files when maybe 100 of those are waiting
for an oplock break. By using the dbwrap_watchers.tdb for
everyone waiting for a change, it becomes much more feasible
to walk this whole db and wake up all waiters whenever an
smbd dies or a node goes down.
This piece sounds very promising! (especially for the clustered case)

Cheers,
Christian
Stefan (metze) Metzmacher
2012-02-17 09:06:55 UTC
Hi Volker,
Post by Volker Lendecke
Under
http://git.samba.org/?p=vl/samba.git/.git;a=shortlog;h=refs/heads/dbwrap_record_watch
Comments?
That looks like a really nice infrastructure!
It'll really help a lot to clean up our code.

I have just a few cosmetic remarks:

When first reading the name 'msg_stream', I was a bit confused, because
it could imply that we're using some sort of stream sockets.
After reading the code, my impression is that 'msg_channel' could be a
better name, but I'm not a native speaker... (native speakers, please
comment on this). What do you think?

Did you also intend to later add msg_write_send/recv, based on a
'struct msg_stream' (or whatever name it ends up with)?

For msg_stream_send/recv I propose adding '_setup' to the function
name: 'msg_stream_setup_send/recv'.

Why does dbwrap_record_watch_send() use the sync msg_stream() function
instead of msg_stream_send/recv?

metze

Michael Adam
2012-02-17 11:58:37 UTC
Post by Stefan (metze) Metzmacher
Post by Volker Lendecke
http://git.samba.org/?p=vl/samba.git/.git;a=shortlog;h=refs/heads/dbwrap_record_watch
Comments?
That looks like a really nice infrastructure!
It'll really help a lot to clean up our code.
Indeed! The cleanup and simplification of the g_lock code
with this mechanism is already impressive. It is a really
good and natural abstraction of the various mechanisms for
waiting for a record entry to become available, etc.
I also especially like the clean (and async) new
ctdb_conn (replacement for ctdbd_conn).
Post by Stefan (metze) Metzmacher
When first reading the name 'msg_stream', I was a bit confused, because
it could imply that we're using some sort of stream sockets.
After reading the code, my impression is that 'msg_channel' could be a
better name, but I'm not a native speaker... (native speakers, please
comment on this). What do you think?
Not being a native speaker either, I think this is a really good
proposal. I was also somewhat confused by the name. For me, the name
"msg_channel" seems to nicely reflect the concept implemented.

Cheers - Michael

Volker Lendecke
2012-02-17 15:01:28 UTC
Post by Stefan (metze) Metzmacher
Post by Volker Lendecke
Under
http://git.samba.org/?p=vl/samba.git/.git;a=shortlog;h=refs/heads/dbwrap_record_watch
Comments?
That looks like a really nice infrastructure!
It'll really help a lot to clean up our code.
When first reading the name 'msg_stream', I was a bit confused, because
it could imply that we're using some sort of stream sockets.
After reading the code, my impression is that 'msg_channel' could be a
better name, but I'm not a native speaker... (native speakers, please
comment on this). What do you think?
Maybe you're right. Others?
Post by Stefan (metze) Metzmacher
Did you also intend to later add msg_write_send/recv, based on a
'struct msg_stream' (or whatever name it ends up with)?
msg_stream right now is bound to a msg_type. This would not
be appropriate for sending messages, I think.
Post by Stefan (metze) Metzmacher
For msg_stream_send/recv I propose adding '_setup' to the function
name: 'msg_stream_setup_send/recv'.
Ok, sounds reasonable.
Post by Stefan (metze) Metzmacher
Why does dbwrap_record_watch_send() use the sync msg_stream() function
instead of msg_stream_send/recv?
I had that first, but it does not work: you do not want to
unlock the record you want to watch before you are sure you
will receive all messages correctly, i.e. you need to wait
for the msg_stream setup. But this would mean entering the
main event loop with a locked record, something which I
would REALLY like to avoid.

Acquiring a msg_stream is async right now because we have to
register a SRVID with ctdb, which in theory can block. But I
am planning to severely tune ctdb_conn anyway. Possibly I
would have just one single socket into ctdb and hide this
fact behind the API. We could then cache the registrations
and discard messages that we have registered for with ctdb
but for which we don't currently have any ctdb_msg_streams.
If we did that, acquiring the msg_stream would normally be
non-blocking.

This is a dirty corner of that patchset, but I have not found
a clean way around the problem yet. If you have any good
ideas, I would be very interested to hear them.

Volker
--
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt at sernet.de
Christian Ambach
2012-02-17 15:05:27 UTC
Post by Volker Lendecke
Post by Stefan (metze) Metzmacher
When first reading the name 'msg_stream', I was a bit confused, because
it could imply that we're using some sort of stream sockets.
After reading the code, my impression is that 'msg_channel' could be a
better name, but I'm not a native speaker... (native speakers, please
comment on this). What do you think?
Maybe you're right. Others?
I had the same thought about stream sockets when looking at the name
for the first time. So I concur on giving this a less confusing name.

Cheers,
Christian
