Discussion:
Patch to support Scalable CTDB
Partha Sarathi via samba-technical
2018-04-27 00:14:54 UTC
Hi,

We have a requirement to support a large cluster, i.e. scaling from 20 nodes
to 50 nodes, and the current CTDB design may not scale linearly with the
number of nodes, because replicating all the lock-related TDBs and
traversing the TDBs for every record may slow down performance.

The product requirement is to create a cluster with a large number of nodes,
say 50, and to subgroup them, three to five nodes at a time, into multiple
protocol heads. Each protocol head group will host a specific set of shares,
not all of them. So we took the approach of running two instances of CTDBD
on each node.

1) The primary ctdbd (Persistent ctdbd) is responsible only for replicating
the persistent TDBs across the whole cluster (in our case 50 nodes), to
maintain the AD registered account details and to support a single global
namespace across the large cluster.

2) The secondary instance (Locking ctdbd) is responsible for replicating and
traversing the lock-related TDBs within the protocol head group. That reduces
the latency of TDB transactions (expensive when the number of nodes is large)
by communicating only within the limited node group.

3) Smbd is changed so that it communicates with these two CTDBD instances
over different ctdbd sockets. Message initialization and connection handling
have been taken care of.

So that the two ctdbd instances run independently, they are configured
separately and listen on different ctdb ports (4379 and 4380 respectively).



With the attached patch, I can see both instances running as expected, as
shown below:

***@oneblox1:/opt/exablox/config# ctdb -s /var/run/ctdb/persistent_ctdbd.socket getdbmap
Number of databases:7
dbid:0x88d96253 name:smbpasswd.tdb path:/exablox/pctdb/persistent/smbpasswd.tdb.1 PERSISTENT
dbid:0x2ca251cf name:account_policy.tdb path:/exablox/pctdb/persistent/account_policy.tdb.1 PERSISTENT
dbid:0xa1413774 name:group_mapping.tdb path:/exablox/pctdb/persistent/group_mapping.tdb.1 PERSISTENT
dbid:0xc3078fba name:share_info.tdb path:/exablox/pctdb/persistent/share_info.tdb.1 PERSISTENT
dbid:0x6cf2837d name:registry.tdb path:/exablox/pctdb/persistent/registry.tdb.1 PERSISTENT
*dbid:0x7132c184 name:secrets.tdb path:/exablox/pctdb/persistent/secrets.tdb.1 PERSISTENT*
*dbid:0x6645c6c4 name:ctdb.tdb path:/exablox/pctdb/persistent/ctdb.tdb.1 PERSISTENT*


***@oneblox1:/opt/exablox/config# ctdb -s /var/run/ctdb/ctdbd.socket getdbmap
Number of databases:11
dbid:0x477d2e20 name:smbXsrv_client_global.tdb path:/var/tmp/samba/ctdb/smbXsrv_client_global.tdb.1
dbid:0x66f71b8c name:smbXsrv_open_global.tdb path:/var/tmp/samba/ctdb/smbXsrv_open_global.tdb.1
dbid:0x9ec2a880 name:serverid.tdb path:/var/tmp/samba/ctdb/serverid.tdb.1
dbid:0x06916e77 name:leases.tdb path:/var/tmp/samba/ctdb/leases.tdb.1
*dbid:0x7a19d84d name:locking.tdb path:/var/tmp/samba/ctdb/locking.tdb.1*
dbid:0x4e66c2b2 name:brlock.tdb path:/var/tmp/samba/ctdb/brlock.tdb.1
dbid:0x68c12c2c name:smbXsrv_tcon_global.tdb path:/var/tmp/samba/ctdb/smbXsrv_tcon_global.tdb.1
dbid:0x6b06a26d name:smbXsrv_session_global.tdb path:/var/tmp/samba/ctdb/smbXsrv_session_global.tdb.1
dbid:0x521b7544 name:smbXsrv_version_global.tdb path:/var/tmp/samba/ctdb/smbXsrv_version_global.tdb.1
dbid:0x4d2a432b name:g_lock.tdb path:/var/tmp/samba/ctdb/g_lock.tdb.1
*dbid:0x6645c6c4 name:ctdb.tdb path:/exablox/ctdb/persistent/ctdb.tdb.1 PERSISTENT*


Please review the patch and let me know your comments.

Regards,
--Partha
Amitay Isaacs via samba-technical
2018-04-27 08:37:36 UTC
Hi Partha,

On Fri, Apr 27, 2018 at 10:14 AM, Partha Sarathi via samba-technical
Post by Partha Sarathi via samba-technical
Hi,
We have a requirement to support a large cluster i.e scaling from 20 nodes
to 50 nodes cluster and the current CTDB design may not support the linear
scaling of nodes in the cluster as the replication of all lock
related tdbs and tdb traversing for every record may slow down the
performance.
There are many areas in CTDB that require work to improve scalability
to a large number of nodes. Many of the improvements are on the
roadmap.

<shameless promo>
One of the major improvements is to split the monolithic daemon into
separate daemons. Martin and I have been doing lots of ground work to
get to a point where we can start introducing separate daemons. There
will be lots of patches appearing on the mailing list soon to that
effect. This will eventually get us to leaner database daemon(s).
</shameless promo>
Post by Partha Sarathi via samba-technical
The product requirement for creating large number nodes say 50 in a
cluster with subgrouping them into three/five nodes into
multiple protocol heads. Each of these protocol head group nodes will host
a specific set of shares not all. So we took an approach to create two
instances of CTDBD on each node.
1) The primary ctdbd (Persistent ctdbd) is responsible to just replicate
the persistent TDBs across the cluster in our case 50 nodes to maintain the
AD registered account details and supporting single global namespace across
the large cluster.
2) The secondary instance called ( Locking ctdbd) is responsible
to replicate and traverse the lock related TDBs within the protocol heads
group in that way reducing the latency TDB transactions (expensive when the
number of nodes is large) by just communicating within the limited nodes
group.
3) Smbd changed in such a way that it communicates against these two
instance of CTDBDs with different ctdbd sockets. The message inits and the
connection handling have been well-taken care.
To have the above ctdbd running independently they are configured
separately and listening on different ctdb ports (4379 and 4380)
respectively.
It's an interesting hack. But I would not recommend running multiple
instances of ctdb daemon. Among the many reasons is "ctdb daemon is
still running with real-time". You definitely don't want multiple
user-space daemons running at real-time priority. Additionally, two
ctdb instances create unnecessary network overhead of double the
communication for two separate ctdb clusters.

One approach for solving this problem would be VNNMAP groups.

VNNMAP is a collection of nodes which participate in database
activity. Even though it's applicable to both the persistent and the
volatile databases, it has more effect on the volatile databases.
Volatile databases are the distributed temporary databases (e.g.
locking.tdb). Currently all the nodes are in a single VNNMAP.

With VNNMAP groups, we can partition the nodes into groups. Each group
then maintains the volatile databases independently of the other
groups. Of course, the samba configuration (share definitions) must be
identical for all the nodes in a group. Also, samba shares in two
different groups cannot have overlapping file system directories
(unless they are read-only shares). This should effectively reproduce
the same behaviour you have achieved with two ctdb instances, but
without needing any change in samba.
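
The constraint on share directories could be checked mechanically. The
following is only a sketch in Python: the data layout (a group id mapped to
a list of (path, read_only) pairs) and the function names are hypothetical,
and it assumes that an overlap is acceptable only when both shares are
read-only.

```python
from itertools import combinations
from pathlib import PurePosixPath


def overlaps(a: str, b: str) -> bool:
    """True if the two directories are equal or one contains the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def conflicting_shares(groups):
    """groups maps a (hypothetical) group id to a list of
    (path, read_only) shares. Return the pairs of paths that belong to
    different groups and overlap, skipping pairs where both are read-only."""
    conflicts = []
    for (_, shares1), (_, shares2) in combinations(sorted(groups.items()), 2):
        for path1, ro1 in shares1:
            for path2, ro2 in shares2:
                if overlaps(path1, path2) and not (ro1 and ro2):
                    conflicts.append((path1, path2))
    return conflicts


groups = {
    0: [("/exports/Public", False)],
    1: [("/exports/Public/docs", False), ("/exports/media", True)],
}
print(conflicting_shares(groups))
# [('/exports/Public', '/exports/Public/docs')]
```

An empty result would mean the share layout is compatible with partitioning
the volatile databases by group.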

Amitay.
Partha Sarathi via samba-technical
2018-04-27 15:31:27 UTC
Thanks, Amitay, for your feedback on the patch and on the scalability of CTDB.

How do we set up VNNMAP groups? Could you please give an example?

Regards,
--Partha
Martin Schwenke via samba-technical
2018-04-28 06:41:50 UTC
Hi Partha,

On Fri, 27 Apr 2018 08:31:27 -0700, Partha Sarathi via samba-technical
Post by Partha Sarathi via samba-technical
How do we do the VNNMAP groups, could you please give an example of doing
that.
Just in case Amitay was unclear, VNNMAP groups is a feature that
doesn't exist in CTDB yet. We were talking about the problem you're
trying to solve and Amitay came up with the term. :-)

To make the idea more concrete, we would associate a VNNMAP group
number with each node address. One possible implementation would be to
have an optional VNNMAP group for each address in the nodes file. The
default group, when left unspecified, could be 0. So, the default case
would leave all nodes in group 0 and a cluster would behave the way it
does now. However, if some nodes are in a different group then the
behaviour changes.

This tells us how VNNMAP groups would be used, but not how they would be
implemented.

The implementation needs to change all of the places in the code where
actions involving volatile databases use the VNNMAP and ensure that the
(local) VNNMAP for the group is used instead. Operations on
persistent/replicated databases involve all active nodes. The
operations on volatile databases would include calculation of LMASTER
when a record is being migrated, recovery, traverse, vacuuming, ...

For example, if a node triggers a recovery then volatile databases would
only be recovered within the VNNMAP group but persistent/replicated
databases would be recovered cluster-wide.
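
To make the proposal a little more tangible, here is a minimal Python sketch
of the behaviour described above. Everything in it is hypothetical: VNNMAP
groups do not exist in CTDB, the nodes file syntax (an address followed by
an optional integer group) is only one possible form, and the names
parse_nodes and recovery_scope are made up for illustration. Node numbers
are simply taken from line position, and node state (active or not) is
ignored.

```python
def parse_nodes(text):
    """Parse a hypothetical nodes file: one address per line, optionally
    followed by a VNNMAP group number. An unspecified group defaults to 0."""
    nodes = []
    lines = [line for line in text.splitlines() if line.strip()]
    for pnn, line in enumerate(lines):
        fields = line.split()
        group = int(fields[1]) if len(fields) > 1 else 0
        nodes.append({"pnn": pnn, "addr": fields[0], "group": group})
    return nodes


def recovery_scope(nodes, trigger_pnn, db_is_volatile):
    """Return the node numbers that would take part in recovering a
    database when the node trigger_pnn triggers a recovery."""
    if not db_is_volatile:
        # Persistent/replicated databases are recovered cluster-wide.
        return sorted(n["pnn"] for n in nodes)
    # Volatile databases are recovered only within the trigger's group.
    group = next(n["group"] for n in nodes if n["pnn"] == trigger_pnn)
    return sorted(n["pnn"] for n in nodes if n["group"] == group)


NODES_FILE = """\
10.0.0.1
10.0.0.2
10.0.0.3 1
10.0.0.4 1
"""
nodes = parse_nodes(NODES_FILE)
print(recovery_scope(nodes, 2, db_is_volatile=True))   # [2, 3]
print(recovery_scope(nodes, 2, db_is_volatile=False))  # [0, 1, 2, 3]
```

With no group column at all, every node lands in group 0 and both calls
return the full node list, which matches the current single-VNNMAP
behaviour.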

This would be a very neat feature!

peace & happiness,
martin
Ralph Böhme via samba-technical
2018-04-28 14:33:43 UTC
Hi Partha,
Post by Partha Sarathi via samba-technical
We have a requirement to support a large cluster i.e scaling from 20 nodes
to 50 nodes cluster and the current CTDB design may not support the linear
scaling of nodes in the cluster as the replication of all lock
related tdbs and tdb traversing for every record may slow down the
performance.
hm, which itch are you actually trying to scratch here?

I guess there are two scalability problems in the context of volatile dbs (eg
locking.tdb) in ctdb: record contention and vacuuming. Traverses are normally
not an issue as smbd doesn't request traverses of volatile dbs in production. Am
I missing something? Is vacuuming really a problem? Amitay?

-slow
--
Ralph Boehme, Samba Team https://samba.org/
Samba Developer, SerNet GmbH https://sernet.de/en/samba/
GPG Key Fingerprint: FAE2 C608 8A24 2520 51C5
59E4 AA1E 9B71 2639 9E46
Partha Sarathi via samba-technical
2018-04-29 17:29:51 UTC
Hi Ralph,

Thanks, for checking on this.

These are the contents of Volker's slide deck from the 2016 SNIA conference.

======
locking.tdb Scalability
Locking.tdb is Samba's central store for open file handle information.
* Every file open/close goes through locking.tdb
* One record per inode carries all share modes, leases, etc
* Modern clients \\server\share directory handle open
* Nonclustered Samba copes with it, although records get large
* Clustered locking.tdb record becomes "hot", bouncing between nodes >>>>>
This particular reason "itched me" :-)
* For the share root directory, you might cheat
* Assign per-node fake device number for \
* No record bouncing, no cross share modes
* phonebook.exe still a problem
* Split up locking.tdb per node and global component >>>>> So I
started scratching :-)
* Only store the strictest share mode in ctdb
* Keep individual share modes local per node.

======

I hope my understanding is correct. This is the problem I started trying to
solve. If I am wrong, please correct me and shed some light, and please
suggest an alternative way to scale CTDB in a cluster with a large number
of nodes.

Regards,
--Partha
Volker Lendecke via samba-technical
2018-04-30 06:43:51 UTC
Post by Partha Sarathi via samba-technical
* Clustered locking.tdb record becomes "hot", bouncing between nodes
This particular reason "itched me" :-)
Is that really solved by partitioning of the ctdb cluster? I could
imagine that if you share the same file space you will see broken
locking by subcluster A not seeing the locks of subcluster B. If you
separate out the filespaces, you don't win much except less nodes
beating the same files. Does that really gain enough performance?

Volker
--
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:***@sernet.de
Ralph Böhme via samba-technical
2018-04-30 09:46:45 UTC
Post by Volker Lendecke via samba-technical
Post by Partha Sarathi via samba-technical
* Clustered locking.tdb record becomes "hot", bouncing between nodes
This particular reason "itched me" :-)
Is that really solved by partitioning of the ctdb cluster? I could
imagine that if you share the same file space you will see broken
locking by subcluster A not seeing the locks of subcluster B. If you
separate out the filespaces, you don't win much except less nodes
beating the same files. Does that really gain enough performance?
also, I guess partitioning the filespace should be possible without partitioning
the cluster at the ctdb level at all.

-slow
--
Ralph Boehme, Samba Team https://samba.org/
Samba Developer, SerNet GmbH https://sernet.de/en/samba/
GPG Key Fingerprint: FAE2 C608 8A24 2520 51C5
59E4 AA1E 9B71 2639 9E46
Partha Sarathi via samba-technical
2018-04-30 14:36:14 UTC
OK. My concern is that when a common CTDB runs across a cluster with
different file spaces, keeping locking.tdb replicated for all file opens
doesn't seem to be worth it.

So we decided to split the ctdb along with the file space, to keep the
subclusters isolated and not interfering with each other. But as I said, we
need a single global namespace for users, so we have to keep secrets.tdb
consistent across all the nodes irrespective of filespace.

--Partha
Volker Lendecke via samba-technical
2018-04-30 14:52:13 UTC
Post by Partha Sarathi via samba-technical
Ok. My concern is when you have common Ctdb running across cluster with
different file spaces keeping the locking.tdb replicated for all file opens
doesn’t seems to be worth.
locking.tdb is never actively replicated. The records are only ever
moved on demand to the nodes that actively request it. That's
different from secrets.tdb for example.

Volker
--
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:***@sernet.de
Partha Sarathi via samba-technical
2018-04-30 23:12:01 UTC
Post by Volker Lendecke via samba-technical
Post by Partha Sarathi via samba-technical
Ok. My concern is when you have common Ctdb running across cluster with
different file spaces keeping the locking.tdb replicated for all file opens
doesn’t seems to be worth.
locking.tdb is never actively replicated. The records are only ever
moved on demand to the nodes that actively request it. That's
different from secrets.tdb for example.

Thanks, Volker and Ralph, but I see different behavior.

I had a three-node cluster running a common ctdb, with a different filespace
on each of the nodes, as below:

pnn:1 fe80::ec4:7aff:fe34:ac0b OK
pnn:2 fe80::ec4:7aff:fe34:ee47 OK
pnn:3 fe80::ec4:7aff:fe34:b923 OK (THIS NODE)
Generation:1740147135
Size:3
hash:0 lmaster:1
hash:1 lmaster:2
hash:2 lmaster:3
Recovery mode:NORMAL (0)
Recovery master:2


1) Opened "1.pdf" on node 1 and noticed a couple of records updated in
"locking.tdb.1" and also in "locking.tdb.2" on node 2.
2) Opened "3.pdf" on node 3 and noticed a couple of records updated in
"locking.tdb.3" and also in both "locking.tdb.2" and "locking.tdb.1".
Per your statement, I was expecting that a node would not get records unless
it specifically requested them. But in the above example the records were
available on all the nodes without any node asking for them. One more thing
I learned is that every node in the cluster tries to update its open/close
file records on the Recovery master. In a large cluster with different
filespaces, the Recovery master may be overwhelmed with record updates
unnecessarily.

Below is a locking.tdb dump. Different files were opened and closed on each
of the three nodes, but locking.tdb had all the records on all the nodes.
The open records for "3.pdf" were not needed on node 1, but the Recovery
master had those records, so it updated/replicated them to the rest of the
nodes in the cluster.

So this kind of cluster-wide replication may slow down overall performance
when you open a large number of files in different file spaces across
subclusters.

***@oneblox40274:/var/tmp/samba/ctdb# tdbdump locking.tdb.1
{
key(24) =
"\16\00\00\00\00\00\00\00c\0C\01\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00\BA\EF\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"%\00\00\00\00\00\00\00\E8\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(480) =
"\02\00\00\00\00\00\00\00\01\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\A7\02\DDg\AD\C3\E0\A4\00\00\02\00\04\00\02\00\00\00\00\00\02\00\00\00\02\00\00\00\00\00\00\00\ACt\00\00\00\00\00\00\00\00\00\00\01\00\00\00\B1\0CZ\C4\E2\F92X\C6\00\00\00\00\00\00\00\00\00\00\00\FF\FF\FF\FF\81\00\10\00\07\00\00\00\00\00\00\00\00\00\00\00;\81\E7Z\00\00\00\00e\22\03\00%\00\00\00\00\00\00\00\E8\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\CA\F23\F9\00\00\00\00\FE\FF\00\00\00\00\00\00\A6\EB\F4'\00\00\00\00\00\00\00\00\ACt\00\00\00\00\00\00\00\00\00\00\01\00\00\00\B1\0CZ\C4\E2\F92Xm\13\00\00\00\00\00\00\00\00\00\00\FF\FF\FF\FF\81\00\10\00\07\00\00\00\00\00\00\00\00\00\00\00\C3\84\E7Z\00\00\00\00\F8\10\06\00%\00\00\00\00\00\00\00\E8\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00C\B5\DB\0B\00\00\00\00\FE\FF\00\00\00\00\00\00\A6\EB\F4'\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00%\00\00\00\00\00\00\00\E8\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\10\00\00\00\00\00\00\00\10\00\00\00/exports/Public\00\02\00\00\00\00\00\00\00\02\00\00\00.\00\00\00\00\00\00\00;\81\E7Z\00\00\00\00e\22\03\00\00\00\00\00\C3\84\E7Z\00\00\00\00\F8\10\06\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00b\0C\01\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00~\04\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00\EAL\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00\87@\02\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00\B1\F9\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00\92>\02\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00\1B5\02\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00\CB\08\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00t\FE\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00z\11\02\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00\D6Y\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00\9E\06\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00r\09\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"#\00\00\00\00\00\00\00\B1d\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00\B0!\02\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
{
key(24) =
"\16\00\00\00\00\00\00\00=\0C\01\00\00\00\00\00\00\00\00\00\00\00\00\00"
data(24) =
"\01\00\00\00\00\00\00\00\03\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00"
}
--
Thanks & Regards
-Partha
Martin Schwenke via samba-technical
2018-05-01 01:37:58 UTC
Hi Partha,

On Mon, 30 Apr 2018 16:12:01 -0700, Partha Sarathi via samba-technical
Post by Partha Sarathi via samba-technical
Post by Volker Lendecke via samba-technical
Post by Partha Sarathi via samba-technical
Ok. My concern is when you have common Ctdb running across cluster with
different file spaces keeping the locking.tdb replicated for all file opens
doesn’t seems to be worth.
locking.tdb is never actively replicated. The records are only ever
moved on demand to the nodes that actively request it. That's
different from secrets.tdb for example.
Thanks, Volker and Ralph, but I see different behavior.
I had three node cluster running a common ctdb with different filespace on
each of the nodes as below
pnn:1 fe80::ec4:7aff:fe34:ac0b OK
pnn:2 fe80::ec4:7aff:fe34:ee47 OK
pnn:3 fe80::ec4:7aff:fe34:b923 OK (THIS NODE)
Generation:1740147135
Size:3
hash:0 lmaster:1
hash:1 lmaster:2
hash:2 lmaster:3
Recovery mode:NORMAL (0)
Recovery master:2
1) Opened a "1.pdf" on node 1 and noticed couple records updated in
"locking.tdb.1" and also in node 2 "locking.tdb.2".
2) Opened a "3.pdf" on node 3 and noticed couple records updated in
"locking.tdb.3" and also in both "locking.tdb.2" and locking.tdb.1
Per your statement what I was expecting was unless any node specifically
request for the records, it shouldn't have to get those records. but in the
above example, even without asking all the records were available on all
the nodes. Basically, one more understanding what I learned is, every node
in the cluster try to update their open/close file records to Recovery
master in the large cluster with different filespace it may be overwhelmed
with all record updates unnecessarily.
The below is the locking.tdb dumps on all the three nodes for different
file open/close but with lcoking.tdb had all the records on all the nodes.
The open records for file "3.pdf" was not necessary on node 1, but Recovery
master had those records, so it updated/replicated to rest of the nodes in
the cluster.
So this kind of cluster-wide replications may slow down the overall
performance when you are trying to open a large number of file with
different file spaces in subclusters.
I don't think you're seeing records in a volatile database
being replicated. However, there is a simple explanation for what
you're seeing, especially on a 3 node cluster!

As others have said, the volatile databases are distributed.

Unfortunately the diagrams at:

https://wiki.samba.org/index.php/Samba_%26_Clustering#Finding_the_DMASTER

are wrong. I have a new diagram but need to discuss with people
whether the above should be kept as a historical document or whether I
should update.

CTDB uses 2 (relatively :-) simple concepts for doing the distribution:

* DMASTER (or data master)

This is the node that has the most recent copy of a record.

The big question is: How can you find this DMASTER? The answer is...

* LMASTER (or location master)

This node always knows which node is DMASTER.

The LMASTER for a record is calculated by hashing the record key and
then doing a modulo of the number of active, LMASTER-capable nodes
and then mapping this to a node number via the VNNMAP.
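
As a rough illustration of the LMASTER calculation just described, here is
a small Python sketch. The function name lmaster_for_key is hypothetical,
and zlib.crc32 is only a stand-in: it is not the hash function CTDB
actually uses. The VNNMAP is represented as a plain list whose index is the
hash slot and whose value is the node number.

```python
import zlib


def lmaster_for_key(key: bytes, vnnmap: list[int]) -> int:
    """Pick the LMASTER for a record: hash the key, reduce the hash modulo
    the number of VNNMAP entries, and look up the node at that index."""
    # Stand-in hash for illustration only; NOT CTDB's real hash function.
    h = zlib.crc32(key)
    return vnnmap[h % len(vnnmap)]


# VNNMAP as shown in the three-node status output quoted above:
# hash:0 lmaster:1, hash:1 lmaster:2, hash:2 lmaster:3
vnnmap = [1, 2, 3]
for key in (b"record-a", b"record-b", b"record-c"):
    print(key, "-> LMASTER", lmaster_for_key(key, vnnmap))
```

Because the result depends only on the key and the VNNMAP, every node can
compute the same LMASTER for a record without asking anyone.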

Let's say you have 3 nodes (A, B, C) and node A wants a
particular record. Let's say that node B is the LMASTER for that
record.

There are 3 cases, depending on which node is DMASTER:

* DMASTER is A

smbd will find the record locally. No migration is necessary. The
LMASTER is not consulted.

* DMASTER is B

A will ask B for the record. B will notice that it is DMASTER and
will forward the record to A. The record will be updated on both A
and B because the change of DMASTER must be recorded.

* DMASTER is C

A will ask B for the record. B will notice that it is not DMASTER
and forward the request to C. C forwards the record to B, which
forwards it to A. The record will be updated on A, B and C because
the change of DMASTER must be recorded.
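
The three cases can be written down as a small Python sketch. The function
migrate and its return format are hypothetical; it models only the message
flow described above for a requester that is not itself the LMASTER, and it
is not CTDB code.

```python
def migrate(requester, lmaster, dmaster):
    """Return (hops, touched) for one record fetch.

    hops is the list of (sender, receiver, what) messages; touched is the
    set of nodes whose copy of the record is updated because the change of
    DMASTER must be recorded."""
    if dmaster == requester:
        # Record is already local: no migration, LMASTER not consulted.
        return [], set()
    if requester == lmaster:
        # This case is not covered by the description above.
        raise ValueError("requester is the LMASTER: case not modelled")
    hops = [(requester, lmaster, "request")]
    if dmaster != lmaster:
        # LMASTER is not DMASTER: forward the request, get the record back.
        hops.append((lmaster, dmaster, "request"))
        hops.append((dmaster, lmaster, "record"))
    hops.append((lmaster, requester, "record"))
    return hops, {requester, lmaster, dmaster}


for dmaster in "ABC":
    hops, touched = migrate("A", "B", dmaster)
    print("DMASTER", dmaster, "hops:", len(hops), "updated:", sorted(touched))
```

Note that the number of nodes involved is at most three, whatever the size
of the cluster.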

You can now add nodes D, E, F, ... and they will not affect migration
of the record (if there is no contention for the record from those
additional nodes).

If there is heavy contention for a record then 2 different performance
issues can occur:

* High hop count

Before C gets the request from node B, C responds to a migration
request from another node and is no longer DMASTER for the record.
C must then forward the request back to the LMASTER. This can go on
for a while. CTDB logs this sort of behaviour and keeps statistics.

* Record migrated away before smbd gets it

The record is successfully migrated to node A and ctdbd informs the
requesting smbd that the record is there. However, before smbd can
grab the record, a request is processed to migrate the record to
another node. smbd looks, notices that node A is not DMASTER and
must once again ask ctdbd to migrate the record. smbd may log if
there are multiple attempts to migrate a record.
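
The second issue amounts to a retry loop on the smbd side. Below is a toy
Python model of it; fetch_locked, contended_migration and the warn_after
threshold are all hypothetical names chosen for illustration, and the
"rival" callback simply simulates another node winning the record a fixed
number of times.

```python
def fetch_locked(record, local_node, request_migration, warn_after=10):
    """Keep asking for the record until this node is DMASTER, counting
    how many migration attempts were needed."""
    attempts = 0
    while record["dmaster"] != local_node:
        attempts += 1
        request_migration(record, local_node)
    if attempts > warn_after:
        print(f"record needed {attempts} migration attempts")
    return attempts


def contended_migration(steals, rival):
    """Build a request_migration callback in which a rival node takes the
    record away `steals` times before the local node gets to keep it."""
    remaining = [steals]

    def request_migration(record, local_node):
        if remaining[0] > 0:
            remaining[0] -= 1
            record["dmaster"] = rival  # migrated away before we grabbed it
        else:
            record["dmaster"] = local_node

    return request_migration


record = {"dmaster": "C"}
print(fetch_locked(record, "A", contended_migration(2, "B")))  # 3
```

In this model an uncontended record costs one attempt, and each time the
record is taken away first costs one more.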

Try this

git grep attempts source3/lib/dbwrap

to get an initial understanding of what is logged and what the
parameters are. :-)

Read-only and sticky records are existing features that may help to
counteract contention.

peace & happiness,
martin
Ralph Böhme via samba-technical
2018-05-03 18:03:41 UTC
Post by Ralph Böhme via samba-technical
https://wiki.samba.org/index.php/Samba_%26_Clustering#Finding_the_DMASTER
are wrong. I have a new diagram but need to discuss with people
whether the above should be kept as a historical document or whether I
should update.
please keep the reference and create a slick new page with the correct
diagram. Bonus points if you add the text below. Either way, I'll owe you a pot
of best green tea! :)
Post by Ralph Böhme via samba-technical
* DMASTER (or data master)
This is the node that has the most recent copy of a record.
The big question is: How can you find this DMASTER? The answer is...
* LMASTER (or location master)
This node always knows which node is DMASTER.
The LMASTER for a record is calculated by hashing the record key and
then doing a modulo of the number of active, LMASTER-capable nodes
and then mapping this to a node number via the VNNMAP.
Let's say you have 3 nodes (A, B, C) and node A wants a
particular record. Let's say that node B is the LMASTER for that
record.
* DMASTER is A
smbd will find the record locally. No migration is necessary. The
LMASTER is not consulted.
* DMASTER is B
A will ask B for the record. B will notice that it is DMASTER and
will forward the record to A. The record will be updated on both A
and B because the change of DMASTER must be recorded.
* DMASTER is C
A will ask B for the record. B will notice that it is not DMASTER
and forward the request to C. C forwards the record to B, which
forwards it to A. The record will be updated on A, B and C because
the change of DMASTER must be recorded.
You can now add nodes D, E, F, ... and they will not affect migration
of the record (if there is no contention for the record from those
additional nodes).
If there is heavy contention for a record then 2 different performance
issues can occur:
* High hop count
Before C gets the request from node B, C responds to a migration
request from another node and is no longer DMASTER for the record.
C must then forward the request back to the LMASTER. This can go on
for a while. CTDB logs this sort of behaviour and keeps statistics.
* Record migrated away before smbd gets it
The record is successfully migrated to node A and ctdbd informs the
requesting smbd that the record is there. However, before smbd can
grab the record, a request is processed to migrate the record to
another node. smbd looks, notices that node A is not DMASTER and
must once again ask ctdbd to migrate the record. smbd may log if
there are multiple attempts to migrate a record.
Thanks!
-slow
--
Ralph Boehme, Samba Team https://samba.org/
Samba Developer, SerNet GmbH https://sernet.de/en/samba/
GPG Key Fingerprint: FAE2 C608 8A24 2520 51C5
59E4 AA1E 9B71 2639 9E46
Martin Schwenke via samba-technical
2018-05-04 02:49:10 UTC
Post by Ralph Böhme via samba-technical
Post by Ralph Böhme via samba-technical
https://wiki.samba.org/index.php/Samba_%26_Clustering#Finding_the_DMASTER
are wrong. I have a new diagram but need to discuss with people
whether the above should be kept as a historical document or whether I
should update.
please keep the reference and create a slick new page with the correct
diagram. Bonus points if you add the text below. Either way, I'll owe you a pot
of best green tea! :)
Yeah, I wrote that description nice and carefully so I could copy
and paste it to the wiki. Beer me, please! :-)

I added this

https://wiki.samba.org/index.php/CTDB_database_design

Linked from:

https://wiki.samba.org/index.php?title=CTDB_design

which is linked from:

https://wiki.samba.org/index.php/CTDB_and_Clustered_Samba

peace & happiness,
martin
Ralph Böhme via samba-technical
2018-05-04 07:12:56 UTC
Post by Martin Schwenke via samba-technical
Post by Ralph Böhme via samba-technical
please keep the reference and create a slick new page with the correct
diagram. Bonus points if you add the text below. Either way, I'll owe you a pot
of best green tea! :)
Yeah, I wrote that description nice and carefully so I could copy
and paste it to the wiki. Beer me, please! :-)
whatever you wish, at your service!
Post by Martin Schwenke via samba-technical
I added this
https://wiki.samba.org/index.php/CTDB_database_design
https://wiki.samba.org/index.php?title=CTDB_design
https://wiki.samba.org/index.php/CTDB_and_Clustered_Samba
Awesome, thanks!

-slow
Ralph Böhme via samba-technical
2018-04-30 15:07:18 UTC
Post by Partha Sarathi via samba-technical
Ok. My concern is when you have common Ctdb running across cluster with
different file spaces keeping the locking.tdb replicated for all file opens
doesn’t seems to be worth.
as already pointed out by Volker, locking.tdb is not replicated. It's a
so-called volatile tdb that uses an approach which is described in an older -- but
still mostly correct -- design doc here:

<https://wiki.samba.org/index.php/Samba_%26_Clustering#Finding_the_DMASTER>

-slow