Discussion:
LDB python3 strings
Andrew Bartlett via samba-technical
2018-05-01 21:28:16 UTC
Permalink
G'Day Noel,

Thanks so much for continuing the python3 work. This is really
important and I'm so glad to be able to pass on the baton here.

One thing that came up in a discussion in the Catalyst office regarding
this work is worth raising more broadly.

It is exceedingly common in Samba's use of ldb to use:

username = str(res[0]["samAccountName"])

This works because of

static PyObject *py_ldb_msg_element_str(PyLdbMessageElementObject *self)
{
        struct ldb_message_element *el = pyldb_MessageElement_AsMessageElement(self);

        if (el->num_values == 1)
                return PyStr_FromStringAndSize((char *)el->values[0].data, el->values[0].length);
        else
                Py_RETURN_NONE;
}

However equally common is:

username = str(res[0]["samAccountName"][0])

This works because in python2 it just returns the string. However in
python3 I'm told it will return "b'username'" (no so helpful).

As all strings in LDAP are UTF8 (I'm willing to assert that for sanity)
I think we need the MessageElement to contain not byte buffers, but a
subclass of byte buffers that have a string function that converts
automatically produces a utf8 string for str().

Do you think you could have a look at that? Otherwise, converting
samba-tool and our other ldb-calling code is going to get very tricky.

Thanks!

Andrew Bartlett
--
Andrew Bartlett
https://samba.org/~abartlet/
Authentication Developer, Samba Team https://samba.org
Samba Development and Support, Catalyst IT
https://catalyst.net.nz/services/samba
Noel Power via samba-technical
2018-05-02 07:58:51 UTC
Permalink
Hi
Post by Andrew Bartlett via samba-technical
G'Day Noel,
Thanks so much for continuing the python3 work. This is really
important and I'm so glad to be able to pass on the baton here.
Well I hope I am not going to be alone in working on this and I hope
everyone who was also contributing will still do so, I don't really have
the background knowledge (or even python skills) but I'm happy to keep
pushing on as best and as hard as I can
Post by Andrew Bartlett via samba-technical
One thing that came up in a discussion in the Catalyst office regarding
this work is worth raising more broadly.
username = str(res[0]["samAccountName"])
This works because of
static PyObject *py_ldb_msg_element_str(PyLdbMessageElementObject *self)
{
        struct ldb_message_element *el = pyldb_MessageElement_AsMessageElement(self);
        if (el->num_values == 1)
                return PyStr_FromStringAndSize((char *)el->values[0].data, el->values[0].length);
        else
                Py_RETURN_NONE;
}
Not always :-/ It seems some attributes are not strings e.g. guids can
be binary also same for security descriptors. These can fail with
str(res[0]["blah"]) as there could easily be a decode error before even
the py c code returns (I've even had to deal with this in my WIP)
Post by Andrew Bartlett via samba-technical
username = str(res[0]["samAccountName"][0])
probably more common is just the plain res[0]["samAccountName"][0] the
str doesn't do anything in this case I think and the majority of the
code I have seen doesn't enclose the value in the 'str' function
Post by Andrew Bartlett via samba-technical
This works because in python2 it just returns the string. However in
python3 I'm told it will return "b'username'" (no so helpful).
As all strings in LDAP are UTF8 (I'm willing to assert that for sanity)
I think we need the MessageElement to contain not byte buffers, but a
subclass of byte buffers that have a string function that converts
automatically produces a utf8 string for str().
not sure exactly what you mean here because doesn't decode provide the
same functionality?
   e.g. res[0]["samAccountName"][0].decode('utf8')

or do you mean change the api so that 'res[0]["samAccountName"][0]' will
now return an object that provides a 'str' method *and* additionally
some sort or a 'to_bytes' [1] type method this would mean we would have
to modify

-  res[0]["blah"][0]'
+  str(res[0]["blah"][0])'

with the exception of those attributes that we require binary content
for where they would have to

-  res[0]['binaryAttr'][0]
+ res[0]['binaryAttr'][0].to_bytes()'

However there doesn't seem really to be much difference in effort here
than just adding the decode where necessary like

-  res[0]['blah'][0]
+ res[0]['blah][0].decode('uft8')

Now I readily admit I am not really a python programmer nor have really
a huge amount of knowledge of the samba python api so I guess I am
missing something ?

Also if anyone has an easy list of what attributes definitely have
binary content that would be useful
Post by Andrew Bartlett via samba-technical
Do you think you could have a look at that? Otherwise, converting
samba-tool and our other ldb-calling code is going to get very tricky.
yep, I am already experiencing that, I've already converted a hunk of
the samba_tool tests (those exercising the api) to python3 (you can see
the progress https://github.com/samba-team/samba/pull/161 - please note,
this is a WIP branch, there's only a pull request for visibility and CI
exposure) The string/binary issue around attributes is annoying. I'd
welcome any more input, suggestions or other possible solution there.

Noel

[1] I expected python3 to provide a 'tp_bytes' type c-function hook,
afaik in native python you can define a '__bytes__' method. However this
doesn't seem to be the case.
Andrew Bartlett via samba-technical
2018-05-02 08:08:38 UTC
Permalink
Post by Noel Power via samba-technical
Hi
Post by Andrew Bartlett via samba-technical
G'Day Noel,
Thanks so much for continuing the python3 work. This is really
important and I'm so glad to be able to pass on the baton here.
Well I hope I am not going to be alone in working on this and I hope
everyone who was also contributing will still do so, I don't really have
the background knowledge (or even python skills) but I'm happy to keep
pushing on as best and as hard as I can
OK. I had hoped from the enthusiasm that you might have had a bit more
background, but I still appreciate the efforts. This has totally
exhausted too many engineers so far!
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
One thing that came up in a discussion in the Catalyst office regarding
this work is worth raising more broadly.
username = str(res[0]["samAccountName"])
This works because of
static PyObject *py_ldb_msg_element_str(PyLdbMessageElementObject *self)
{
struct ldb_message_element *el = pyldb_MessageElement_AsMessageElement(self);
if (el->num_values == 1)
return PyStr_FromStringAndSize((char *)el->values[0].data, el->values[0].length);
else
Py_RETURN_NONE;
}
Not always :-/ It seems some attributes are not strings e.g. guids can
be binary also same for security descriptors. These can fail with
str(res[0]["blah"]) as there could easily be a decode error before even
the py c code returns (I've even had to deal with this in my WIP)
Sure, I was kind of happy for that to give an error, as that is just
programmer error.
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
username = str(res[0]["samAccountName"][0])
probably more common is just the plain res[0]["samAccountName"][0] the
str doesn't do anything in this case I think and the majority of the
code I have seen doesn't enclose the value in the 'str' function
yes, but then we have to add a decode(), right?
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
This works because in python2 it just returns the string. However in
python3 I'm told it will return "b'username'" (no so helpful).
As all strings in LDAP are UTF8 (I'm willing to assert that for sanity)
I think we need the MessageElement to contain not byte buffers, but a
subclass of byte buffers that have a string function that converts
automatically produces a utf8 string for str().
not sure exactly what you mean here because doesn't decode provide the
same functionality?
e.g. res[0]["samAccountName"][0].decode('utf8')
Yes, but that means changing a lot of code.
Post by Noel Power via samba-technical
or do you mean change the api so that 'res[0]["samAccountName"][0]' will
now return an object that provides a 'str' method *and* additionally
some sort or a 'to_bytes' [1] type method this would mean we would have
to modify
- res[0]["blah"][0]'
+ str(res[0]["blah"][0])'
with the exception of those attributes that we require binary content
for where they would have to
- res[0]['binaryAttr'][0]
+ res[0]['binaryAttr'][0].to_bytes()'
However there doesn't seem really to be much difference in effort here
than just adding the decode where necessary like
- res[0]['blah'][0]
+ res[0]['blah][0].decode('uft8')
Now I readily admit I am not really a python programmer nor have really
a huge amount of knowledge of the samba python api so I guess I am
missing something ?
I was sort of hoping it would be some kind of weird polymorphic thing
that behaved like a string or bytes in the same way python2 did given
we know it is utf8 if string-ish.
Post by Noel Power via samba-technical
Also if anyone has an easy list of what attributes definitely have
binary content that would be useful
I don't think we can assert that, but there are conventions.
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
Do you think you could have a look at that? Otherwise, converting
samba-tool and our other ldb-calling code is going to get very tricky.
yep, I am already experiencing that, I've already converted a hunk of
the samba_tool tests (those exercising the api) to python3 (you can see
the progress https://github.com/samba-team/samba/pull/161 - please note,
this is a WIP branch, there's only a pull request for visibility and CI
exposure) The string/binary issue around attributes is annoying. I'd
welcome any more input, suggestions or other possible solution there.
OK, I'll try and find some time and I'll ask Joe to keep up looking at
this, he has the strong python background that is critical here.

Thanks!

Andrew Bartlett
Post by Noel Power via samba-technical
Noel
[1] I expected python3 to provide a 'tp_bytes' type c-function hook,
afaik in native python you can define a '__bytes__' method. However this
doesn't seem to be the case.
--
Andrew Bartlett http://samba.org/~abartlet/
Authentication Developer, Samba Team http://samba.org
Samba Developer, Catalyst IT http://catalyst.net.nz/services/samba
Noel Power via samba-technical
2018-05-02 12:01:30 UTC
Permalink
Post by Andrew Bartlett via samba-technical
Post by Noel Power via samba-technical
Hi
Post by Andrew Bartlett via samba-technical
G'Day Noel,
Thanks so much for continuing the python3 work. This is really
important and I'm so glad to be able to pass on the baton here.
Well I hope I am not going to be alone in working on this and I hope
everyone who was also contributing will still do so, I don't really have
the background knowledge (or even python skills) but I'm happy to keep
pushing on as best and as hard as I can
OK. I had hoped from the enthusiasm that you might have had a bit more
background, but I still appreciate the efforts. This has totally
exhausted too many engineers so far!
I think with some 'enthusiasm', communication, cooperation + review we
can achieve alot :-) It's pretty hard to know everything required for
this, python experts typically know little about samba, even people who
know samba may not know the python apis :-). I know just a little about
alot ;-) I'm hoping in this case that it is more useful than knowing
alot about little.
Post by Andrew Bartlett via samba-technical
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
One thing that came up in a discussion in the Catalyst office regarding
this work is worth raising more broadly.
username = str(res[0]["samAccountName"])
This works because of
static PyObject *py_ldb_msg_element_str(PyLdbMessageElementObject *self)
{
struct ldb_message_element *el = pyldb_MessageElement_AsMessageElement(self);
if (el->num_values == 1)
return PyStr_FromStringAndSize((char *)el->values[0].data, el->values[0].length);
else
Py_RETURN_NONE;
}
Not always :-/ It seems some attributes are not strings e.g. guids can
be binary also same for security descriptors. These can fail with
str(res[0]["blah"]) as there could easily be a decode error before even
the py c code returns (I've even had to deal with this in my WIP)
[...]
Post by Andrew Bartlett via samba-technical
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
This works because in python2 it just returns the string. However in
python3 I'm told it will return "b'username'" (no so helpful).
As all strings in LDAP are UTF8 (I'm willing to assert that for sanity)
I think we need the MessageElement to contain not byte buffers, but a
subclass of byte buffers that have a string function that converts
automatically produces a utf8 string for str().
not sure exactly what you mean here because doesn't decode provide the
same functionality?
e.g. res[0]["samAccountName"][0].decode('utf8')
Yes, but that means changing a lot of code.
I am not sure that can be avoided, it might not be a bad thing for the
code to be changed so that string/byte usage is correct and unambiguous
(using the correct encoding/decoding etc).
Post by Andrew Bartlett via samba-technical
Post by Noel Power via samba-technical
or do you mean change the api so that 'res[0]["samAccountName"][0]' will
now return an object that provides a 'str' method *and* additionally
some sort or a 'to_bytes' [1] type method this would mean we would have
to modify
- res[0]["blah"][0]'
+ str(res[0]["blah"][0])'
with the exception of those attributes that we require binary content
for where they would have to
- res[0]['binaryAttr'][0]
+ res[0]['binaryAttr'][0].to_bytes()'
However there doesn't seem really to be much difference in effort here
than just adding the decode where necessary like
- res[0]['blah'][0]
+ res[0]['blah][0].decode('uft8')
Now I readily admit I am not really a python programmer nor have really
a huge amount of knowledge of the samba python api so I guess I am
missing something ?
I was sort of hoping it would be some kind of weird polymorphic thing
that behaved like a string or bytes in the same way python2 did given
we know it is utf8 if string-ish.
I don't see how we can get away with the fact that in python3 there are
hard code boundaries with various api's (native python & samba) that
expect either 'bytes' or a 'str'. The decision needs to be made before
interacting with these apis to provide the correct type. I don't see how
this can be done automagically (but if someone can point to a way that
can be done I am all ears)
Post by Andrew Bartlett via samba-technical
Post by Noel Power via samba-technical
Also if anyone has an easy list of what attributes definitely have
binary content that would be useful
I don't think we can assert that, but there are conventions.
Assuming code needs to be manually adjusted this information would be
useful in order to identify those places that need to be changed
Post by Andrew Bartlett via samba-technical
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
Do you think you could have a look at that? Otherwise, converting
samba-tool and our other ldb-calling code is going to get very tricky.
yep, I am already experiencing that, I've already converted a hunk of
the samba_tool tests (those exercising the api) to python3 (you can see
the progress https://github.com/samba-team/samba/pull/161 - please note,
this is a WIP branch, there's only a pull request for visibility and CI
exposure) The string/binary issue around attributes is annoying. I'd
welcome any more input, suggestions or other possible solution there.
I have to stress that this is WIP, typically I am trying to work through
the tests, get them working (under CI) and then come back to revaluate
the changes needed, try and identify generic changes and then
reorganise/rebase + pull groups of these changes out for review in
separate pull requests/mails to the list.
Post by Andrew Bartlett via samba-technical
OK, I'll try and find some time and I'll ask Joe to keep up looking at
this, he has the strong python background that is critical here.
Thanks!
Andrew Bartlett
thanks again
Noel
William Brown via samba-technical
2018-05-02 21:38:15 UTC
Permalink
On Wed, 2018-05-02 at 13:01 +0100, Noel Power via samba-technical
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
Post by Noel Power via samba-technical
Hi
Post by Andrew Bartlett via samba-technical
G'Day Noel,
Thanks so much for continuing the python3 work. This is really
important and I'm so glad to be able to pass on the baton here.
Well I hope I am not going to be alone in working on this and I hope
everyone who was also contributing will still do so, I don't really have
the background knowledge (or even python skills) but I'm happy to keep
pushing on as best and as hard as I can
OK. I had hoped from the enthusiasm that you might have had a bit more
background, but I still appreciate the efforts. This has totally
exhausted too many engineers so far!
I think with some 'enthusiasm', communication, cooperation + review we
can achieve alot :-) It's pretty hard to know everything required for
this, python experts typically know little about samba, even people who
know samba may not know the python apis :-). I know just a little about
alot ;-) I'm hoping in this case that it is more useful than knowing
alot about little.
I've just come out of a python2 -> python3 work for an ldap library in
389 Directory Server and can agree it's a lot of work, but with some
"community spirit" you can achieve a lot indeed.
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
One thing that came up in a discussion in the Catalyst office regarding
this work is worth raising more broadly.
username = str(res[0]["samAccountName"])
This works because of
static PyObject
*py_ldb_msg_element_str(PyLdbMessageElementObject *self)
{
struct ldb_message_element *el =
pyldb_MessageElement_AsMessageElement(self);
if (el->num_values == 1)
return PyStr_FromStringAndSize((char *)el-
Post by Andrew Bartlett via samba-technical
values[0].data, el->values[0].length);
else
Py_RETURN_NONE;
}
Not always :-/ It seems some attributes are not strings e.g. guids can
be binary also same for security descriptors. These can fail with
str(res[0]["blah"]) as there could easily be a decode error before even
the py c code returns (I've even had to deal with this in my WIP)
[...]
Post by Andrew Bartlett via samba-technical
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
This works because in python2 it just returns the
string. However in
python3 I'm told it will return "b'username'" (no so helpful).
As all strings in LDAP are UTF8 (I'm willing to assert that for sanity)
I think we need the MessageElement to contain not byte buffers, but a
subclass of byte buffers that have a string function that converts
automatically produces a utf8 string for str().
not sure exactly what you mean here because doesn't decode
provide the
same functionality?
e.g. res[0]["samAccountName"][0].decode('utf8')
Yes, but that means changing a lot of code.
I am not sure that can be avoided, it might not be a bad thing for the
code to be changed so that string/byte usage is correct and
unambiguous
(using the correct encoding/decoding etc).
This is a really annoying one to solve. In the lib389 code instead of
putting 'decode' everywhere, we actually added some wrappers onto our
object management code that allowed you to get a type in the manner you
expect.

get_utf8
get_bytes
get_int

Samba 4 could really use:

get_ndr_unpacked()

Which would help to shortcut all the ndr_unpack calls too :)

This was really helpful because there *are* cases where you want bytes,
utf8 or int - perhaps others.


So instead, you could change this to:

res[0].get_utf8('samaccountname')[0]

As an idea.
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
Post by Noel Power via samba-technical
or do you mean change the api so that
'res[0]["samAccountName"][0]' will
now return an object that provides a 'str' method *and*
additionally
some sort or a 'to_bytes' [1] type method this would mean we would have
to modify
- res[0]["blah"][0]'
+ str(res[0]["blah"][0])'
with the exception of those attributes that we require binary content
for where they would have to
- res[0]['binaryAttr'][0]
+ res[0]['binaryAttr'][0].to_bytes()'
However there doesn't seem really to be much difference in effort here
than just adding the decode where necessary like
- res[0]['blah'][0]
+ res[0]['blah][0].decode('uft8')
Now I readily admit I am not really a python programmer nor have really
a huge amount of knowledge of the samba python api so I guess I am
missing something ?
I was sort of hoping it would be some kind of weird polymorphic thing
that behaved like a string or bytes in the same way python2 did given
we know it is utf8 if string-ish.
I don't see how we can get away with the fact that in python3 there are
hard code boundaries with various api's (native python & samba) that
expect either 'bytes' or a 'str'. The decision needs to be made before
interacting with these apis to provide the correct type. I don't see how
this can be done automagically (but if someone can point to a way that
can be done I am all ears)
You can use mypy which is a strict type checker for python to help
enforce some of these at "compile time".
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
Post by Noel Power via samba-technical
Also if anyone has an easy list of what attributes definitely have
binary content that would be useful
I don't think we can assert that, but there are conventions.
Assuming code needs to be manually adjusted this information would be
useful in order to identify those places that need to be changed
Isn't this able to be determined from the schema ....

Anyway, as above: the best solution we found was just to have the
client request the "type it expected", and if it's wrong ... well,
should have asked for those bytes as an int not a string (for example).
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
Do you think you could have a look at that? Otherwise,
converting
samba-tool and our other ldb-calling code is going to get very tricky.
yep, I am already experiencing that, I've already converted a hunk of
the samba_tool tests (those exercising the api) to python3 (you can see
the progress https://github.com/samba-team/samba/pull/161 - please note,
this is a WIP branch, there's only a pull request for visibility and CI
exposure) The string/binary issue around attributes is annoying. I'd
welcome any more input, suggestions or other possible solution there.
I have to stress that this is WIP, typically I am trying to work through
the tests, get them working (under CI) and then come back to
revaluate
the changes needed, try and identify generic changes and then
reorganise/rebase + pull groups of these changes out for review in
separate pull requests/mails to the list.
Post by Andrew Bartlett via samba-technical
OK, I'll try and find some time and I'll ask Joe to keep up looking at
this, he has the strong python background that is critical here.
I'm happy to look at an review these also :)
Post by Noel Power via samba-technical
Post by Andrew Bartlett via samba-technical
Thanks!
Andrew Bartlett
thanks again
Noel
Loading...