Conversation
|
|
||
| @classmethod | ||
| def from_segments(cls, s, separator=None): | ||
| return s.split(separator) if separator else s.split() |
There was a problem hiding this comment.
I find this more obfuscation than convenience. Isn't
cls.from_segments(s.split('_'))more explicit than
cls.from_segments(s, separator='_')? I.e. if the caller already knows what to split on, then they should do it right away.
There was a problem hiding this comment.
One aspect that I would say should not be forgotten here is that we would have two situations that happen a lot in calling the function:
- There is a string, so we want to split by the separator. Normally, actually ALWAYS, it is a
+by which we split, so we havecls.from_segments(string)as the convenient normal case. If we have to uses.split(" + ")here also always, we end up writing many more lines. - There is a list, if we read in data from cldf, where the split is done on
due to the way we handle the data as a list there.
There was a problem hiding this comment.
So if we want to say the from_segments is dealing with Segments in Lingpy aka TOKENS and CLDF aka CLDF_Segments, I'd consider it advantageous to have a check if it is a list and then revert it. But I know this is may obfuscate it even more.
But the handling with separator as kw is something I consider an urgent convenience, since we have the default here, which we'd otherwise have to invoke ALWAYS via s.split(" + ").
There was a problem hiding this comment.
Hm. Maybe we should have even more factory methods? I want to avoid the "seems to work" situations, i.e. situations where you are not forced to think about what your input actually looks like - yet something seems to happen and you just accept the results. Having separate methods that only accept one datatype as input force you to think about this - and allow tools like PyCharm to help you with this.
There was a problem hiding this comment.
Another advantage of additional methods is that methods have docstrings, so we get a canonical place where to document the clever things we might do to manipulate input :)
There was a problem hiding this comment.
@classmethod
def from_segments(cls, s):
"""
only accepts list!
"""Is different from __init__.
Here, we have a list like ["p", "a", "+", "t", "e", "r"]. But we want internally [["p", "a"], ["t", "e", "r"]].
There was a problem hiding this comment.
One way to address this is " ".join(["p", "a", "+", "t", "e", "r"]).split(" + "), but I guess I would prefer a direct solution by iterating over the list and then splitting.
import itertools # using groupby
split_by = lambda lst: [list(group) for k, group in itertools.groupby(lst, lambda x: x == "+") if not k]Example:
>>> split_by("p a t + e r + e r".split())
[['p', 'a', 't'], ['e', 'r'], ['e', 'r']]There was a problem hiding this comment.
I would not use lambda, it was to show how this works. I got the solution after checking again on itertools, looking for the opposit of itertools.chain and then I found this blog demonstrating the solution.
There was a problem hiding this comment.
@LinguList you are right, thank you for pointing that out. I have not thought about this case - then, it does seem reasonable to me to have three factory methods (all of which require some sort of preprocessing before calling __init__ and allow for a custom separator), as you guys have suggested. I can quickly implement that :)
There was a problem hiding this comment.
And I think itertools.groupby is better than using a hand-forged solution.
|
|
||
| def reversed_segments(self): | ||
| return Word([m[::-1] for m in self[::-1]]) | ||
|
|
There was a problem hiding this comment.
Looks good to me.
What do we do with the other changes in this PR?
There was a problem hiding this comment.
That's up to you to decide, of course ;) but since I didn't touch any of the existing methods, I think it would be no harm to keep the from_segments method around -- or do you suggest doing something differently?
As discussed in #31 , here is my proposal for explicitly handling data that already comes segmented and must not be segmented any further. I also included the functionality of passing custom separators and made whitespace handling more flexible.