Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement FileNodeData classes #19

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

blu-base
Copy link
Contributor

Large Chunk ahead =)

This implements the FileNode Types regarding #15 specified in Seciton 2.5 of the MS-ONESTORE spec.
Fortunately, I could transfer a lot of code from my playground.

Furthermore, this PR contains some changes to related classes:

  • FileNode:
    • reserved the abbreviation FND for the FIleNodeData classes, and renamed the FileNodeChunkReference to FNCR,
    • changed most enums to enum classes
    • renamed get_size() and get_location() to cb() and stp(), respectively, to match the labeling from the spec.
  • JCID:
    • adding its own parsing, and default constructor
  • libone_utils:
    • declare operator>> for uints and ints
    • copy-paste read methods for signed ints.

Moreover, this PR contains the changes presented in #17. I used those changes for conveniently implement the parsing. However, this is not absolutely necessary. To keep the option of not using overloaded operators, all parsing methods are kept public and can be used in conventional fashion. This will require to modify the parsing methods only, but should be straight forward.

This PR definitely contains bugs, since i noticed a seg fault when running the converters. I haven't tested the PR enough yet, but wanted to give some material to discuss. That way, we can implement an other coding style, if you feel this PR's to be insufficiently compatible.

I won't mind if assessing the code is going to take a month or two or whatever; no stressing-out!

@tshikaboom
Copy link
Owner

tshikaboom commented Aug 22, 2020

This is indeed quite big! I'm also getting some segfaults but we'll see that on the way, I don't find it that important for the moment.

Let's do that bit by bit. I started with the first commit (6da0f78): could you split that up to have some intermediate steps which would help in the digestibility of the changes?

I'd imagine it being split up in a list of commits like this:

  • libone_utils: uint{8,16,32,64}_t stream parser operators and their callers
  • FileNodeChunkReference: enum {stp,cb}_format -> enum class {Stp,Cb}Format
  • FileNode: enum fnd_id -> enum class FndId (and its to_string() method)
  • Finally, the boilerplate setting and hooking up IFileNodeData (one change I would suggest would be to rename the fileNodeData folder to FileNodeData, as I got so much used to reading it with full CamelCase typing, as it is like that in the reference, it seems out of place to see it start with a small letter.. but that's more of a cosmetic change)

This will need to be rebased anyway as I only started reading this. I can help you in splitting it up if you get stuck! Don't hesitate to get in touch :)

As the changes are also quite big, it would be nice to have more descriptive commit messages telling what is being done. (but that's not important for the moment, as this is WIP anyway)

I'll continue getting the hang of the rest of the changes (and hope that my initial suggestions won't be invalidated by later changes) :)

@tshikaboom
Copy link
Owner

At a more thorough look, this will need rebasing anyway. At a high level, the code itself seems good structure-wise (I'm quite fond of this interface usage and derived types implementing it).

I'd like to see commits which at least build on every step ("atomic changes"?), which would help in bisecting regressions afterwards. I'll have a go at rebasing all the commits after 6da0f78 myself and will get back some results, if that's okay with you.

@tshikaboom
Copy link
Owner

I did a pass on rebasing this myself (including 6da0f78), I've got a (very) crude branch up in https://github.com/tshikaboom/libone/tree/filenode-rebased, on which the diff result should be minimal compared to this branch. That branch is going to be subject to force-pushes.

The method I followed was to try to separate logical changes and get any/all commits to be compilable (this helps a lot with bisecting). I did some progress, but it is still not in an usable state.

I saw there were changes not directly related to this branch (but which this branch uses), I think a good starting point would be to extract any changes not directly related to this feature and get them independently merged, so as to minimize the diff on the rest of this branch (ie. the interesting stuff). This can include all the stuff I mentioned in #19 (comment), but would not be limited to that

I'll continue on cleaning this up tomorrow.

@tshikaboom
Copy link
Owner

tshikaboom commented Aug 23, 2020

So, I found a bug in FileNode: the stp variable initialized at the beginning of the parse function is set to the FileNode's location (and not to where the to-be-parsed FNCR is going to be pointing to). I remember having this same error in the past :)
We'd need:

  • the FileNode's location and size (this is what the m_offset and m_size_in_file did before, and I think we need to store these manually). This is what is given by input->tell() at the beginning of the parse function, and the 13-bit Size field in the FileNode's header.
  • The data referenced by the FileNode (this is what m_fncr should be doing), parsed just after the header.

This gets me to the the file node skipping where you asked what it is used for. I originally implemented it to test FileNodeListFragment parsing, i.e. to skip any "unsupported" node. This was convenient to jump past the node to the next one. As far as I could tell, it did not skip every second file node, as the function did seek exactly past the file node itself (i.e. the start of the next file node, which can be inferred as being m_offset + m_size_in_file of the previous file node).

I also got to a certain point in extracting useful changes in the branch, to cut it to smaller changes easier to review. I think the first four commits in my branch (https://github.com/tshikaboom/libone/tree/filenode-rebased) are in a good enough shape to be used/reviewed, you can check these out if you want. (although I think the Stp/Cb refactoring could use some more work...)

I also noticed you're adding the fileNodeData files in src/lib/meson.build: I found that using the files() predicate in src/lib/fileNodeData/meson.build works well to add these sources to libone_source_files in Meson. This then gets called by the subdir() predicate in src/lib/meson.build. (there is a commit in my branch which does that, which you can steal of course)

@blu-base
Copy link
Contributor Author

Hi Oskar,

thanks for looking into this! =)

I am a bit lost though, where would you like me to continue? It's obvious you have put already work into disecting the commits in this branch https://github.com/tshikaboom/libone/tree/filenode-rebased. Would you like to have PRs to that branch, or would you prefer to have approved commits in here?

So, I found a bug in FileNode: the stp variable initialized at the beginning of the parse function is set to the FileNode's location (and not to where the to-be-parsed FNCR is going to be pointing to). I remember having this same error in the past :)

This gets me to the the file node skipping where you asked what it is used for. I originally implemented it to test FileNodeListFragment parsing, i.e. to skip any "unsupported" node. This was convenient to jump past the node to the next one. As far as I could tell, it did not skip every second file node, as the function did seek exactly past the file node itself (i.e. the start of the next file node, which can be inferred as being m_offset + m_size_in_file of the previous file node).

Thanks for your insight. I'll look into this. Maybe i missed a seek somewhere and come to the impression. But which FNCR do you mean? Not every FileNode does reference to a remote location. So the m_fncr is only a container for the FileNode resource location. If the respective 'FileNode' is supposed to hold a reference to some remote resource, than the 'FileNodeData' classes should hold that information, as far as i understood. I saw the wish to use a FNCR object instead of stp and cb (or offset, and size_in_file), that's why i replaced this structure, to get the fnd abbreviation for the actual content of the FileNode. My intention was that m_fncr should only be a stream pointer for FileNodes. I see this FNCR as duplicate information though, since FileNode's header does contain most information (except the offset).

I'd imagine it being split up in a list of commits like this:

libone_utils: uint{8,16,32,64}_t stream parser operators and their callers

FileNodeChunkReference: enum {stp,cb}_format -> enum class {Stp,Cb}Format
FileNode: enum fnd_id -> enum class FndId (and its to_string() method)

I intended to make the enum more accessible, so i defined a enum class. And since it's a class, CamelCase is suggested

fileNodeData folder to FileNodeData, as I got so much used to reading it with full CamelCase typing, as it is like that in the reference, it seems out of place to see it start with a small letter.. but that's more of a cosmetic change)

Ok, this will be fixed. I had the intention to extrapolate the case of the other directory names, which also are lower case at the beginning.


Moreover, i continued to play with the source a bit. I think, the seg fault is caused by deleting the child object pointers after the parents, which again try to delete their children's pointer.

I am not enough experienced yet to deal with pointers smoothly. Maybe this could be addressed with more smart pointer usage... but again, this makes thinks more complicated again. I'll need to review this topic in more in-depth, though.

@tshikaboom
Copy link
Owner

I am a bit lost though, where would you like me to continue? It's obvious you have put already work into disecting the commits in this branch https://github.com/tshikaboom/libone/tree/filenode-rebased. Would you like to have PRs to that branch, or would you prefer to have approved commits in here?

Anything you prefer! For ease, I'd say let's keep this branch for new commits and as a playground of sorts, to see what's working and what isn't, and extract anything useful in a more proper form to other branches which would get PR'd as we go. Eventually, this PR would get closed as everything would then be integrated, in one way or another. I dunno, let's see how it goes, it will surely need some rebasing at one point too. For the moment, extracting the diffs does not seem too cumbersome for me.
(The other branch will be subject to much heavier rebasing than this, so there'd be not much of a point in doing PRs there...)

But really, anywhere you push or PR is fine by me, just keep in mind that there will be rebases somewhere at one point :)

Thanks for your insight. I'll look into this. Maybe i missed a seek somewhere and come to the impression. But which FNCR do you mean? Not every FileNode does reference to a remote location. So the m_fncr is only a container for the FileNode resource location. If the respective 'FileNode' is supposed to hold a reference to some remote resource, than the 'FileNodeData' classes should hold that information, as far as i understood. I saw the wish to use a FNCR object instead of stp and cb (or offset, and size_in_file), that's why i replaced this structure, to get the fnd abbreviation for the actual content of the FileNode. My intention was that m_fncr should only be a stream pointer for FileNodes. I see this FNCR as duplicate information though, since FileNode's header does contain most information (except the offset).

Yeah, I'm wondering myself too about this. If i understand you correctly, we could actually get rid of the m_fncr member in class FileNode, as it will be passed anyway to the FileNodeData objects. In that case, this could mean that we'd only create the FileNodeChunkReference in parse() or parse_header(), then pass it to the to-be-created FileNodeData objects. Which would also mean less confusion about offsets in FileNode, as we'd only have to deal with the FileNode offset and size itself in the class definition, and we could get the FileNodeData's location and offset (stp, cb) from something such as IFileNodeData::get_fncr() or get_location(), or anything really? Something like this?

Moreover, i continued to play with the source a bit. I think, the seg fault is caused by deleting the child object pointers after the parents, which again try to delete their children's pointer.

I am not enough experienced yet to deal with pointers smoothly. Maybe this could be addressed with more smart pointer usage... but again, this makes thinks more complicated again. I'll need to review this topic in more in-depth, though.

This is a path both of us will have to go through, it seems :) I'm not that fluent in C++ pointers either. We'll get there one day!

@tshikaboom
Copy link
Owner

Yeah, I'm wondering myself too about this. If i understand you correctly, we could actually get rid of the m_fncr member in class FileNode, as it will be passed anyway to the FileNodeData objects. In that case, this could mean that we'd only create the FileNodeChunkReference in parse() or parse_header(), then pass it to the to-be-created FileNodeData objects. Which would also mean less confusion about offsets in FileNode, as we'd only have to deal with the FileNode offset and size itself in the class definition, and we could get the FileNodeData's location and offset (stp, cb) from something such as IFileNodeData::get_fncr() or get_location(), or anything really? Something like this?

I'll have a go at refactoring FileNode like this, on top of what you did with the FileNodeChunkReference refactoring, to get a more clear idea about all this, maybe I'll have something before this weekend. Just keep me posted if you're already on it, so we don't step on each others' toes and do redundant work :)

@blu-base blu-base force-pushed the filenodedata branch 2 times, most recently from 5327412 to fd73115 Compare August 24, 2020 23:38
@blu-base
Copy link
Contributor Author

This branch now reflects/picked most rebases you proposed. I tried to keep commits cleaner now.
Some large chunks remain: 3b8dba4 094a408 a51688b

(Btw, there was a small bug in ObjectDeclarationWithRefBody, initially it read 8 bytes of data after m_oid, but its the respective tail structure is only 6 bytes long.

A few changes from the original PR are not yet recovered. I'll likely finish this tomorrow evening.

@tshikaboom
Copy link
Owner

tshikaboom commented Aug 25, 2020

I had some progress on my side in rebasing too: I've got the fourth commit (FNCR/FileNode changes) to a point where there are no "functional" changes anymore, and kept them in separate commits afterwards, which I'll use as inspiration to simplify the code of class FileNode itself (which should not be that important for this PR in particular, but I'd like to at least keep the spirit of your changes and not just discard them, so I'll do that). I think the changes in the commit are straightforward enough to be merged, as I kept out actual logic changes.

Doing this, I also got rid of the FileNode offset bug introduced before and reactivated file node skipping. And weirdly enough I don't get any segfaults anymore! I also get the same number of lines when doing one2raw <file.one> | wc -l between current git master and filenode-rebased (built with debug messages activated). I guess I pushed away too many logical changes or something... the delete in ~FileNode() was also kept, so I can't really explain how I got to this point.

Nothing has changed (I think) in the other commits (besides those to-be-redone commits, about which I'll do something later on), so you won't have to look at them.

@tshikaboom tshikaboom mentioned this pull request Aug 25, 2020
@blu-base blu-base force-pushed the filenodedata branch 5 times, most recently from d70e458 to 4606453 Compare August 27, 2020 18:14
@blu-base
Copy link
Contributor Author

Oh well, this took way more energy than anticipated...

I tried to keep the commits cleaner now, as suggested. Moreover, i tried to keep most logic intact.
Only the FileNode's parse_header method is merged with the main parse method, so there is no additional need to weirdly set the m_offset value (but ultimately, a disputable change).

The current state also does not change all occurrences of get_location and get_size as i initially did. Only the FileNodeChunkReference receives a change, since m_stp and m_cb are not pointing to the location of that FileNodeChunkReference but a chunk.

There are also two bugs fixed in this PR. The first one was initially present: 6bd80cd, the FileNodeChunkReference was not multiplying a compressed value after parsing it to store it in the uncompressed member value. The second one was located in src/lib/meson.build where the subdir command was after the respective meson target declaration. It added the FileNodeData files after the declaration effectively loosing this information at compilation.

However, i still have an error introduced which i haven't fully located yet. Something puts the parsing to the wrong offset after the first FileNodeListFragment. This will take me some more time. I will try not change the current commit history any further till you have had a chance of inspecting it. Though i likely add further commits to fix this issue.


I had put some more time into libmson. Again, i see this as playground, or quick-and-dirty prototyping.

However, i managed to proceed a good deal. It has now an executable which converts a OneNote file into a xml structure.
This xml structure is not yet MS-ONE since i miss some MS-ONE document abstraction. Instead it writes the xml representing mostly the in memory relations---quite raw, though... I can "find" already text content. The next major task is to build the abstraction layer for ObjectSpaces.

Trying this tool out might be interesting for you, too, i guess. At least as comparison.
After compiling, in the <build>/src/ is a one2xml executable. With the argument -fyou can define to be parsed files, or -d to convert a whole directory. Such as one2xml -f Section\ 1.one.

There are a number of issues though. one2xml doesn't respect the current work directory yet. So, you need to call one2xml while being in the same directoy. Moreover, i did not print individual FileNodeListFragments into the xml, only lists of FileNodes. Yet, some content is still duplicate. It might be because of different revisions.

As much as libmson is a toy, it brought me closer to understanding the MS-ONESTORE spec. And, we can already discover undocumented PropertyIDs.


Furthermore, did you read about the update of qt-creator to 4.13? It now has a native meson project manager. Although i often use vim, i do like qt-creator using to some extend.

@blu-base
Copy link
Contributor Author

There is a chance, you might not like this latest commits. I did not succeed in fully understanding the flow of parsing the FileNodeListFragment in the current master... maybe other issues interfered. That's why there are couple of additions since my previous post.

The segfaults i got originated from the destructor of FileNode when it is pushed back into the vector within the FileNodeList/FileNodeListFragment parse methods. I came to the conclusion, that the implicit, default constructors might not be well structured. That's why FileNode received a set of copy constructor and copy assignments.

This required to deep copy FileNodeData as well. For now this is resolved with clone()methods to be implemented, requested by IFileNodeData.I tried a template based version where only IFileNodeData contains a clone template. I did not get this to work, and it wasn't as straight forward is the present commits.

One FileNodeData class also needed its own copy constructor since it is holding an raw array. Maybe this can be replaced by a std container.

After that, the latest commits propose a different parsing structure with fewer side loops and also an entry point when we have fully implemented the transactionlog map.

@blu-base
Copy link
Contributor Author

I just found an issue with 830f0dd and 4803cea where i likely messed up while rebasing. I would revise this when you got a chance to examine the current state.

@tshikaboom
Copy link
Owner

tshikaboom commented Aug 28, 2020

Yeah, there is a lot of stuff going on in here, although I've started reading this, I won't get to the bottom of that by this evening.

FYI, master seems to merge onto this, barring a few conflicts in JCID (oh, also anything related to FileNode, but resolving these were more mechanical than anything else). The result of that compiles. In case you were wondering :)

@blu-base
Copy link
Contributor Author

blu-base commented Aug 28, 2020

The commits 05d2043, acec0fc are flawed as well. I'll need to track down where the original changes went during rebasing...

@tshikaboom
Copy link
Owner

tshikaboom commented Aug 28, 2020

I think these are now in eadc26b and cef5958 in master :)

@tshikaboom
Copy link
Owner

tshikaboom commented Aug 28, 2020

This was actually quite straightforward to look at! Thanks for rebasing this and separating the commits in a more logical way than before, this made it much easier to follow through. I thought it would take a longer time, but at this point, 90% of the newly added code consists of classes which do (some) parsing and have getters and setters. These are "simple" enough, I must say I also like the conciseness of streaming the input pointer into the respective members of the class in the parse() functions. So I did not actually read this line by line, as there so much new stuff (this is some thankless work, actually, as it does involve reading all of the spec), and would argue that the way to go would be to fix the boilerplate once we start to actually use them in a meaningful way.

The FileNodeListFragment changes also look good. (I'm not a grumpy guy wrt. changes to my code, I just like non-regressions :) but one can always argue that this library was not functional anyway, to begin with) Regarding FileNodeList itself, that class was more or less my interpretation of a FileNodeList having a certain number of FileNodes, but abstracting away the FileNodeListFragment structures themselves as being in the file, so that's why there's that much jumping around in the code. The main loop's job in the parse() function of FileNodeList more or less combines all the FileNodes (an std::vector) of all the successive FileNodeListFragments into a kind of "higher level" vector containing them all.

I'll have a fresh look at this branch tomorrow again to see if there'd be anything "big" to do, but at this point I'd be more inclined to merge this and deal with the rest later, otherwise all the rebasing and merging and resolving and combining is going to get tiresome (as you'd imagine yourself).. If we get stuck in the details, we'll never get out of this. And there is nothing screaming "obviously wrong" either in the code, so, yeah :)

I'll have another look at the fragment parsing anyway tomorrow to see where it gets used, and how we could dedup some code (if any), but, say, I can already say that the parseFileNodeListFragments() function obsoletes like 90% of the functionality of my own FileNodeList class, the semantics of how this is all presented is all different, but this is just details.

@blu-base
Copy link
Contributor Author

I'll have another look at the fragment parsing anyway tomorrow to see where it gets used, and how we could dedup some code (if any), but, say, I can already say that the parseFileNodeListFragments() function obsoletes like 90% of the functionality of my own FileNodeList class, the semantics of how this is all presented is all different, but this is just details.

I added this, because in libmson i also had about three occurrences of that same loop. When I fixed a bug at one of these occurrences, and continued to work one something else, i came across another bug which was familiar... it was the same bug at a different location. we'll need this in the RootNodeFileList, ObjectSpaceManifestList, and RevisionManifestList.

@tshikaboom
Copy link
Owner

Hey, sorry, I got caught up into stuff and haven't had the energy to at least do a high-level todo-list (to not forget anything) before merging this, I'll try to do that this weekend.
Cheers!

This commit replaces most occurances of librevenge::RVNGInputStream
*input with the shared_ptr specified in libone_util.h, which is
typedefed as libone::RVNGInputStreamPtr_t
blu-base and others added 18 commits September 9, 2020 00:34
while refactoring FileNode, the wrong FileChunkReference has been
inserted. That wrong FCR references the FileNode, not the Chunk.
Now the respective FileNodeData is casted to get the correct Ref.
This doubles as marker when no FileNodes are left in a
FileNodeListFragment
this removes the structures which checks for the end of
the FileNodeListFragment. Moreove, it removes FileNode skipping,
since all types of Nodes can be parsed now. And finally, the
FileNodeListFragment's padding skipper is removed because its prone
to found invalid FileNodes when the padding is not all zerod.

Additionally a switch for TransactionEntry's count for specific
FileNodeListFragment is added, though, there is no direct way
to get that information yet. Maybe, the transaction log could
be parsed for each FileNodeListFragment if the overhead is
small, since it's location is given in the header also found
in the input stream which is given to FileNodeListFragment.
The previous commit modified the parsing structure for
fileNodeListFragments. There is no more walker for zeroed padding.
That's why FileNode can parse a zeroed head - which allowed d to
become zero, which is not valid.
However, this validity check only applies to valid filenode structures.
the respective switch structure interferes with the discreet
FileNodeData. The previous FileChunkReference m_fnd is now
contained in the respective FileNodeData (if applicable).
@blu-base
Copy link
Contributor Author

blu-base commented Sep 8, 2020

I had another go at rebasing to get rid of the issues i had missed the previous time.

Whenever possible, i tested building and running one2raw. Most steps seem to not break anything.
However, i have a few commits which rely on code which is submitted in subsequent commits. I did this to keep chunks small. But i am not sure whether this is considered to be an acceptable style, since it "breaks" things until the respective code has been added.

Regarding parseFileNodeListFragments() and FileNodeList: I hadn't looked into FileNodeList too much and was preparing to follow the route i was going with my other attempt with libmson. Maybe "forgetting" about the Fragments after parsing to keep the complete FileNodeList is the better strategy: After parsing we'll need to dissect the FileNodeLists to extract the information which composes the document. I imagine this would mean adding a layer of abstraction between the document composer and the raw parsing loop to get a flatter object tree to work with.

From another perspective, keeping parseFileNodeListFragments() and integrating it into FileNodeList would allow the reuse of parseFileNodeListFragments() elsewhere, though, i don't see where this is necessary when FileNodeList would return the same information.

Let me know whether to drop 797b8b5.

@blu-base blu-base changed the title WIP implement FileNodeData classes implement FileNodeData classes Sep 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants