Daz loading/saving is slow, single-threaded, and doesn't cache enough
I've been working on what is apparently a relatively large Daz project (though 28 megs doesn't sound large to me), and I'm running into some really bad performance when loading and saving projects, turning on Iray, etc. Opening the project or enabling Iray mode takes about a minute, and saving the file (with no changes) takes about 20 seconds.
So putting on my programmer hat, I decided to do some analysis to see what was happening. To make a long story short, loading, saving, and switching to Iray mode all involve long periods of time in which everything is happening entirely on the main thread, so you're getting no benefit from multithreading (and the main thread is blocked). This appears to largely be because all of the conversion to and from the underlying data happens while serializing it to disk using a Qt JSON API.
It seems to me that this code would benefit greatly from a few engineer weeks of optimization — specifically:
- Parse the JSON first and store it in an intermediate form.
- Split the JSON objects up into groups and deserialize them into C++ objects on multiple threads in parallel.
- Consider using a more efficient serialization library, such as protobuf, rather than QObject-based JSON serialization.
That last one is a big one. They did some benchmarks here:
https://www.qt.io/blog/2018/05/31/serialization-in-and-with-qt
and QObject's JSON serialization took 263 ms to write 10,000 records to disk. Protobuf wrote the same records in binary form in just 5 milliseconds, which is more than 52x faster. Protobuf's binary encoder is only about 5x faster than its JSON encoder, so you could likely get a 10x win just by switching to the protobuf library's JSON encoder, which would maintain full compatibility with existing versions of the app.
But the point of the protobuf suggestion wasn't so much that Daz should move to a new file format. Rather, by using a different in-memory format, it would then be trivial to create an ultra-fast binary protobuf cache for the mountain of largely unchanging purchased assets in users' Daz libraries, and to slow-load those files from JSON only when they have changed since the binary cache was last updated, giving you that 52x performance boost 99.99% of the time.
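To make that concrete, a minimal sketch of that kind of timestamp-gated binary cache might look like the following. Everything in it is a hypothetical stand-in - the daz::Asset protobuf message, asset.pb.h, and loadAssetFromJson() are invented names, not real Daz or SDK APIs:

```cpp
// Hypothetical sketch of a timestamp-gated binary cache next to each JSON asset.
#include <filesystem>
#include <fstream>
#include "asset.pb.h"   // assumed: generated from a hypothetical asset.proto

namespace fs = std::filesystem;

daz::Asset loadAssetFromJson(const fs::path& jsonPath);   // stands in for the existing slow path

daz::Asset loadAssetCached(const fs::path& jsonPath)
{
    const fs::path cachePath = jsonPath.string() + ".pbcache";
    daz::Asset asset;

    // Cache hit: the binary cache is newer than the JSON source, so the
    // expensive JSON parse is skipped entirely.
    if (fs::exists(cachePath) &&
        fs::last_write_time(cachePath) >= fs::last_write_time(jsonPath)) {
        std::ifstream in(cachePath, std::ios::binary);
        if (asset.ParseFromIstream(&in))
            return asset;
        // Fall through and rebuild if the cache is missing fields or corrupt.
    }

    // Cache miss: do the slow JSON load once, then write the binary cache.
    asset = loadAssetFromJson(jsonPath);
    std::ofstream out(cachePath, std::ios::binary | std::ios::trunc);
    asset.SerializeToOstream(&out);
    return asset;
}
```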
By deferring processing of the main file until all referenced projects are loaded, and further distributing the workload of parsing all of the referenced projects onto a pile of worker threads, Daz could likely gain major additional performance wins, to the tune of almost another order of magnitude, assuming that the average computer has at least 8 real or hyperthreaded cores. It might also be highly beneficial for each decoded object to cache the encoded form of that object so that it can write the exact blob of JSON data back to disk as-is if nothing changed since it was read from disk (and then do a reindent pass at the end, which, being a really simple O(n) operation, should be trivial even for huge files).
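As a rough illustration (not a claim about how the real loader is structured), the fan-out part could be as simple as mapping the parse over Qt's thread pool; linkIntoScene() and referencedFiles are placeholders:

```cpp
// Sketch only: parse each referenced file on the Qt thread pool, then do the
// order-sensitive linking on the main thread afterwards.
#include <QtConcurrent/QtConcurrent>
#include <QFile>
#include <QJsonDocument>

void linkIntoScene(const QJsonDocument& doc);   // placeholder for the existing link step

static QJsonDocument parseOneFile(const QString& path)
{
    QFile f(path);
    f.open(QIODevice::ReadOnly);
    // The files are compressed, per the discussion above, so a decompress step
    // would go here before handing the bytes to the JSON parser.
    return QJsonDocument::fromJson(f.readAll());
}

void loadReferencedFiles(const QStringList& referencedFiles)
{
    // Fan the pure parsing work out across all cores...
    QFuture<QJsonDocument> parsed =
        QtConcurrent::mapped(referencedFiles, parseOneFile);
    parsed.waitForFinished();

    // ...and keep the dependency-sensitive linking single-threaded for now.
    for (const QJsonDocument& doc : parsed.results())
        linkIntoScene(doc);
}
```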
So I suspect that with just a bit of effort, Daz could improve load performance by a factor of 400 or more, bringing load times down from painful territory into "oh, it's done" territory.
Anyway, this is mostly stream-of-consciousness ideas about how performance could be improved. But the main takeaway is that I think load and save performance could be improved by at least an order of magnitude, and probably more like two orders of magnitude, by careful use of caching, better use of multithreading during the decode process, and other probably relatively straightforward architectural changes to the loading and saving code.
What are the odds of convincing Daz to make these sorts of low-hanging fruit performance wins a priority for the next version? Anybody know a way to get these ideas to somebody who can actually make it happen?
Comments
The big delay is apparently not reading from disk but setting up the links between properties - compare loading a morph-heavy human figure with loading a big scene that has a lot of geometry but few, if any, links between properties.
Don't forget that a scene file, if that is where you are getting your scene size from, is pointing to external assets which may well have their own dependencies - and that those interrelations may well be a serious obstacle to multi-threading.
I think @dgatwood has a point there, as the asset files (apart from HD ones) are all saved in human-readable format, which is handy for tinkering with them, but the computer thinks in a different language... The cache file could speed up the loading process, but currently that one is also in human-readable format and needs translating when loading a figure. Even if nothing else were changed but the cache file being stored in compiled form, that could already speed up the process considerably with large asset libraries, as the cache would not need to be translated every time one loads a figure. Of course, rebuilding the cache after making changes to the morph library would take longer, but that would be done only when something was changed, not every time one loads a figure.
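Just to illustrate what a "compiled" cache could look like in principle - this is a generic sketch with an invented MorphEntry struct, not the real DS cache layout:

```cpp
// Generic illustration: stream already-parsed morph metadata out with
// QDataStream so it doesn't have to be re-translated from text on every load.
#include <QDataStream>
#include <QFile>
#include <QVector>

struct MorphEntry {
    QString id;
    QString label;
    float   min = 0.0f, max = 1.0f;
};

QDataStream& operator<<(QDataStream& s, const MorphEntry& m)
{ return s << m.id << m.label << m.min << m.max; }

QDataStream& operator>>(QDataStream& s, MorphEntry& m)
{ return s >> m.id >> m.label >> m.min >> m.max; }

bool writeMorphCache(const QString& path, const QVector<MorphEntry>& entries)
{
    QFile f(path);
    if (!f.open(QIODevice::WriteOnly)) return false;
    QDataStream out(&f);
    out.setVersion(QDataStream::Qt_5_15);   // pin the stream format for compatibility
    out << entries;
    return true;
}
```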
The scene in question has lots of geometry, but just single-digit morphs. I doubt that performing those morphs takes a meaningful percentage of the load time.
I'm not sure what you mean by "links between properties". Do you mean things like associating the path of an asset (e.g. a material), with the actual data? Isn't that basically just a matter of looking up a UUID or path or whatever in a hash? Or are you talking about something else?
Normally, the way I've always done things like that is to create a dependency graph and then process from the leaves upwards. For example:
https://github.com/aosm/headerdoc/blob/master/headerDoc2HTML.pl
starting in line 3873. Create the graph (taking care to do something special for any circular dependencies, assuming that's even possible in this data model, which it hopefully isn't), then iterate it recursively, keeping track of the current depth. Each time you reach a node, annotate it with the depth of that node. If you hit a node more than once, increase the depth as needed, but never decrease it. Then, iterate the graph a second time and add each node into an array based on its marked (worst-case) depth. Then process the arrays in order, starting with the highest-numbered array and working your way to the lowest-numbered array. In a single-threaded world, that guarantees that all dependencies are processed before they are needed.
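Something like this, minus error handling (Node and deps here are generic stand-ins, not actual Daz types):

```cpp
#include <algorithm>
#include <vector>

struct Node {
    std::vector<Node*> deps;   // the nodes this one depends on
    int depth = -1;            // worst-case depth; -1 means "not visited yet"
};

// Recursively mark each node with its worst-case depth. Depths only ever
// increase, and the early return keeps already-deep nodes from being re-walked.
// Assumes no cycles, per the caveat above.
void markDepth(Node* n, int depth)
{
    if (depth <= n->depth)
        return;
    n->depth = depth;
    for (Node* d : n->deps)
        markDepth(d, depth + 1);
}

// Second pass: bucket every node into an array per depth level. Processing
// then runs from the deepest bucket (levels.back()) down to level 0.
std::vector<std::vector<Node*>> bucketByDepth(const std::vector<Node*>& roots,
                                              const std::vector<Node*>& allNodes)
{
    for (Node* r : roots)
        markDepth(r, 0);

    int maxDepth = 0;
    for (Node* n : allNodes)
        maxDepth = std::max(maxDepth, n->depth);

    std::vector<std::vector<Node*>> levels(maxDepth + 1);
    for (Node* n : allNodes)
        levels[std::max(n->depth, 0)].push_back(n);   // unreachable nodes land in level 0
    return levels;
}
```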
In a multithreaded world, it is slightly harder, but only because you have to keep a hash table or similar with the status of each processing task and then do a condition wait or similar if one of your dependencies hasn't been processed yet. But if each helper thread pops items off the highest nonempty level first, you won't do *much* waiting except right as you switch levels. Or you can simplify that approach by running the helpers in parallel on the highest level and waiting to start any helpers on the next level until the last helper at the previous level finishes processing. You'd take a slight parallelism hit by doing that, but you'd save a lot of code complexity. Either way works.
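A minimal sketch of that simplified variant, reusing the levels array from the previous sketch (processNode() is a placeholder for the real deserialization work):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

struct Node;                 // same Node as in the previous sketch
void processNode(Node* n);   // placeholder for the real per-node work

void processLevels(std::vector<std::vector<Node*>>& levels)
{
    // Deepest level first, so every dependency is finished before its dependents.
    for (auto level = levels.rbegin(); level != levels.rend(); ++level) {
        std::for_each(std::execution::par, level->begin(), level->end(),
                      [](Node* n) { processNode(n); });
        // for_each only returns once the whole level is done, which acts as the
        // barrier between levels described above.
    }
}
```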
By links I meant things like the link between a full-body shape and the head and body shapes, or between those and corrective morphs for bends and other morphs.
Ah. Got it. Connections between bones in a wall that has an attached door, for a non-figure example. Presumably each one is a separate object that has to be constructed, with a parent-child relationship to the owning object. But with only 887 total nodes, that's not a lot of work, I wouldn't think. Same with the morphs. If they take a fraction of a second to construct from an obj file source, I'd expect them to take a fraction of a second to parse out of the .duf file, too.
And there are only 887 nodes in this project, each of which presumably is backed by a single object in memory, give or take, plus whatever data structures are used for keeping vertices and edges between vertices and polygons in memory. Daz is using 9.958 gigabytes of RAM after loading the 28-megabyte model. But of course, there are 719 external model file fragments referenced by this project, with a lot of duplication (multiple parts of the same model). There are 362 actual files totaling 162,280,256 bytes if you ignore the auto_adapted morphs that don't seem to actually exist as files on disk and one tiny little model that for some reason I can't actually find. :-/ So in total, it probably pulls in on the order of 190 megabytes of compressed JSON data.
So that gives some idea of how much expansion is involved in model loading. Subtracting the 360 megabytes that Daz uses without a file open, that means the in-memory representation is around 50x the on-disk size. That's some serious memory use. But it's not the memory use that's a concern. I mean, I have 64 GB of RAM, so 10 GB... I've forgotten how to count that low. :-D The concern is that it must be creating a crazy number of relatively large objects in memory to use that much space, and it is redoing all of that work every time it loads the project.
To be fair, the JSON is compressed, and uncompressed, it balloons by 15x or so. But the 28-megabyte file that balloons to 383,498,244 bytes really contains only 236,003,387 bytes of content and 147,494,857 bytes of strippable whitespace (based on minifying it), so the real expansion is more like 10x. And given that the bulk of the data in the JSON file is structure rather than data, even that is misleading, because most of that structure takes up zero space when you load the data (unless the in-memory representation is something absurd, like storing the JSON keys in a hash table); it gets computed in the form of offsets into a struct. Though to be fair, there's no way to know how many unused properties get skipped during serialization, which can bloat data structures in memory. Anyway, as a ballpark estimate, stripping the keys brings it to under 100 megabytes of data. So basically, the compression is really only shrinking the data to maybe a third the size of the actual values, which probably puts the compressed binary right in line with what an in-memory binary representation of most of those values would look like size-wise (with floats instead of ASCII strings, etc.). So yeah, that's probably a real 50x bloat, give or take.
Anyway, I'm thinking that the performance hit has to have something to do with converting the vertices or faces into object form to reach the orders of magnitude of effort required to take a minute to read the file. Here are the stats on the project:
Total Vertices: 2,874,912 / 2,115,894
Total Triangles: 466,507 / 466,747
Total Quads: 2,410,830 / 1,693,332
Total Faces: 2,877,337 / 2,160,079
Total Lines: 3,934 / 3,934
That comes out to 21.5 microseconds per vertex of computation, or almost 67,000 clock cycles per vertex. That's a *lot* of cycles.
If you figure finding the final position of a vertex is a single fused multiply-add per coordinate (is it?), then that's on the order of 9 million operations, at 4 operations per cycle, with 4-cycle latency. So if you could actually keep the pipeline full, an M1 Max core would crunch through that in about 0.7 milliseconds.
And parsing 9 million floating-point values from a text file... takes four tenths of a second (I timed it), which is also not enough time to matter. Even if I malloc storage for each value individually ahead of time, it barely crosses the half-second mark.
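For anyone who wants to reproduce that ballpark, here's roughly the kind of quick-and-dirty timing test I mean (numbers will obviously vary by machine; this just shows the order of magnitude):

```cpp
// Parse ~9 million ASCII floats from an in-memory buffer with strtod and time it.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

int main()
{
    // Build a ~9M-value text blob in memory so disk speed isn't part of the measurement.
    std::string text;
    text.reserve(9'000'000 * 10);
    for (int i = 0; i < 9'000'000; ++i)
        text += "0.1234567 ";

    std::vector<double> values;
    values.reserve(9'000'000);

    auto start = std::chrono::steady_clock::now();
    const char* p = text.c_str();
    char* end = nullptr;
    for (double v = std::strtod(p, &end); p != end; v = std::strtod(p, &end)) {
        values.push_back(v);
        p = end;
    }
    auto elapsed = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("parsed %zu floats in %.2f s\n", values.size(), elapsed);
}
```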
So where do all those cycles go? I have no idea. My guess is that it comes from actually converting meshes to their in-memory representation. Each vertex is probably an object, which they presumably have to store in a temporary array, and each polygon is probably an object that presumably contains an array of pointers to vertex objects, which they then have to look up in that temporary array while constructing the pointer array.
But even so, everything I just described should be O(n) unless they're doing the array construction with something slow, like STL vectors grown one element at a time without reserving capacity (or worse, some abstraction layer on top of that). I mean, based on a quick run in Instruments, the fact that it spends almost 3 seconds of time freeing QString objects (mostly in calls from findClothNodeDataItem) tells me that this is not as unlikely as I'd like to believe, but....
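For comparison, the boring O(n) way to build that vertex array, with the count known up front from the file, looks something like this (Vertex here is just an illustrative struct, not a Daz type):

```cpp
#include <cstddef>
#include <vector>

struct Vertex { float x, y, z; };

std::vector<Vertex> buildVertexArray(std::size_t vertexCount,
                                     const float* rawFloats)   // x,y,z triples
{
    std::vector<Vertex> verts;
    verts.reserve(vertexCount);   // one allocation instead of repeated regrowth
    for (std::size_t i = 0; i < vertexCount; ++i)
        verts.push_back({rawFloats[3 * i],
                         rawFloats[3 * i + 1],
                         rawFloats[3 * i + 2]});
    return verts;
}
```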
It also spends almost seven seconds doing JPEG decoding for textures that aren't even visible onscreen and could be lazily loaded when it starts to become plausible that they need to be rendered (and which could very easily be done in parallel, too). And it spends another second or so doing PNG decoding.
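Moving that decode work off the main thread is about as simple as parallel work gets; something along these lines (decodeTexturesInBackground is a made-up name, and the real code would presumably also want to defer the work until a texture is actually about to be used):

```cpp
// Sketch: decode all the scene's textures on the Qt thread pool instead of the main thread.
#include <QtConcurrent/QtConcurrent>
#include <QImage>
#include <QStringList>

static QImage decodeOne(const QString& path)
{
    return QImage(path);   // this constructor does the JPEG/PNG decode
}

QFuture<QImage> decodeTexturesInBackground(const QStringList& texturePaths)
{
    return QtConcurrent::mapped(texturePaths, decodeOne);
}
```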
And DzFacetVertexSubPath::remapPath is a crazy hot path. 35.86 wallclock seconds were spent in that function, with more than half of that time spent ten or more levels of recursion deep. DzScopeResolver::initIdentifierReplacement seems to be a good chunk of this, at 5.26 seconds, in part because it is doing additional remapPath calls, and those, in turn, are calling QImage::save. The total time in those save calls is almost five seconds. What the heck is Daz doing writing out images while loading a file? Is this part of some sort of image conversion? If so, why not cache the resulting representation and not spend those five seconds next time? If not, what's going on? Please tell me it isn't updating product thumbnails.... :-D
Here's a fun one. It spends 4.2 seconds in QApplication::topLevelAt. I think this is probably an artifact caused by the time while the menu item was stuck down, because they didn't properly defer the start of the loading process until the next run loop cycle. At least I hope that's the case. Otherwise, I have no idea what's going on. :-D
It spends about 2.26 seconds doing QString conversions, and another third of a second converting QString objects to lowercase. It spends a third of a second doing string equality comparisons. I'm guessing those three or four seconds represent parser overhead, and there's probably more where that came from.
Anyway, I guess the real point of all of this is that there's a crazy amount of work happening while loading these files, resulting in multiple orders of magnitude penalty compared with the amount of time it takes to just read the data from disk, and if that work could somehow be preserved and reused, it would be a huge win. And there's a lot of stuff happening, like JPEG and PNG parsing, image conversion (I guess), etc. that could trivially be done on any thread, but isn't. And all of that adds up to a rather large performance bottleneck once your projects start to get large.
On the subject of serialization, one particularly cool (but terrifying) strategy I saw once involved a piece of audio software in the 90s and early 2000s called BIAS Deck. They had a bunch of ostensibly C (but actually probably Pascal!) data structures, and they serialized them out to disk by literally writing them byte-for-byte, pointers and all. Each of the original structs started with a four-byte character code to identify the type of structure, and also contained its own address at a known offset as an identifier. So they would read the four bytes, figure out what type of structure to allocate, and copy the data into the newly allocated chunk of memory, then add it to a mapping table that mapped the old address (in the file) to the new address (allocated with malloc or whatever). Then, after they loaded the data, they iterated through the objects and updated all of the pointers. Note that I'm not recommending this strategy. :-D
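For the curious, the load side of that scheme boils down to something like the following - a rough reconstruction from memory, with an invented Record struct, and again, not a recommendation:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Record {
    std::uint32_t typeCode;   // four-byte character code identifying the struct type
    Record*       self;       // the address the struct had when it was written out
    Record*       next;       // example pointer field that needs fixing up
    // ... real payload fields would follow ...
};

void fixUpPointers(std::vector<Record*>& loaded)
{
    // Map each record's old (on-disk) address to where it lives now.
    std::unordered_map<Record*, Record*> oldToNew;
    for (Record* r : loaded)
        oldToNew[r->self] = r;

    // Second pass: rewrite every stored pointer through the mapping table.
    for (Record* r : loaded) {
        r->self = r;
        if (r->next)
            r->next = oldToNew.at(r->next);
    }
}
```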
I realize C++ objects make serialization harder, but Boost has a serialization API that I've read is pretty decent. That plus a little bit of versioning metadata would probably be adequate. For every Daz model loaded from the library, the first time you load it in a given Daz Studio version, write the data back out to disk with Boost serialization so that the final resulting pile of classes and data structures can be quickly reloaded. And of course do timestamp comparisons to ensure that the model hasn't changed. Judging by what I'm seeing, I would expect at least an order of magnitude improvement in performance with that strategy, even if they do nothing else, and possibly two.
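A bare-bones version of that, with an invented MorphDelta type standing in for whatever Daz would actually serialize (the timestamp check would be the same as in the earlier cache sketch):

```cpp
#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/vector.hpp>
#include <boost/serialization/version.hpp>
#include <fstream>
#include <string>
#include <vector>

struct MorphDelta {
    int   vertexIndex = 0;
    float dx = 0, dy = 0, dz = 0;

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/)
    {
        ar & vertexIndex & dx & dy & dz;
    }
};

BOOST_CLASS_VERSION(MorphDelta, 1)   // bump when the layout changes

void writeCache(const std::string& path, const std::vector<MorphDelta>& deltas)
{
    std::ofstream out(path, std::ios::binary);
    boost::archive::binary_oarchive oa(out);
    oa << deltas;
}

std::vector<MorphDelta> readCache(const std::string& path)
{
    std::vector<MorphDelta> deltas;
    std::ifstream in(path, std::ios::binary);
    boost::archive::binary_iarchive ia(in);
    ia >> deltas;
    return deltas;
}
```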
What Richard is referring to is the interdependencies that the morph dials have with other morph dials: some dials change the values of a number of other dials, and a number of other dials may change the value of the dial one is looking at. Determining these interdependencies at loading time is something that takes time.
Today, I learned that joint-controlled and morph-controlled morphs and similar exist. Wow, that's amazing. But no, I'm not using anything like that. This is just a building. The only active morphs in this scene are Morph Loader morphs, one per simple, boneless, jointless object, so none of that is happening in this case. There are a bunch of wall segments that have morphable ends, but all the morph dials are set to zero. I assume those don't impact performance at all, being inactive and all.
Having dials set to zero makes no difference, as it is the existence of the dials and any interdependence between them that takes time to build, but OK, a building would not have many of them.
A character, on the other hand, may have thousands of dials, and each of those dials may have several dials that are affected by its movement and/or several dials that affect the movement of the dial one is looking at.
Sorry to hijack this thread for a related technical question, but I am trying to choose a CPU that would be best optimized for Daz Studio's modeling/editing/loading/saving bottlenecks. There is lots of info out there on how to choose the right GPU for rendering, but I am struggling to find info on hardware choices for all of the active computation that comes before that.
Some articles I've read about CGI active workloads suggest that single-core performance is the most important, so I've been looking at purchasing a new AMD platform. AMD is about to release their updated Ryzen 9 7950 with 3D V-Cache, the Ryzen 9 7950X3D, which promises to break records for single-core performance. The X3D part comes with 144 MB of combined cache, while the regular CPU only comes with 80 MB. Is this something that Daz Studio can utilize? Are there any other optimizations I can make for active computation, with RAM for instance? Speeding up my editing/animating/modeling workflow is my main goal right now.
Any insight would be truly appreciated. Amazing to see people are thinking about how to optimize this software!
I would not sell a kidney for a CPU. The CPU is not discussed much because it doesn't make that much of a difference, at least not with DS 4.x - nobody knows about DS 5 yet.
The components that are important for DS are: RAM, minimum 32 GB, but more is better.
The GPU: an nVidia RTX GPU with an absolute minimum of 8 GB of VRAM, but 12 GB recommended.
A PSU that can handle the load; I would start at 750 W.
Of course one can throw as much money at the computer as one likes, but a six-year-old i7-5820K running on W7, with 64 GB of RAM and a collection of SSD and USB drives, handles the load pretty well - so well that I haven't bothered building the new i9-9940X rig yet, for which I already have almost all the components.
You'd think it would be a very simple caching operation wouldn't you.
Whatever's actually going on under the hood has long baffled me. I've written software that parsed and indexed huge sets of files with links before (in this case engineering diagrams, component libraries and so on), where it took quite a long time at startup if components had changed (with a progress bar to show how far along it was), but took only a few milliseconds thereafter. Relationships between components don't change that often, so of course the cache rarely needed to be rebuilt. When it did, you could go and make a cup of coffee and it'd be done by the time you returned.
The problem with DS caching is that the cache is still in human-readable format. The text from the installed morphs is just collected into a single large text file (sans the vertex deltas), and the information needs to be compiled every time a figure is loaded. Maybe it made sense when hard drives were a lot slower, but it offers little benefit today.
The caches I made were also in human-readable form. Depending on the format you choose, parsing can be a very quick operation indeed. I would question the "multi-thread it" answer, which I always see whenever someone speaks of speeding something up. Whether it actually improves performance depends on where you need to enter critical sections, mutexes and so on, and whether other threads need to wait around. There's also a massive slowdown in many algorithms if threads are trampling all over the cache. Sometimes, quite often actually, it's faster to do things in sequence on a single core. But of course it depends...
Finally there's the software maintenance aspect. I always keep the original, slow algorithm hanging around. Very well optimised code can be difficult to read and is often impossible to debug or change.
Have you looked at the cache files DS creates for figures? They are simply the morph files merged one after another, just as they are written in the original morph files (sans the vertex deltas).
Did they have links to links to links...? Since more than one morph may drive the same ERC link, and since the destination may drive other ERC links in its turn, this isn't a matter of links between separate lists as you appear to be describing but of propagating self-linking within a single list of items.
From experience I have found that almost all relationships that look like a tree with each node visited multiple times, as you are describing, can be flattened into a single pass. For example, you can flatten an octree into a single array, making in-order operations on its nodes very fast and cache coherent (pre or post order if you like, depending on how you construct the structure). In general for optimising things like this you need an acceleration structure, and those almost always involve finding suitable hashes.
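A toy example of that flattening idea, with a generic TreeNode standing in for an octree node:

```cpp
// Walk the tree once, emit nodes into a flat array in pre-order, and do all
// later passes over the array instead of chasing pointers.
#include <vector>

struct TreeNode {
    int value = 0;
    std::vector<TreeNode*> children;
};

void flattenPreOrder(TreeNode* node, std::vector<TreeNode*>& out)
{
    if (!node) return;
    out.push_back(node);                 // visit parent before children (pre-order)
    for (TreeNode* child : node->children)
        flattenPreOrder(child, out);
}

// Later passes become simple, cache-friendly loops:
//   for (TreeNode* n : flat) { ... }
```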
My most recent adventure in data structures was a time ordered sequence where I had to determine everything between two time points. This structure was pretty big (millions of elements) and not in time-order ("almost" in that order but no guarantee something wouldn't come in out of order) so I couldn't use lower or upper bound. What I did instead was a single pass over all the nodes to build an unordered map of vectors of pointers and another vector of unordered maps of arrays of vectors (yes, really). I should probably describe the intention and what the data was but it's not really relevant and would take too long.
Point is, the structure allowed me to turn a very slow operation (low seconds) into a very fast one simply by replacing what was a longish search operation with a couple of dictionary lookups and a couple of array index lookups. I even replaced the string keys with pointers to strings, as I had another structure that held the strings, which I knew would never be moved. "Insert" on the structure during live operation was slow (~10 ms), but inserts didn't happen very often, so that was OK. Lookups on this acceleration structure were unmeasurably fast, i.e. < 1 ms, so it could be used for drawing a 30 fps UI component. The time constraint really concentrates the mind, doesn't it?
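The general shape of that kind of acceleration structure - not my exact one, which was more convoluted - is just "bucket by a coarse key so a range query only touches a few buckets":

```cpp
// Generic illustration, not the structure described above: bucket events by a
// coarse time bin so a time-range query scans only the touched bins instead of
// millions of elements. Assumes non-negative timestamps.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Event { std::int64_t timeUs = 0; /* payload... */ };

class TimeIndex {
public:
    explicit TimeIndex(std::int64_t binUs) : binUs_(binUs) {}

    void insert(const Event* e) { bins_[e->timeUs / binUs_].push_back(e); }

    // Collect everything between two time points.
    std::vector<const Event*> between(std::int64_t fromUs, std::int64_t toUs) const
    {
        std::vector<const Event*> out;
        for (std::int64_t bin = fromUs / binUs_; bin <= toUs / binUs_; ++bin) {
            auto it = bins_.find(bin);
            if (it == bins_.end()) continue;
            for (const Event* e : it->second)
                if (e->timeUs >= fromUs && e->timeUs <= toUs)
                    out.push_back(e);
        }
        return out;
    }

private:
    std::int64_t binUs_;
    std::unordered_map<std::int64_t, std::vector<const Event*>> bins_;
};
```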
Anyway, the point is this: Daz can do it, I'm sure. It's simply a question of investing in it.