perf, memory: Improve performance and memory use for large datasets #5927
base: main
Conversation
This idea is awesome, and it's something that I've been thinking about for a long time. However, it's a bigger change and I would personally like to see this PR targeted at alpha so we can test the pattern there and even take this idea further. V9 probably won't be stable for a few more months, but I hope that you still would be able to help implement this there. We should follow this same pattern for not only rows, but also cells, headers, and header groups. The table object is the only one where I don't see this as necessary since there is only 1 table created. What do you think?
@KevinVandy Thank you for the quick response! Is there any way we can apply this fix to V8 (pretty please)? This is a bit of a pressing issue for us. We use the table in Notebooks in our product (Databricks), where it's not unusual to have dozens of tables on a page with up to 10k rows each, which results in the page taking up multiple gigabytes of memory, with over 70% of it coming from the table. We have some ideas for workarounds in our code, but they are somewhat brittle/hacky, and I would much prefer to contribute to improving the table itself.

We could increase the scope of this change and apply the pattern more broadly (I saw a lot of areas where we could improve scalability wrt efficiency), but my goal here was to make a minimally invasive fix w/o changing the API that gets us most of the way there. As is, this PR reduces table memory use by ~28x, which is more than enough to address our current needs.

In terms of applying the same pattern to headers and header groups, yes, that would have a similar impact on tables with a very large number of columns, but that is a lot less common than tables with lots of rows. We're talking about tens of thousands of columns here. It does happen, but it's something that can easily be added in a separate PR. Applying this to cells is not as effective, however, since cells are created on demand as they are rendered, so unlike rows, they don't use up memory unless you render (or scroll through, in the case of virtualization) tens of thousands of cells.
I'm definitely still open to merging this to v8; I'll just need to do some extensive regression testing.
@mleibman-db You can install
For the alpha branch, it would need a redo instead of merging up, so your help would be appreciated there if you have time. And yes, I forgot to include
@KevinVandy I don't do much OSS development, so I'm going to need some help / hand-holding here :) Do I need to open a separate PR to apply the changes to the alpha branch? Is that instead of this one, or in addition to it? Not sure what I need to do here. Re: doing the same thing for
The scope of this PR against the main v8 branch is fine/good. However, the alpha v9 branch has been heavily refactored with new approaches to assigning APIs to these objects. In the v9 alpha branch, I'd hope to find an approach that follows this new strategy for everything as much as possible.
Ok, so IIUIC, I'll leave this PR as-is to proceed with code review, testing, and inclusion in v8, and will look at v9 to see how things are different there and what I need to do to re-apply the changes there.
That would be awesome. I realize the follow-up for the v9 alpha work is a big extra ask, but hopefully it's a fun and interesting way for you to help us out. It will need a slightly different approach. One of the main goals of v9 is to strip the bundle size (and memory usage) of table instances down to just the features that apps are actually using. This PR is very much on theme for that.
Will do!
packages/table-core/src/core/row.ts
    original: undefined as TData,
    subRows: [],
    _valuesCache: {},
    _uniqueValuesCache: {},
Just wondering if we might get in trouble by storing these values on the proto? The wisdom I heard last time I worked on stuff using prototypes was to only store functions on the prototype.
With the exception of `subRows`, they are all unused (never referenced on the proto) and are only here for TypeScript type checking.

The advice you heard is most likely referring to the potential confusion if one ends up modifying a property on the proto instead of an instance, which results in the prop changing in all instances. This is not happening here for `subRows`, since it is never modified directly and is only reassigned, which sets the value on the instance.
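To illustrate the distinction, here is a standalone sketch (not code from the PR): reassigning a property on a row creates an own property that shadows the prototype value, while mutating a shared object held on the prototype would leak into every row.

```ts
// Minimal sketch (not PR code) of the shadowing behavior described above.
const rowProto = { subRows: [] as number[] }

const rowA = Object.create(rowProto)
const rowB = Object.create(rowProto)

// Reassignment writes an own property on rowA; the prototype is untouched.
rowA.subRows = [1, 2, 3]
console.log(rowB.subRows) // [] (still reads the prototype's value)

// Mutating in place would be the risky case: every row without its own
// subRows property would see the change.
rowB.subRows.push(99)
console.log(rowA.subRows)                    // [1, 2, 3] (own property, unaffected)
console.log(Object.create(rowProto).subRows) // [99] (new rows read the mutated proto array)
```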
@@ -411,6 +403,23 @@ export const ColumnFiltering: TableFeature = {

        return table._getFilteredRowModel()
      }

      Object.assign(getRowProto(table), {
At one point, we had removed most usages of `Object.assign` in favor of direct assignment as a performance improvement at scale. Wonder if that's still applicable to consider here.
It wouldn't be an issue here since it's only called once per table anyway. Your question would apply more to `createRow()` in `row.ts`, since we call it once per row there, but AFAIK there are no known performance issues around `Object.assign()`. There were some many years ago, when it was just introduced and browser support was fresh (plus there were polyfills), but that hasn't been the case in quite some time.
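For readers following along, here is a rough sketch of the once-per-table setup being discussed. `getRowProto` is the helper named in the diff, but its body below is guessed, and `getCanFilter` is a made-up stand-in for a feature method.

```ts
// Rough sketch only; getRowProto is named in the diff, but this body is guessed.
const rowProtos = new WeakMap<object, Record<string, unknown>>()

function getRowProto(table: object): Record<string, unknown> {
  let proto = rowProtos.get(table)
  if (!proto) {
    proto = {}
    rowProtos.set(table, proto)
  }
  return proto
}

const table = {} // stand-in for a table instance

// Feature setup: runs once per table, so a single Object.assign call is negligible.
Object.assign(getRowProto(table), {
  getCanFilter() {
    return true // placeholder body for a shared row method
  },
})

// Row creation: runs once per row, but only links each row to the shared proto.
const row = Object.create(getRowProto(table))
console.log(row.getCanFilter()) // true
```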
@KevinVandy beat me to it. I like the idea, but I'm not a big fan of typing the prototype as `CoreRow`, which is not strictly accurate (and requires us to create these dummy values to keep TypeScript happy).

@mleibman-db did you try making the `createRow` function into a constructor function, adding the methods directly to the prototype? I haven't tried it myself, but intuitively it feels like it should work. We would need to always call `createRow` with the `new` keyword, I think.
- Typing the row proto as `CoreRow` is actually very useful since it provides type safety and makes sure the methods only access props defined there. The use of default unused values there doesn't strike me as concerning, but we could try to replace them with purely TypeScript type annotations, though IMHO that would be more hacky.
- I'm not sure I understand what you're proposing. Could you elaborate?
> since it provides type safety

It's the wrong type though, isn't it? The prototype shouldn't have the instance properties on it.

> Could you elaborate?

I am imagining something approximately like the below. I haven't tried it, but I think it should work; happy to be corrected. The naming would be a bit weird though: `createRow` should probably become just `Row`, but that would be a breaking change, and I'm not sure what to do about that.
const createRow = <TData>(
  this: CoreRow<TData>,
  table: Table<TData>,
  id: string,
  original: TData,
  rowIndex: number,
  depth: number,
  subRows?: Row<TData>[],
  parentId?: string
) => {
  this.id = id
  this.original = original
  // etc.
}

createRow.prototype.getValue = (columnId: string) => {
  // ...
  return this._valuesCache[columnId] as any
}

elsewhere:

const row = new createRow(...)
Anywhere we think an alternative would be cleaner but it would be a breaking change can be reserved for a v9 PR. So far this PR looks mostly good. We don't have to assign dummy vars to the prototype just to satisfy TypeScript; a cast could be acceptable there.

If the `Object.assign` only gets called once, that is negligible and not something we need to worry about. Direct assignment was a performance improvement in a previous PR that sped up rendering when creating 10k+ rows; this PR is solving the memory side of that same issue. In conclusion, I'm not worried about this after you explained more.
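A sketch of what that cast could look like, using a trimmed-down stand-in for the row type (the names below are illustrative, not the library's):

```ts
// Illustrative only: CoreRowLike is a trimmed-down stand-in, not the real CoreRow.
interface CoreRowLike<TData> {
  original: TData
  subRows: CoreRowLike<TData>[]
  getValue: (columnId: string) => unknown
}

// Only the shared methods live on the proto; the cast keeps `this` typed as the
// full row so methods can reference instance properties without dummy values.
function createRowProto<TData>(): CoreRowLike<TData> {
  return {
    getValue(this: CoreRowLike<TData>, columnId: string) {
      return (this.original as unknown as Record<string, unknown>)[columnId]
    },
  } as CoreRowLike<TData>
}
```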
> so you can't just create the proto once at the module level

But I think you can merge the `feature.createRow` prototypes into the prototype of the object returned by the core `createRow` function at runtime, when `new createRow()` is called, in the same loop where we currently call `feature.createRow` in the core `createRow()` function body. I haven't tested this though. In this case the prototype's methods would be created at module level on each of the features' `createRow` functions.

> vastly preferring classes

Personally I am not opposed to using a class if it makes typing easier.
(And just for anyone reading this: the code snippet in this comment should be using `function createRow() {}`, not an arrow function!)
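For the record, here is a sketch of roughly what the corrected constructor-function version could look like, with heavily simplified types and made-up names; this is illustrative, not what the PR ships.

```ts
// Illustrative sketch with simplified types; not the PR's implementation.
interface RowLike<TData> {
  id: string
  original: TData
  _valuesCache: Record<string, unknown>
  getValue: (columnId: string) => unknown
}

// A regular function declaration, so `this` binds per instance and `new` works.
function CreateRow<TData>(this: RowLike<TData>, id: string, original: TData) {
  this.id = id
  this.original = original
  this._valuesCache = {}
}

// Shared by every row; must be a `function` (not an arrow) so `this` stays dynamic.
CreateRow.prototype.getValue = function (this: RowLike<unknown>, columnId: string) {
  return this._valuesCache[columnId]
}

// Elsewhere: TypeScript needs a constructor-typed cast before `new` is allowed here.
type RowCtor = new <TData>(id: string, original: TData) => RowLike<TData>
const row = new (CreateRow as unknown as RowCtor)('row-1', { name: 'Ada' })
console.log(row.getValue('name')) // undefined until the cache is populated
```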
> But happy to change if you feel strongly about it

I was actually agreeing with you that since it's not called many times, it wouldn't be likely to cause issues. I was just trying to explain the likely cause of perf issues: not `Object.assign()` itself, but rather the fact that it is often called like this:
Object.assign(
  targetObject, // <-- existing object
  {
    // new source object which will be garbage collected eventually
  },
)
If it's used this way in a loop with many thousands of iterations, you can run into perf issues due to garbage collection.
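To make the contrast concrete, here is a small illustrative comparison (not library code): the `Object.assign` form allocates a throwaway source object on every iteration, while direct assignment does not.

```ts
// Illustrative comparison only, not library code.
type Item = { id: number; label: string }

function withObjectAssign(items: Item[], labels: string[]) {
  items.forEach((item, i) => {
    // A new source object is allocated here on every iteration and has to be
    // garbage collected later.
    Object.assign(item, { label: labels[i] })
  })
}

function withDirectAssignment(items: Item[], labels: string[]) {
  items.forEach((item, i) => {
    // Same result, no intermediate allocation.
    item.label = labels[i]
  })
}
```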
> We don't have to assign dummy vars to the prototype just to satisfy TypeScript. A cast could be acceptable there.

Done.
@@ -362,14 +362,6 @@ export const ColumnFiltering: TableFeature = {
        }
      },

      createRow: <TData extends RowData>(
In the core `createRow` function, we still call these `feature.createRow` functions if they exist, passing them the row and table instance. That should prevent breaking changes for existing custom features, but we may want to recommend that custom features take the same approach (i.e. extend the prototype). @KevinVandy what do you think about this?

I haven't thought all the details through, but something like this: retain a `createRow` function in each feature, and in the core `createRow` function both call the `feature.createRow` function with the row and table instances (to prevent breaking changes for existing custom features) and merge its prototype onto the core `createRow` prototype.

That way we could also retain the `createRow` functions in the core features (just move the methods onto the prototype), and wouldn't need the `getRowProto` and `Object.assign()` approach, I think.
+1 to generally recommending people use the same approach for implementing custom features. I considered making things more explicit by adding methods like `initRowProto()` to the `TableFeature` interface, but decided against it for simplicity's sake, plus this is more of an internal implementation detail than a public API.
This kind of pattern will be useful to think about in the alpha branch, though.
This PR moves all duplicate row instance methods to the row proto, drastically reducing memory use for large datasets.
Fixes issue #5926.
For a table of 50,000×5, this reduces table memory use from 136 MB to 4.8 MB (28x).
The initialization time (`accessRows()`) is also reduced from 132 ms to 18 ms (7x).

Before

After

Deltas
