
perf, memory: Improve performance and memory use for large datasets #5927

Open: wants to merge 7 commits into base: main
Conversation

mleibman-db

@mleibman-db mleibman-db commented Feb 21, 2025

This PR moves all duplicate row instance methods to row proto, drastically reducing memory use for large datasets.
Fixes issue #5926.

For a 50,000×5 table, this reduces table memory use from 136 MB to 4.8 MB (28x).
The initialization time (accessRows()) is also reduced from 132 ms to 18 ms (7x).

Before / After / Deltas: memory profiler screenshots (images not included).
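For readers unfamiliar with the technique, here is a minimal sketch of the prototype approach (simplified, with hypothetical names; not the actual TanStack Table implementation). Shared methods live on one prototype object created once per table, so tens of thousands of rows each hold only their own data fields instead of their own copies of every method:

```typescript
type RowData = Record<string, unknown>;

interface Row {
  original: RowData;
  _valuesCache: RowData;
  getValue(columnId: string): unknown;
}

// Created once per table, not once per row: every row shares these methods.
const rowProto = {
  getValue(this: Row, columnId: string): unknown {
    if (!(columnId in this._valuesCache)) {
      this._valuesCache[columnId] = this.original[columnId];
    }
    return this._valuesCache[columnId];
  },
};

function createRow(original: RowData): Row {
  // Object.create links the row to the shared prototype; only the
  // per-instance data fields are assigned on the row itself.
  const row = Object.create(rowProto) as Row;
  row.original = original;
  row._valuesCache = {};
  return row;
}

const a = createRow({ name: "Ada" });
const b = createRow({ name: "Grace" });
console.log(a.getValue("name")); // "Ada"
console.log(a.getValue === b.getValue); // true: one shared function object
```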

@mleibman-db mleibman-db changed the title perf, memory: Improve performance and memory use for large datasets (#5926) perf, memory: Improve performance and memory use for large datasets Feb 21, 2025
@mleibman-db mleibman-db changed the title perf, memory: Improve performance and memory use for large datasets perf, memory: Improve performance and memory use for large datasets (WIP) Feb 21, 2025
@KevinVandy
Member

This idea is awesome, and it's something that I've been thinking about for a long time.

However, it's a bigger change, and I would personally like to see this PR targeted at the alpha branch so we can test the pattern there and take the idea even further.

V9 probably won't be stable for a few more months, but I hope that you still would be able to help implement this there.

We should follow this same pattern not only for rows, but also for cells, headers, and header groups. The table object is the only one where I don't see this as necessary, since only one table is created.

What do you think?

@mleibman-db
Author

mleibman-db commented Feb 21, 2025

@KevinVandy Thank you for the quick response!

Is there any way we can apply this fix to V8 (pretty please)?

This is a bit of a pressing issue for us. We use the table in Notebooks in our product (Databricks), where it's not unusual to have dozens of tables on a page with up to 10k rows each, which results in the page taking up multiple gigabytes of memory, with over 70% of it coming from the table. We have some ideas for workarounds in our code, but they are somewhat brittle/hacky, and I would much prefer to contribute to improving the table itself.

We could increase the scope of this change and apply the pattern more broadly (I saw a lot of areas where we could improve scalability wrt efficiency), but my goal here was to make a minimally invasive fix w/o changing the API that gets us most of the way there. As is, this PR reduces table memory use by ~28x, which is more than enough to address our current needs.

In terms of applying the same pattern to headers and header groups: yes, that would have a similar impact on tables with a very large number of columns, but that is a lot less common than tables with lots of rows. We're talking about tens of thousands of columns here. It does happen, but it's something that can easily be addressed in a separate PR.

Applying this to cells is not as effective, however, since cells are created on demand as they are rendered. Unlike rows, they don't use up memory unless you render (or, with virtualization, scroll through) tens of thousands of cells.

@mleibman-db mleibman-db changed the title perf, memory: Improve performance and memory use for large datasets (WIP) perf, memory: Improve performance and memory use for large datasets Feb 21, 2025
@KevinVandy
Member

I'm definitely still open to merging this to v8; I'll just need to do some extensive regression testing.


nx-cloud bot commented Feb 21, 2025

View your CI Pipeline Execution ↗ for commit a964f41.

Command | Status | Duration | Result
nx affected --targets=test:format,test:sherif,t... | ✅ Succeeded | 1m 51s | View ↗
nx run-many --targets=build --exclude=examples/** | ✅ Succeeded | 34s | View ↗

☁️ Nx Cloud last updated this comment at 2025-02-21 20:03:55 UTC


pkg-pr-new bot commented Feb 21, 2025

Open in Stackblitz


@tanstack/angular-table

npm i https://pkg.pr.new/@tanstack/angular-table@5927

@tanstack/lit-table

npm i https://pkg.pr.new/@tanstack/lit-table@5927

@tanstack/match-sorter-utils

npm i https://pkg.pr.new/@tanstack/match-sorter-utils@5927

@tanstack/qwik-table

npm i https://pkg.pr.new/@tanstack/qwik-table@5927

@tanstack/react-table

npm i https://pkg.pr.new/@tanstack/react-table@5927

@tanstack/react-table-devtools

npm i https://pkg.pr.new/@tanstack/react-table-devtools@5927

@tanstack/solid-table

npm i https://pkg.pr.new/@tanstack/solid-table@5927

@tanstack/svelte-table

npm i https://pkg.pr.new/@tanstack/svelte-table@5927

@tanstack/table-core

npm i https://pkg.pr.new/@tanstack/table-core@5927

@tanstack/vue-table

npm i https://pkg.pr.new/@tanstack/vue-table@5927

commit: a964f41

@KevinVandy
Member

@mleibman-db You can install npm i https://pkg.pr.new/@tanstack/react-table@5927 right now to try out the preview NPM version in your code.

For the alpha branch, it would need a redo instead of merging up. So your help would be appreciated there if you have time.

And yes, I forgot to include column objects in my original feedback. Those would be the second most important. It's interesting that cells don't have this problem as much, but that makes sense.

@mleibman-db
Author

@KevinVandy I don't do much OSS development, so I'm going to need some help / hand-holding here :) Do I need to do a separate PR to apply the changes to the alpha branch? Is that instead of this one, or in addition to? Not sure what I need to do here.

Re: doing the same thing for column objects. As I mentioned, in most cases it wouldn't be as impactful, since tables with tens of thousands of columns are much rarer than tables with lots of rows, but I'm happy to make that change as well. I'd probably do that in a separate PR, though, to limit the scope.

@KevinVandy
Member

The scope of this PR against the main v8 branch is fine.

However, the alpha v9 branch has been heavily refactored with new approaches to assigning APIs to these objects. In the v9 alpha branch, I'd hope to find an approach that follows this new strategy for everything as much as possible.

@mleibman-db
Author

Ok, so IIUIC, I'll leave this PR as-is to proceed with code review, testing, and inclusion in v8, and will look at v9 to see how things are different there and what I need to do to re-apply them there.

@KevinVandy
Member

Ok, so IIUIC, I'll leave this PR as-is to proceed with code review, testing, and inclusion in v8, and will look at v9 to see how things are different there and what I need to do to re-apply them there.

That would be awesome. I realize the follow-up for the v9 alpha work is a big extra ask, but hopefully a fun and interesting way for you to help us out. It will need a slightly different approach.

One of the main goals of v9 is to strip the bundle sizes (and memory usage) of table instances down to just the features that apps are actually using. This PR is very much on theme for that.

@mleibman-db
Author

Will do!

original: undefined as TData,
subRows: [],
_valuesCache: {},
_uniqueValuesCache: {},
Member

Just wondering if we might get in trouble by storing these values on the proto? The wisdom I heard last time I worked on stuff using the prototypes was to only store functions on the prototype

Author

With the exception of subRows, they are all unused (never referenced on the proto) and are only here for TypeScript type checking.

The advice you heard is most likely referring to potential confusion if one ends up modifying a property on the proto instead of an instance, which results in the prop changing in all instances. This is not happening here for subRows since it is never modified directly and is only reassigned, which sets the value on the instance.
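A small sketch of the distinction being made (hypothetical shape, not the actual row object): mutating a prototype-held array leaks into every instance, while reassignment creates an own property on just that instance and leaves the prototype untouched.

```typescript
// Hypothetical minimal row shape: subRows lives on the shared prototype.
const proto = { subRows: [] as number[] };
const row1 = Object.create(proto);
const row2 = Object.create(proto);

row1.subRows.push(1); // mutation: changes the shared array on the proto
console.log(row2.subRows); // [1]  <- the change is visible on every row

row1.subRows = [2]; // reassignment: own property on row1 only
console.log(row2.subRows); // still [1], the shared proto array
console.log(Object.prototype.hasOwnProperty.call(row1, "subRows")); // true
console.log(Object.prototype.hasOwnProperty.call(row2, "subRows")); // false
```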

@@ -411,6 +403,23 @@ export const ColumnFiltering: TableFeature = {

return table._getFilteredRowModel()
}

Object.assign(getRowProto(table), {
Member

At one point, we had removed most usages of Object.assign in favor of direct assignment as a performance improvement at scale. Wonder if that's still applicable to consider here.

Author

It wouldn't be an issue here since it's only called once per table anyway. Your question would apply more to createRow() in row.ts, since we call it once per row there, but AFAIK there are no known performance issues around Object.assign(). There were some many years ago, when it was newly introduced and browser support was fresh (plus there were polyfills), but that hasn't been the case in quite some time.

Member

@KevinVandy beat me to it. I like the idea, but I'm not a big fan of typing the prototype as CoreRow, which is not strictly accurate (and requires us to create these dummy values to keep TypeScript happy).

@mleibman-db did you try making the createRow function into a constructor function, adding the methods directly to the prototype? I haven't tried it myself but intuitively it feels like it should work. Would need to always call createRow with the new keyword I think.

Author
The reason will be displayed to describe this comment to others. Learn more.

1. Typing the row proto as CoreRow is actually very useful, since it provides type safety and makes sure the methods only access props defined there. The use of default unused values doesn't strike me as concerning, but we could try to replace them with purely TypeScript-level type annotations, though IMHO that would be more hacky.

2. I'm not sure I understand what you're proposing. Could you elaborate?

Member

since it provides type safety

It's the wrong type though, isn't it? The prototype shouldn't have the instance properties on it.

Could you elaborate?

I am imagining something approximately like the below. I haven't tried but think it should work, happy to be corrected. The naming would be a bit weird though. createRow should probably become just Row, but that would be a breaking change - not sure what to do about that.

const createRow = <TData>(
  this: CoreRow<TData>,
  table: Table<TData>,
  id: string,
  original: TData,
  rowIndex: number,
  depth: number,
  subRows?: Row<TData>[],
  parentId?: string
) => {
  this.id = id
  this.original = original
  // etc.
}

createRow.prototype.getValue = (columnId: string) => {
  // ...
  return this._valuesCache[columnId] as any
}

elsewhere:

const row = new createRow(...)

Member

Anywhere where we are thinking that an alternative would be cleaner, but it's a breaking change, can be reserved for a v9 pr. So far this PR looks mostly good. We don't have to assign dummy vars to the prototype just to satisfy TypeScript. A cast could be acceptable there.

If the Object.assign only gets called once, that is negligible and something we don't need to worry about. Direct assignment was a performance improvement in this pr that sped up rendering when creating 10k+ rows. This PR is solving the memory side of that same issue. In conclusion, I'm not worried about this after you explained more.

Member

@tombuntus tombuntus Feb 21, 2025

so you can't just create the proto once at the module level

But I think you can merge the feature.createRow prototypes into the prototype of the object returned by the core createRow function at runtime, when new createRow() is called, in the same loop where we currently call feature.createRow in the core createRow() function body. I haven't tested this, though. In this case, the prototype's methods would be created at module level on each of the features' createRow functions.

vastly preferring classes

Personally I am not opposed to using a class if it makes typing easier.

Member

(and just for anyone reading this ... the code snippet in this comment should be using function createRow() {}, not an arrow function!)
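For context, a minimal sketch of what the function-declaration version might look like (a hypothetical simplification of the snippet above, not the actual implementation; the cast at the call site is one way to satisfy TypeScript, which doesn't allow `new` on a plain function type):

```typescript
interface CoreRow {
  id: string;
  original: Record<string, unknown>;
  _valuesCache: Record<string, unknown>;
  getValue(columnId: string): unknown;
}

// A function declaration (not an arrow function), so `this` and `new`
// behave as intended.
function createRow(this: CoreRow, id: string, original: Record<string, unknown>) {
  this.id = id;
  this.original = original;
  this._valuesCache = {};
}

// Regular function expression here too, so `this` is the row instance.
createRow.prototype.getValue = function (this: CoreRow, columnId: string) {
  if (!(columnId in this._valuesCache)) {
    this._valuesCache[columnId] = this.original[columnId];
  }
  return this._valuesCache[columnId];
};

// TypeScript won't `new` a plain function type, so assert a constructor type.
const RowCtor = createRow as unknown as new (
  id: string,
  original: Record<string, unknown>
) => CoreRow;

const row = new RowCtor("0", { name: "Ada" });
console.log(row.getValue("name")); // "Ada"
```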

Member

But happy to change if you feel strongly about it

I was actually agreeing with you that since it's not called many times, it wouldn't be likely to cause issues. I was just trying to explain the likely cause of perf issues: not Object.assign() itself, but rather the fact that it is often called like this:

Object.assign(
  targetObject, // <-- existing object
  {
    // new source object which will be garbage collected eventually
  },
)

If it's used this way in a loop with many thousands of iterations, you can run into perf issues due to garbage collection.
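A sketch of the allocation difference being described (hypothetical shapes, not the library's code): both loops produce identical results, but the Object.assign version allocates one throwaway source object per iteration for the garbage collector to clean up.

```typescript
interface Point { x: number; y: number }

function buildWithAssign(n: number): Point[] {
  const out: Point[] = [];
  for (let i = 0; i < n; i++) {
    const p = {} as Point;
    // Allocates a temporary source object every iteration, then discards it.
    Object.assign(p, { x: i, y: i * 2 });
    out.push(p);
  }
  return out;
}

function buildDirect(n: number): Point[] {
  const out: Point[] = [];
  for (let i = 0; i < n; i++) {
    const p = {} as Point;
    // Same result, but no temporary object to garbage-collect.
    p.x = i;
    p.y = i * 2;
    out.push(p);
  }
  return out;
}

// Both produce [{x:0,y:0},{x:1,y:2},{x:2,y:4}] for n = 3.
console.log(buildDirect(3));
```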

Author

We don't have to assign dummy vars to the prototype just to satisfy TypeScript. A cast could be acceptable there.

Done.
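For illustration, a minimal sketch of the cast approach (hypothetical and simplified, not the actual change): only the shared methods are placed on the prototype, and a type assertion gives it the full row shape without materializing dummy instance fields.

```typescript
interface CoreRow {
  id: string;
  getValue(columnId: string): unknown;
}

// No dummy `id` value on the proto; the cast asserts the CoreRow shape.
const rowProto = {
  getValue(this: CoreRow, columnId: string): unknown {
    return columnId === "id" ? this.id : undefined;
  },
} as CoreRow;

const row = Object.create(rowProto) as CoreRow;
row.id = "r1"; // instance data lives on the row itself
console.log(row.getValue("id")); // "r1"
```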

@@ -362,14 +362,6 @@ export const ColumnFiltering: TableFeature = {
}
},

createRow: <TData extends RowData>(
Member

In the core createRow function, we still call these feature.createRow functions if they exist, passing them the row and table instance. That should prevent breaking changes for existing custom features, but we may want to recommend custom features to take the same approach (i.e. extend the prototype). @KevinVandy what do you think about this?

I haven't thought all the details through, but something like this: retain a createRow function in each feature, and in the core createRow function both call each feature.createRow with the row and table instances (to prevent breaking changes for existing custom features) and merge its prototype onto the core createRow prototype.

That way we could also retain the createRow functions in the core features, (just move the methods onto the prototype), and wouldn't need the getRowProto and Object.assign() approach I think.
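A rough sketch of the merging idea (names like rowProtoMethods and buildRowProto are assumptions for illustration, not the actual TanStack Table API): each feature exposes a bag of row methods, and table setup merges them all onto one per-table row prototype.

```typescript
type FeatureRowMethods = { [name: string]: (...args: unknown[]) => unknown };

interface TableFeature {
  rowProtoMethods?: FeatureRowMethods;
}

// One prototype per table, built once from all enabled features.
function buildRowProto(features: TableFeature[]): object {
  const proto = {};
  for (const feature of features) {
    if (feature.rowProtoMethods) {
      // Merging happens once per table, not once per row.
      Object.assign(proto, feature.rowProtoMethods);
    }
  }
  return proto;
}

const filtering: TableFeature = {
  rowProtoMethods: {
    getIsFiltered() { return false; },
  },
};

const proto = buildRowProto([filtering]);
const row = Object.create(proto) as { getIsFiltered(): boolean };
console.log(row.getIsFiltered()); // false
```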

Author

+1 to generally recommending that people use the same approach for implementing custom features. I considered making things more explicit by adding methods like initRowProto() to the TableFeature interface, but decided against it for simplicity's sake; plus, this is more of an internal implementation detail than a public API.

Member

This kind of pattern will be useful to think about in the alpha branch though

3 participants