Vector/Matrix/Multi-dimensional Column Types #1447

stellarpower · 2025-02-14T12:18:17Z

Describe the problem to be solved

Hi,

tl;dr

How feasible would it be to add - initially, 1D only - multi-valued numerical data as a type for a column? There are times when each row in a table represents and should represent one datapoint, and each column attaches and should attach semantic meaning to one property of that datapoint - but the data themselves can't be expressed as a single scalar value. For two dimensions, adding loads of extra columns is ugly, and restructuring the whole file to get around having two dimensions to play with leads to an ugly solution.

Motivation

Brand new to Grist; I was looking for a self-hosted tool to knock up a minimal low-code database dashboard thingamajiggy to track some files relating to some experiments and share them with external users, to make it easier to search for files we want.

However, I work for an AI company, and I decided to try Grist because out of the million options for this sort of thing I could find, the emphasis on starting on spreadsheets and moving upwards to make them more formal seemed to make more sense than applications focussed on starting with app development/relational databases and moving down to make them easier. Most of my colleagues are very used to a workflow involving CSV/JSON, pandas dataframes for typed tabular data, etc., so Grist seemed a more obvious candidate to roadtest.

I had a few questions about custom datatypes, which probably belong more on Discord or the discussions tab. However, one thing that came to mind, and might be harder to implement, was the possibility of vectors, or perhaps even higher dimensions, for numeric datatypes.

These are very important for any kind of machine learning or scientific analysis. Two dimensions (columns for semantic meaning, rows for repeated records) are often easier and make sense for those specific data, but times do arise where one needs more dimensions in the data. If using CSV, you now have to resort to the filesystem, e.g. one file per matrix, one folder per record. I know vector databases have been growing in importance as managing input data, weights, and datasets for a training pipeline becomes more commonplace. Whilst something that large probably doesn't have much of a use case for a UI, my colleagues and I are often dealing with relatively small datasets, writing and training models from scratch. I am also thinking that there may be an academic use case here too, for managing data amongst students, researchers, and online tutorial examples (and hooray - Grist static could be easily embedded in readthedocs.io etc.). Anyone who has had to run a random Jupyter notebook in the cloud, only to find that the data pulled from a URL have since changed format, arbitrary integer indices into a big table no longer work, and, this being Python with no static typing, the whole thing ends up entirely broken without giving much of a clue as to why, hopefully knows how extremely annoying all this is.

I get that Grist's main target userbase is probably a lot more business-oriented, so maybe this seems out-of-scope. However, my own personal user story is that there are several occasions where navigating an arbitrarily-structured folder tree and opening flat files one-by-one becomes rather a pain in the arse, and whilst the users may be very different, after years programming I can't help but feel the scientific community suffers from exactly the same problem - an historical lack of software options somewhere in the middle between a bunch of CSV files with a given filename and an onerous complex system for managing data crafted by someone over a long time. People either use LibreOffice to view data, or write something in code by hand, when I wish I had a decent dataflow-graphing sort of solution for running pipelines where I can drag and drop links between nodes and see the intermediate results easily.

Anyway, I'm personally investigating options for gently encouraging my colleagues to think about organising their data just a little bit more, especially if this helps us collaborate without having to ping things back and forth between different machines, and Grist seems to carry less mental overhead for them than some other options I saw. So this is perhaps a different user perspective that might be helpful. I know "MLOps" is now a job title, so I expect it's a market that will probably grow over time, and thought I'd ramble a bit in case that's helpful.

Good support for arbitrary dimensions in numerical datatypes within the value for a particular cell would be a definite plus for any piece of software I eventually decide to host on a more permanent basis.

Cheers

Describe the solution you would like

Potential implementation Issues

I'm not really big on databases, but I assume the only SQLite type that would suit would be a binary blob. This is more flexible, but has the downside of more weight on Grist itself, and the schema for its files, as to how to handle this. It might risk tying the format more closely with a particular version.
What limits should be placed in the UI for correctly restricting the data in these arrays, and how would this be managed in the underlying backing file? I'm a big fan of static and strong typing. If my data should only ever be 1024*768 or have three dimensions then it's the job of my compiler/the tools I use to stop me from inputting anything else. At the end of the day it seems to me that's one of the things Grist tries to apply that spreadsheets do not do. But there are also many valid use cases where an axis, or even the number of axes, is dynamic. Equally, the numerical format (single, double, float16, signed, unsigned integers) is something that would be important, and correctly handling NaNs, infinities, etc. is also important for numerical work.
IDK how much Python is integrated into Grist at its core, but there's obviously a lot of stuff already out there for this. I guess the question is the extent to which this might be leveraged for handling, as well as viewing, multidimensional arrays. As an example, the VS Code debugger has Electron stuff in it for conveniently displaying the contents of an array in a web UI, so it may be possible to integrate third-party packages and not need to re-invent the wheel for some operations.
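
To make the blob and typing points above concrete, here is a minimal sketch using only the Python standard library. It is not Grist's storage format: the header layout, the `samples` table, the `embedding` column, and the helper names `pack_vector`/`unpack_vector` are all assumptions made up for illustration. It shows a 1D vector stored in an SQLite BLOB with an explicit element type and length, an optional fixed-length check, and NaN/infinity surviving the round trip.

```python
import math
import sqlite3
import struct

# Hypothetical header: 1-byte element type code + 4-byte little-endian length.
HEADER = struct.Struct("<cI")
ELEMENT_CODES = {"d", "f", "i"}  # struct codes: float64, float32, int32


def pack_vector(values, code="d", expected_len=None):
    """Serialise a 1D sequence of numbers, optionally enforcing a fixed length."""
    if code not in ELEMENT_CODES:
        raise ValueError(f"unsupported element type: {code!r}")
    if expected_len is not None and len(values) != expected_len:
        raise ValueError(f"expected {expected_len} values, got {len(values)}")
    header = HEADER.pack(code.encode("ascii"), len(values))
    return header + struct.pack(f"<{len(values)}{code}", *values)


def unpack_vector(blob):
    """Read the header back, then decode exactly that many elements."""
    code_byte, count = HEADER.unpack_from(blob)
    code = code_byte.decode("ascii")
    return list(struct.unpack_from(f"<{count}{code}", blob, HEADER.size))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, embedding BLOB)")
conn.execute(
    "INSERT INTO samples (embedding) VALUES (?)",
    (pack_vector([1.5, math.nan, math.inf], expected_len=3),),
)
(blob,) = conn.execute("SELECT embedding FROM samples").fetchone()
restored = unpack_vector(blob)
```

Because the element type and length travel inside the blob, a reader does not need outside schema knowledge to decode it, but the file format then depends on this header staying stable, which is the version-coupling risk mentioned above.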

dsagal · 2025-02-16T18:03:47Z

Firstly, this is some great rambling, thank you! It is indeed helpful.

...the emphasis on starting on spreadsheets and moving upwards to make them more formal seemed to make more sense than applications focussed on starting with app development/relational databases and moving down to make them easier.

This is a really good way of explaining one of the key benefits of spreadsheets. (And which we try to keep with Grist.)

Overall, you are right that the reason Grist is lacking in features related to data science is that it has indeed targeted a more business-oriented userbase -- it stands out far more when compared to Excel used as a database, than it does over data science tools. But there have always been wishes for better features related to data science. For instance, what's actually blocking this issue on supporting data science libraries is the lack of a good story in Grist on working with tabular inputs/outputs (as opposed to just scalars). This also hampers some other features (like this issue about scheduling things like fetches), which will often deal with tabular values.

Your write-up is a good starting point both for motivating further work, and identifying some needs and challenges. Let me ask you some follow-up questions -- I'd be interested in your take on them. Let's imagine that Grist does get a data type that stores 1D numerical data in a cell. What features of Grist would be important in this case?

To be able to see such data as a series of individual cells (e.g. to copy-paste from)? What would UI be like?
To be able to edit individual values, or copy-paste into them?
To apply a formula to the full array? Or to any member of the array?
Producing such values with formulas? And what exactly would such formulas look like?
To read/write such arrays via API? Or to read/write individual values in such arrays via APIs?
Since the set of such values across multiple rows makes a matrix, is it important to be able to produce a transposed version of it?
Should other assorted features be applicable to values in such arrays (like number formatting, conditional formatting, sorting/filtering)?
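
As a rough illustration of three of the questions above (a formula over the full array, a formula over each member, and transposition), here is a plain-Python sketch. It does not use any Grist API; the `rows` dictionary and the row names are hypothetical stand-ins for a column whose cells each hold a 1D array.

```python
# Hypothetical column: one 1D array per row, keyed by a made-up row label.
rows = {
    "run_a": [3, 6, 9],
    "run_b": [2, 4, 6],
}

# Formula over the full array: yields one scalar per row.
row_means = {name: sum(vec) / len(vec) for name, vec in rows.items()}

# Formula over each member: yields a new array of the same length per row.
doubled = {name: [x * 2 for x in vec] for name, vec in rows.items()}

# The arrays across rows form a matrix; zip(*...) transposes it,
# giving one list per array position instead of one per row.
columns = [list(col) for col in zip(*rows.values())]
```

Note that the transpose only makes sense when every row's array has the same length; `zip` silently truncates to the shortest one, so a real implementation would presumably need to decide how to treat ragged data.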

One very raw idea here is to introduce a concept of a stand-alone "data frame" in Grist (like a pandas data frame), plus a column type that's a "window" into it. Perhaps several different tables could have columns that are windows into the same data frame.
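
One possible reading of the "window" idea, sketched in plain Python under heavy assumptions: `DataFrameStore` and `WindowCell` are invented names, not existing Grist objects, and the store is just a list of lists. The point it illustrates is that two cells in different tables could refer to the same underlying row, so an edit made through one is visible through the other.

```python
class DataFrameStore:
    """Hypothetical stand-alone 2D store shared by several tables."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]


class WindowCell:
    """Hypothetical cell value exposing one row of a shared store."""

    def __init__(self, store, row_index):
        self._store = store
        self._row_index = row_index

    def values(self):
        # Return a copy so callers cannot bypass set().
        return list(self._store.rows[self._row_index])

    def set(self, position, value):
        self._store.rows[self._row_index][position] = value


store = DataFrameStore([[1.0, 2.0], [3.0, 4.0]])
experiments_cell = WindowCell(store, 0)  # cell in one table
reports_cell = WindowCell(store, 0)      # cell in another table, same row

experiments_cell.set(1, 20.0)
shared_view = reports_cell.values()
```

After the edit, `shared_view` reflects the change made through `experiments_cell`, while row 1 of the store is untouched.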

Also, if you have references to existing tools that do a great job with any of this, we should learn from those.
