Vector/Matrix/Multi-dimensional Column Types #1447
Comments
Firstly, this is some great rambling, thank you! It is indeed helpful.
This is a really good way of explaining one of the key benefits of spreadsheets. (And one which we try to keep with Grist.) Overall, you are right that the reason Grist is lacking in features related to data science is that it has indeed targeted a more business-oriented userbase -- it stands out far more when compared to Excel used as a database than it does when compared to data science tools. But there have always been wishes for better features related to data science. For instance, what's actually blocking this issue on supporting data science libraries is the lack of a good story in Grist on working with tabular inputs/outputs (as opposed to just scalars). This also hampers some other features (like this issue about scheduling things like fetches), which will often deal with tabular values. Your write-up is a good starting point both for motivating further work and for identifying some needs and challenges. Let me ask you some follow-up questions -- I'd be interested in your take on them. Let's imagine that Grist does get a data type that stores 1D numerical data in a cell. What features of Grist would be important in this case:
One very raw idea here is to introduce a concept of a stand-alone "data frame" in Grist (like a pandas data frame), plus a column type that's a "window" into it. Perhaps several different tables could have columns that are windows into the same data frame. Also, if you have references to existing tools that do a great job with any of this, we should learn from those.
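To make the "window" idea a bit more concrete, here is a rough sketch in plain Python. Nothing in it is a Grist API: `SharedFrame`, `WindowCell`, and the table layout are all hypothetical names made up for illustration, and a real design might attach the window at the column level rather than per cell.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SharedFrame:
    """Hypothetical stand-alone data frame: named columns of numbers."""
    columns: dict[str, list[float]] = field(default_factory=dict)


@dataclass
class WindowCell:
    """Hypothetical cell value: a view onto rows [start, stop) of one frame column."""
    frame: SharedFrame
    column: str
    start: int
    stop: int

    def values(self) -> list[float]:
        # Read through to the shared frame, so edits made there are visible here.
        return self.frame.columns[self.column][self.start:self.stop]


frame = SharedFrame({"loss": [0.9, 0.7, 0.5, 0.4, 0.35, 0.3]})

# Two different "tables" whose cells are windows into the same frame.
runs_table = [{"run": "A", "loss_curve": WindowCell(frame, "loss", 0, 3)}]
summary_table = [{"label": "tail", "loss_curve": WindowCell(frame, "loss", 3, 6)}]

frame.columns["loss"][0] = 1.0  # edit the shared frame once

print(runs_table[0]["loss_curve"].values())     # [1.0, 0.7, 0.5]
print(summary_table[0]["loss_curve"].values())  # [0.4, 0.35, 0.3]
```

The point of the sketch is only that the cell stores a reference plus a range rather than a copy, so several tables can expose the same underlying data.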
Describe the problem to be solved
Hi,
tl;dr
How feasible would it be to add - initially, 1D only - multi-valued numerical data as a type for a column? There are times when each row in a table represents and should represent one datapoint, and each column attaches and should attach semantic meaning to one property of that datapoint - but the data themselves can't be expressed as a single scalar value. For two dimensions, adding loads of extra columns is ugly, and restructuring the whole file to get around having two dimensions to play with leads to an ugly solution.
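As a small illustration of the two layouts being contrasted, here is a plain-Python sketch. The field names (`file`, `emb_0`, `embedding`) and the values are invented for the example and are not taken from any real dataset or from Grist.

```python
# Wide layout: one scalar column per vector element.
wide_row = {"file": "run_01.csv", "emb_0": 0.12, "emb_1": -0.40, "emb_2": 0.88}

# Cell-vector layout: a single column holds the whole 1D value.
vector_row = {"file": "run_01.csv", "embedding": [0.12, -0.40, 0.88]}

# Getting the vector back out of the wide layout relies on a column-naming
# convention and on knowing the length in advance.
recovered = [wide_row[f"emb_{i}"] for i in range(3)]

print(recovered == vector_row["embedding"])  # True
```

With the second layout the column keeps a single semantic meaning ("the embedding of this datapoint"), which is the property the request is after.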
Motivation
Brand new to Grist; I was looking for a self-hosted tool to knock up a minimal low-code database dashboard thingamajiggy to track some files relating to some experiments and share them with external users, to make it easier to search for files we want.
However, I work for an AI company, and I decided to try Grist because out of the million options for this sort of thing I could find, the emphasis on starting on spreadsheets and moving upwards to make them more formal seemed to make more sense than applications focussed on starting with app development/relational databases and moving down to make them easier. Most of my colleagues are very used to a workflow involving CSV/JSON, pandas dataframes for typed tabular data, etc., so Grist seemed a more obvious candidate to roadtest.
I had a few questions about custom datatypes, which probably belong more on discord/discussions tab, however, one thing that came to mind, and might be harder to implement, was the possibility of vectors, or perhaps even higher dimensions, for numeric datatypes.
These are very important for any kind of machine learning or scientific analysis, and whilst often two dimensions of column for semantic meaning, row for repeated records, is easier and makes sense for those specific data, times do arise where one needs more dimensions in the data. If using CSV, now you have to resort to the filesystem, e.g. one file per matrix, one folder per record. I know vector databases have been growing in importance as managing input data, weights, datasets for a training pipeline becomes more commonplace. Whilst something that large probably doesn't have much of a usecase for a UI, my colleagues and I are often dealing with relatively small datasets, writing and training models from scratch. I am also thinking that there may be an academic usecase here too, for managing data amongst students, researchers, and online tutorial examples (and hooray - Grist static could be easily embedded in readthedocs.io etc.). Anyone that's had to run a random jupyter notebook in the cloud and found the data pulled from a URL have since changed format, arbitrary integer indices into a big table no longer work, and, being python, with no static typing, the whole thing ends up entirely broken without giving much of a clue as to why, hopefully knows how extremely annoying all this is.
I get that Grist's main target userbase is probably a lot more business-oriented, so maybe this seems out-of-scope, however, my own personal userstory is there are several occasions where navigating an arbitrarily-structured folder tree and opening flat files one-by-one becomes rather a pain in the arse, and whilst the users may be very different, after years programming I can't help but feel the scientific community suffers from exactly the same problem - an historical lack of software options somewhere in the middle between a bunch of CSV files with a given filename and an onerous complex system for managing data crafted by someone over a long time. People either use LibreOffice to view data, or write something in code by hand, when I wish I had a decent dataflow-graphing sort of solution for running pipelines where I can drag and drop links between nodes and see the intermediate results easily.
Anyway, I'm personally investigating options for gently encouraging my colleagues to think about organising their data just a little bit more, especially if this helps us collaborate without having to ping things back and forth between different machines, and Grist seems less mental overhead for them than some other options I saw. So this is perhaps a different user perspective that might be helpful, and I know "MLOps" is now a job title, so I expect it's a market that will probably grow over time, and thought I'd ramble a bit in case that's helpful.
Good support for arbitrary dimensions in numerical datatypes within the value for a particular cell would be a definite plus for any piece of software I eventually decide to host on a more permanent basis.
Cheers
Describe the solution you would like
Potential implementation Issues