5 comments

  • 2ndorderthought 2 hours ago
    Can anyone explain why it is a good idea to make a graph DB in TypeScript? This is not a language-flamewar question, more of an implementation-details question.

    Though TypeScript is pretty fast and the language is flexible, we all know how demanding graph databases are, how hard they are to shard, etc. It seems like this could be a performance trap. Are there successful RDBMS or NoSQL databases out there written in TypeScript?

    Also, why is everything about LLMs now? Can't we discuss technologies at face value anymore? It's getting kind of old to me personally.

    • phpnode 2 hours ago
      I needed it to be possible to run the graph in the browser and in Cloudflare Workers, so TS was a natural fit here. It was built as an experiment in end-to-end type safety - nothing to do with LLMs, but it ended up being useful in the product I'm building. It's not designed for large data sets.
      • rglullis 55 minutes ago
        > It's not designed for large data sets.

        How large is large, here? Tens of thousands of triples? Hundreds? Millions?

        I'm working on a local-first browser extension for ActivityPub, and currently I am parsing the JSON-LD and storing the triples in specialized tables on pglite to be able to make fast queries on that data.

        It would be amazing to ditch the whole thing and just deal with triples based on the expanded JSON-LD, but I wonder how the performance would be. While using the browser extension for a week, the store accumulated ~90k JSON-LD documents, which would probably mean 5 times as many triples. Storage-wise it's okay (~300MB), but I think that a graph database would only be useful to manage "hot data", not a whole archive of user activity.
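
        For reference, turning an expanded JSON-LD node into triples is a short recursive walk. A minimal sketch (illustrative only, not the extension's actual code; blank-node and `@list` handling are omitted):

        ```typescript
        // Walk one expanded JSON-LD node and emit (subject, predicate, object)
        // triples. Expanded form guarantees predicate values are arrays of
        // objects carrying either "@value" (literal) or "@id" (reference).
        type Triple = [subject: string, predicate: string, object: string];

        function extractTriples(node: Record<string, unknown>): Triple[] {
          const triples: Triple[] = [];
          // Fall back to a blank-node label when the node has no "@id".
          const subject = (node["@id"] as string) ?? "_:b0";
          for (const [predicate, values] of Object.entries(node)) {
            if (predicate.startsWith("@")) continue; // skip @id, @type, etc.
            for (const value of values as Array<Record<string, unknown>>) {
              if ("@value" in value) {
                triples.push([subject, predicate, String(value["@value"])]);
              } else if ("@id" in value) {
                triples.push([subject, predicate, value["@id"] as string]);
              }
            }
          }
          return triples;
        }
        ```

        At ~5 triples per document, the ~90k documents above would land around 450k triples, which is small enough that the store choice matters more for query shape than raw size.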

        • phpnode 52 minutes ago
          It depends on the backing store (which is pluggable). I would not want to exceed, let's say, 50 MB in a Y.js doc, but I've tested the in-memory storage with graphs approaching 1 GB and it's been fine - like any graph, it really depends on how you query it. Most of the docs I'm dealing with in production are less than 10 MB, so this is fine for my use cases, but... buyer beware!
      • 2ndorderthought 2 hours ago
        Makes sense, thanks for explaining the use case. The LLM question was only because of the comments at the time of the post.

        The query syntax looks nice by the way.

    • voodooEntity 59 minutes ago
      I kinda feel ya.

      I wrote my own in-memory graph (I'd rather call it storage than a DB) some years ago in Golang, and even there I was wondering whether Golang is actually the optimal technology for something like a database, especially due to garbage collection/stop-the-world pauses/etc. It's just that there are certain levels of optimization I will never be able to properly reach (let's ignore possible hacks). Looking at a solution in TypeScript, no matter how "nice" it looks, this just doesn't seem to be the correct "tool/technology" for the target.

      And inb4: there are use cases for everything, but just as I wouldn't write a website in C, I also wouldn't write a database in JavaScript/TypeScript.

      I just would argue this is the wrong pick.

      @llms: I'm not even getting into this, because if you don't want to read "LLM" you basically can't read 99% of the news nowadays. ¯\_(ツ)_/¯

      edit: I'm a big fan of graph databases, so I'm happy about any public attention they get ^

  • brianbcarter 1 hour ago
    Cypher-over-Gremlin is a smart call - LLMs can already write Cypher, which makes the MCP angle viable in a new way.

    How does Yjs handle schema migrations? If I add a property to a vertex type that existing peers have cached, does it conflict or drop the unknown field?

  • lo1tuma 2 hours ago
    15 years ago I was a big fan of this method-chaining pattern. These days I don't like it anymore. Especially when it comes to unit testing and implementing fake objects, it becomes quite cumbersome to set up the exact same interface.
    • phpnode 1 hour ago
      Unfortunately it's unavoidable if you want to preserve type safety. I did consider parsing Cypher in TypeScript types, but it's not worth the effort and it's not possible to do safely.
      • rounce 43 minutes ago
        Why not with a pipe that returns a function, the type of which is determined by the args of the pipe? That is possible to make typesafe in TS. That way you can have both APIs where the chained version is just wrapping successive pipe calls.
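
        A sketch of that idea (not the library's API; `matchPerson` and `limit` are made-up steps): overloads let TypeScript thread each step's output type into the next step's input type, so the pipe stays fully type-safe, and a chained API could wrap successive `pipe` calls.

        ```typescript
        // A pipe whose return type is inferred from its arguments: each
        // overload requires step N's output type to match step N+1's input.
        function pipe<A, B>(f1: (a: A) => B): (a: A) => B;
        function pipe<A, B, C>(f1: (a: A) => B, f2: (b: B) => C): (a: A) => C;
        function pipe<A, B, C, D>(
          f1: (a: A) => B,
          f2: (b: B) => C,
          f3: (c: C) => D,
        ): (a: A) => D;
        function pipe(...fns: Array<(x: any) => any>) {
          return (input: any) => fns.reduce((acc, fn) => fn(acc), input);
        }

        // Hypothetical query steps over a toy "graph" of node ids.
        const matchPerson = (g: string[]) => g.filter((n) => n.startsWith("person:"));
        const limit = (n: number) => (g: string[]) => g.slice(0, n);

        // query: (g: string[]) => string[], inferred from the pipe's args.
        const query = pipe(matchPerson, limit(2));
        ```

        Mismatched step types (say, piping a `string[] => number` into `matchPerson`) become a compile error rather than a runtime surprise.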
  • cyanydeez 2 hours ago
    Eventually someone will figure out how to use a graph database to let an agent efficiently build & cull context to achieve near-deterministic behavior. It seems like one needs a sufficiently powerful schema and a harness that properly builds the graph of agent knowledge, like how ants naturally figure out where sugar is, notice when that stockpile depletes, and shift to other sources.

    This looks neat, but if you want it to be used for AI purposes, you might want to show a schema more complicated than a Twitter network.

    • plaguuuuuu 20 minutes ago
      I'm pretty sure gastown (the Beads part) stores tasks/memories/whatever in a DAG, but I haven't looked into it in detail.
    • j-pb 1 hour ago
      Working on exactly that! We're local first, but do distributed sync with iroh. Written in rust and fully open source.

      Imho having a graph database that is really easy to use and write new CLI applications on top of works much better. You don't need strong schema validation so long as you can gracefully ignore what your schema doesn't expect, by viewing queries as type/schema declarations.

      https://github.com/magic-locker/faculties
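
      One way to read "queries as type/schema declarations" in TypeScript (a sketch, not Faculties' actual API): the query names the fields it expects, and any extra fields in stored records are silently dropped rather than rejected, so newer peers' data stays readable.

      ```typescript
      // A "query" is just the list of keys it declares. Records with extra,
      // unknown fields still match - the extras are ignored, not errors.
      function project<T extends Record<string, unknown>, K extends keyof T>(
        records: T[],
        keys: K[],
      ): Pick<T, K>[] {
        return records.map((r) => {
          const out = {} as Pick<T, K>;
          for (const k of keys) out[k] = r[k];
          return out;
        });
      }

      const stored = [
        { id: "a", name: "Ada", futureField: 42 }, // extra field from a newer peer
        { id: "b", name: "Bob", futureField: 7 },
      ];
      // The key list acts as the schema; `futureField` is gracefully ignored.
      const result = project(stored, ["id", "name"]);
      ```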

    • embedding-shape 2 hours ago
      I'd wager the problem is on the side of "LLMs can't value/rank information well enough" rather than "the graph database wasn't flexible/good enough", but I'd be happy to be shown counter-examples.

      I'm sure once that problem has been solved, you can use the built-in map/object of whatever language and it'll be good enough. Add save/load to disk via JSON and you have long-term persistence too. But since LLMs still aren't clever enough, I don't think the underlying implementation matters too much.
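
      The "built-in map plus JSON persistence" idea really is about this small. A sketch (in-memory only; `JSON.stringify`/`parse` stand in for the disk round-trip):

      ```typescript
      // Adjacency-list graph in a plain Map, serialized to JSON.
      // Map isn't directly JSON-serializable, so spread its entries first.
      type Graph = Map<string, string[]>;

      function save(graph: Graph): string {
        return JSON.stringify([...graph.entries()]);
      }

      function load(json: string): Graph {
        return new Map(JSON.parse(json) as Array<[string, string[]]>);
      }

      const g: Graph = new Map([
        ["alice", ["bob", "carol"]],
        ["bob", ["carol"]],
      ]);
      const roundTripped = load(save(g));
      ```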

      • lmeyerov 2 hours ago
        It's interesting to think about where the value comes from. Afaict there are 2 interesting areas:

        A: One of the main lessons of the RAG era of LLMs was that reranked multi-retrieval is a great balance of test-time compute and quality, at the expense of maintaining a few costly index types. Graph ended up a nice little lift when put alongside text, vector, and relational indexing, by solving some n-hop use cases.

        I'm unsure if the juice is worth the squeeze, but it does make some sense as infra. Making and using these flows isn't that conceptually complicated and most pieces have good, simple OSS around them.

        B: There is another universe of richer KG extraction with even heavier indexing work. I'm less clear on the ROI here in typical benchmarks relative to case A. Imagine going full RDF, vs the simpler property-graph queries & ontologies here, and investing in heavy entity resolution and other preprocessing during writes. I don't know how well these improve scores vs the regular multi-retrieval above, or how easy it is to do at any reasonable scale.

        In practice, a lot of KG work lives outside the DB and the agent, in a much fancier KG pipeline. So there is a missing layer with less clear proof and a value burden.

        --

        Separately, we have been thinking about these internally. We have been building GFQL, OSS GPU Cypher queries on dataframes etc. without needing a DB -- reusing existing storage tiers by moving into an embedded compute tier -- and powering our own LLM usage has been a primary internal use case for us. Our experiences have led us to prioritizing case A as a next step for what the graph engine needs to support internally, and to viewing case B as something that should live outside of it in a separate library. This post does make me wonder if case B should move closer into the engine to help streamline things for typical users, akin to how Solr/Lucene/etc. helped make Elastic into something useful early on for search.

        • alansaber 1 hour ago
          I'm conceptually very bullish on B (entity resolution and hierarchy pre-processing during writes). I'm less certain that A and B need to be merged into a single library. Obviously, a search agent should know the properties of the KG being searched, but as the previous poster mentioned, these graph DBs are inherently inaccurate and only form part of the retrieval pattern anyway.
          • lmeyerov 8 minutes ago
            Maybe it's useful to split out B1) KG pipelines from the choice of B2) simple property-graph ontologies & queries vs advanced RDF ontologies and SPARQL queries.

            It sounds like you are thinking about KG pipelines, but I'm unclear whether typed property graphs, vs more advanced RDF/SPARQL, are needed in your view on the graph-engine side?

    • phpnode 2 hours ago
      The airline graph is more complex; I can show the schema for that if you think it's useful?