afaik if you're running the embedding model on a GPU, or quantized on a CPU, it shouldn't be super slow. But I also haven't run much of this stuff locally yet.
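
For what it's worth, a minimal sketch of that GPU-vs-CPU choice, assuming a sentence-transformers embedding model (the library, model name, and batch here are placeholders for illustration, not anything from this thread):

```python
# Hypothetical sketch: embed a batch on the GPU if one is available, else CPU.
# sentence-transformers and the model name are assumptions, not from the thread.
import time

import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

texts = ["some text to embed"] * 256

start = time.perf_counter()
embeddings = model.encode(texts, batch_size=64, show_progress_bar=False)
elapsed = time.perf_counter() - start
print(f"{device}: embedded {len(texts)} texts in {elapsed:.2f}s, dim={embeddings.shape[1]}")
```

A small model like this should be quick on a GPU; on CPU you'd typically reach for a quantized variant to keep latency down.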

147 sats \ 2 replies \ @optimism 2h

I've been running it on Apple Metal; torch says it's using the NPU, but the Apple part is probably why it's such a mess.
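
In case it helps with debugging, a quick generic check of what torch can actually target on Apple silicon is to poke at the MPS (Metal) backend directly; this sketch isn't tied to any particular model:

```python
# Sketch: check whether this torch build can use Apple's Metal backend (MPS)
# and run a tiny op there. Purely illustrative.
import torch

print("MPS built into this torch build:", torch.backends.mps.is_built())
print("MPS available at runtime:", torch.backends.mps.is_available())

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)
y = x @ x  # the matmul runs on the Metal GPU when device == "mps"
print("ran matmul on:", y.device)
```

As far as I know, torch's MPS backend runs on the GPU rather than the Neural Engine, so an "NPU" report would be coming from some other layer.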

147 sats \ 1 reply \ @k00b OP 2h

We were only scratching the surface when I was in college, but everyone imagined inference would be much cheaper/more efficient than it ended up being.

If bigger=smarter holds forever, edge inference will always be relatively slow/dumb.

147 sats \ 0 replies \ @optimism 1h

Like with all things, that extrapolation of the upslope fails to consider that fun isn't infinite (I hate this fact of life). So there's a time when bigger=smarter, and a time when diminishing returns set in on how much smarter you get for your bigger, and at that equilibrium, suddenly smarter=smarter.

We'll get there.
