LLM News Digest

Tag: open-source

Benchmarking Ollama vs LM Studio vs MLX
Intermediate

A hands-on performance comparison of three popular local LLM inference tools (Ollama, LM Studio, and MLX), prompted by one of them feeling laggy in practice. If you are choosing between local inference options or debugging slow self-hosted models, this benchmarking approach shows how to evaluate tools systematically instead of relying on paper specs.

Takeaways
  • A tool that feels slow has to be diagnosed with systematic benchmarking; checking specs on paper is not enough.
  • Ollama, LM Studio, and MLX differ measurably in ways that affect real-world use.
  • A sound benchmark for LLM inference tools measures both throughput and latency.
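
As a rough illustration of measuring both throughput and latency, here is a minimal Python sketch that times any iterator of streamed tokens. It is not the method used in the linked post: `measure_stream` and `fake_stream` are hypothetical names, and the fake stream stands in for whatever streaming client a given tool exposes.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, Optional


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Optional[float]]:
    """Consume a token stream and report latency and throughput figures."""
    start = clock()
    first: Optional[float] = None
    last = start
    count = 0
    for _ in tokens:
        last = clock()          # timestamp of the token just received
        if first is None:
            first = last        # arrival time of the first token
        count += 1
    decode_s = (last - first) if first is not None else 0.0
    return {
        "tokens": count,
        # Latency: time to first token (prompt processing plus startup).
        "ttft_s": None if first is None else first - start,
        # Throughput: tokens per second after the first token arrived.
        "decode_tok_per_s": (count - 1) / decode_s
        if count > 1 and decode_s > 0
        else None,
        "total_s": last - start,
    }


def fake_stream(n: int, first_delay_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a real tool's streaming response."""
    time.sleep(first_delay_s)   # simulated prompt-processing delay
    for i in range(n):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_stream(5, 0.05, 0.01)))
```

Reporting time to first token separately from decode speed matters because a tool can generate quickly yet still feel laggy if the first token arrives late.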
from Apr 27, 2026 · via manual
Show HN: Gemma Gem – AI model embedded in a browser – no API keys, no cloud
Intermediate

ikessler

This Chrome extension demonstrates practical browser-based AI deployment by running Google's Gemma 4 model locally via WebGPU, with webpage interaction capabilities such as clicking, typing, and JavaScript execution. It shows that capable AI agents can run entirely client-side without API dependencies, which opens new possibilities for privacy-preserving AI tools with real-world utility.

Takeaways
  • WebGPU enables running 2B-parameter models entirely in the browser without cloud dependencies.
  • Local AI agents can interact with web pages through tool calling while preserving user privacy.
  • Browser-based AI deployment eliminates API costs and latency while maintaining reasonable functionality.
from Apr 6, 2026 · 100 points on HN · via api-hn
Streaming experts
Intermediate

A breakthrough technique for running massive Mixture-of-Experts models (up to 1 trillion parameters) on consumer hardware by streaming from SSD only the expert weights needed for each token. This could open state-of-the-art models to teams without enterprise-scale infrastructure, at the cost of higher latency.

Takeaways
  • Streaming expert weights from SSD enables running models 10x larger than available RAM would normally allow.
  • The technique makes trillion-parameter models accessible on consumer hardware, potentially changing deployment economics.
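
The general idea of loading only the experts a token needs can be sketched in a few lines of Python. This is an illustrative assumption, not the implementation from the linked post: the LRU cache, the `ExpertCache` and `top_k_experts` names, and the in-memory loader standing in for SSD reads are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable, List, Sequence


class ExpertCache:
    """Keeps at most `capacity` experts in RAM; loads the rest on demand."""

    def __init__(self, load_expert: Callable[[int], List[float]], capacity: int):
        self._load = load_expert      # assumed to read one expert from SSD
        self._capacity = capacity
        self._cache: "OrderedDict[int, List[float]]" = OrderedDict()
        self.loads = 0                # number of (slow) storage reads

    def get(self, expert_id: int) -> List[float]:
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)   # mark as recently used
            return self._cache[expert_id]
        weights = self._load(expert_id)
        self.loads += 1
        self._cache[expert_id] = weights
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)      # evict least recently used
        return weights


def top_k_experts(router_scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring experts for one token."""
    order = sorted(range(len(router_scores)),
                   key=lambda i: router_scores[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Hypothetical loader: a real one would read a weight slice from disk.
    cache = ExpertCache(lambda i: [float(i)] * 4, capacity=2)
    for scores in ([0.1, 0.7, 0.2, 0.9], [0.8, 0.6, 0.1, 0.3]):
        active = [cache.get(i) for i in top_k_experts(scores, k=2)]
        print(len(active), "experts active, storage reads so far:", cache.loads)
```

Each cache miss corresponds to a storage read, which is where the latency tradeoff mentioned in the summary comes from.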
from Mar 29, 2026 · via rss-willison