Run VeloFill with Ollama and Gemma 3
Spin up Gemma 3 through Ollama, point VeloFill at your localhost endpoint, and keep sensitive data inside your network.
Looking for more advanced features? This guide covers the simplest way to connect VeloFill to a local model. If you need to manage multiple models, enforce stricter security, or centralize logging, see our guide on routing VeloFill through LiteLLM.
Why pair VeloFill with Gemma 3
Running Gemma 3 locally gives you fast responses and keeps regulated data off public clouds. VeloFill speaks the same OpenAI-compatible API that Ollama exposes, so a few configuration changes let you automate forms without sending prompts over the internet.
Prerequisites
- A machine with a GPU that has at least 6 GB of VRAM for the 4B model (12 GB for 12B, 24 GB for 27B). 16 GB of system RAM is a minimum, with 32 GB recommended.
- Ollama installed and running on macOS, Windows, or Linux.
- VeloFill extension installed in Chrome, Edge, or Firefox.
- Optional but encouraged: GPU acceleration for faster inference times.
Step 1: pull Gemma 3 through Ollama
- Open a terminal on the machine hosting Ollama.
- Ensure the Ollama application is running. If it was installed as a service, it should already be active. If not, start it manually with `ollama serve`.
- Download the Gemma 3 model: `ollama pull gemma3:4b`
- (Optional) Test the model locally to confirm it responds: `ollama run gemma3:4b "Summarize VeloFill in two sentences."`
Tip: If you need more capability and have 12+ GB of VRAM, try `gemma3:12b`. For maximum quality with 24+ GB of VRAM, use `gemma3:27b`. The instructions below work the same way; just swap the tag where needed.
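Once the pull finishes, you can confirm the Ollama server and the model are reachable from the command line (this assumes Ollama's default bind address and port):

```shell
# List the models the local Ollama server has available.
# The JSON response should include an entry named "gemma3:4b".
curl -s http://127.0.0.1:11434/api/tags
```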
Which Gemma 3 model should you use?
Gemma 3 is available in several sizes. Here’s how to choose:
| Model | VRAM Needed | Best For |
|---|---|---|
| `gemma3:4b` | 6 GB | Most users: fast and efficient |
| `gemma3:12b` | 12 GB | Better quality, still responsive |
| `gemma3:27b` | 24 GB | Maximum quality for complex forms |
Looking for Gemma 3 9B? Google's Gemma 3 does not have a 9B variant. The Gemma 2 family included a 9B model, but Gemma 3 uses different sizes (4B, 12B, 27B for multimodal models). If you have 10-12 GB of VRAM and were hoping to use a 9B model, we recommend `gemma3:12b`: it's the closest match and offers better performance.
Other model options
Gemma 3 isn’t your only option. Ollama supports many models suitable for form filling:
| Model | Size | VRAM Needed | Best For |
|---|---|---|---|
| `llama3.2:3b` | 2 GB | 4 GB | Fast, simple forms |
| `llama3.2:latest` | 4.7 GB | 8 GB | General purpose (recommended) |
| `mistral:7b` | 4.1 GB | 8 GB | Strong reasoning |
| `qwen2.5:7b` | 4.7 GB | 8 GB | Good multilingual support |
| `qwen3:8b` | 4.8 GB | 8 GB | Latest Qwen, strong performance |
To use a different model, simply pull it and update the Model ID in VeloFill:
```shell
# Example: use Llama 3.2 instead
ollama pull llama3.2:latest
```

Then enter `llama3.2:latest` as the Model ID in VeloFill.
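If you're unsure which model to settle on, a quick timing run gives a rough feel for latency on your hardware (the prompt is arbitrary, and the models must already be pulled; the first run of each includes model load time):

```shell
# Compare wall-clock response time across candidate models.
time ollama run llama3.2:latest "Return only the word OK."
time ollama run gemma3:4b "Return only the word OK."
```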
Step 2: configure CORS for browser access
Ollama listens on `http://127.0.0.1:11434` by default, but browser extensions require explicit CORS permission. You must set the `OLLAMA_ORIGINS` environment variable before starting Ollama:
macOS/Linux:

```shell
export OLLAMA_ORIGINS="*"
ollama serve
```

Windows (PowerShell):

```powershell
$env:OLLAMA_ORIGINS="*"
ollama serve
```
Important: Restart Ollama after setting this variable. On macOS, quit the Ollama app and reopen it. On Linux with systemd, you may need to edit the service:

```shell
sudo systemctl edit ollama.service
```

Add:

```
[Service]
Environment="OLLAMA_ORIGINS=*"
```

Then run `sudo systemctl daemon-reload && sudo systemctl restart ollama`.
Warning: The Ollama API provides unrestricted access by default. Do not expose the Ollama port directly to the public internet, as this would allow anyone to use your model.
If you need to access Ollama from a different workstation, use a secure method like an SSH tunnel or a reverse proxy that can add an authentication layer.
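To check that the CORS setting took effect, you can send a preflight-style request from the command line and look for the `Access-Control-Allow-Origin` response header (the extension origin below is a placeholder, not VeloFill's real origin):

```shell
# Send an OPTIONS preflight like a browser extension would,
# then inspect the CORS headers in the response.
curl -si -X OPTIONS http://127.0.0.1:11434/v1/chat/completions \
  -H "Origin: chrome-extension://example" \
  -H "Access-Control-Request-Method: POST" \
  | grep -i "access-control-allow-origin"
```

If nothing prints, `OLLAMA_ORIGINS` was not picked up and Ollama needs to be restarted with the variable set.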
Step 3: point VeloFill at Gemma 3
- Open the VeloFill extension and choose Options → LLM Provider.
- Select OpenAI-compatible / Custom endpoint.
- Set the endpoint URL to `http://127.0.0.1:11434/v1`.
- Leave the API key field blank unless you put Ollama behind a proxy that enforces keys.
- Under Model ID, enter `gemma3:4b` (or the tag you pulled earlier).
- Save the configuration.
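Before switching back to the browser, you can imitate the kind of request VeloFill will send to the OpenAI-compatible endpoint (the prompt text is illustrative):

```shell
# Minimal chat completion against Ollama's OpenAI-compatible route.
# A JSON response with a "choices" array confirms the endpoint works.
curl -s http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma3:4b",
        "messages": [{"role": "user", "content": "Say hello in five words."}]
      }'
```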
Optional: adjust model parameters
These parameters can be adjusted directly within the VeloFill extension’s LLM Provider options screen.
- Max tokens: Gemma 3 handles up to 8K tokens comfortably; cap requests to avoid latency spikes.
- Temperature: Start at `0.2` for deterministic autofill responses; increase gradually for more creative copy.
- System prompt: Reinforce internal style guides or classification instructions to keep outputs on-brand.
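As a sketch of where these settings land, here is a hypothetical request body in the OpenAI chat-completion schema (the values are examples, not recommendations for every form):

```shell
# Hypothetical request body showing max tokens, temperature,
# and a system prompt in the OpenAI-compatible schema.
PAYLOAD='{
  "model": "gemma3:4b",
  "temperature": 0.2,
  "max_tokens": 512,
  "messages": [
    {"role": "system", "content": "Follow the internal style guide. Return field values only."},
    {"role": "user", "content": "Fill in: Company name"}
  ]
}'
echo "$PAYLOAD"
```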
Step 4: validate the setup inside VeloFill
- Open a low-risk form (newsletter signup, sandbox CRM, etc.).
- Trigger the VeloFill autofill workflow.
- Watch the status panel—responses should show Gemma 3 as the active model, returning in a couple of seconds.
- If nothing happens, open the browser devtools console and look for network calls to `/v1/chat/completions`. A `200` response confirms the integration is live.
Troubleshooting checklist
- Connection refused: Ensure `ollama serve` is running and that firewalls allow port 11434.
- Model download slow: Pull the smaller `gemma3:4b` on lower-bandwidth connections and upgrade to a larger tag later.
- Latency high: Reduce prompt size, lower max tokens, or move the model to a machine with a stronger GPU.
- Prompt privacy: Keep the Ollama host on your internal network. Avoid port forwarding directly to the public internet without authentication.
- CORS errors: Ensure `OLLAMA_ORIGINS="*"` is set and Ollama was restarted after setting it.
- Model not found: Verify the model is downloaded with `ollama list`. The name must match exactly, including the tag (e.g., `gemma3:4b`, not just `gemma3`).
- Out of memory: If you see memory errors, try a smaller model (e.g., `gemma3:4b` instead of `gemma3:12b`) or close other GPU-intensive applications. Check available VRAM with `nvidia-smi` on Linux/Windows.
- Model not loading: Check whether the model is running with `ollama ps`. If it shows 0% GPU utilization, the model may be falling back to CPU, which is much slower.
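The checklist above can be condensed into one quick diagnostic script (this assumes the default port and the `gemma3:4b` tag; adjust both to match your setup):

```shell
#!/bin/sh
# Quick health check for the VeloFill + Ollama setup.
HOST="http://127.0.0.1:11434"

# 1. Server reachable?
curl -sf "$HOST/api/tags" > /dev/null \
  && echo "server: ok" || echo "server: unreachable"

# 2. Model pulled? (tag must match the Model ID in VeloFill)
ollama list | grep -q "gemma3:4b" \
  && echo "model: found" || echo "model: missing"

# 3. CORS header present for extension origins?
curl -si -X OPTIONS "$HOST/v1/chat/completions" \
  -H "Origin: chrome-extension://example" \
  -H "Access-Control-Request-Method: POST" \
  | grep -qi "access-control-allow-origin" \
  && echo "cors: ok" || echo "cors: header missing"
```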
Keep iterating
- Layer VeloFill knowledge bases so Gemma 3 can reference your private SOPs.
- Schedule periodic model refreshes when new Gemma 3 builds release through Ollama.
- Pair with LiteLLM if you want a unified gateway for multiple local models.
Need more help? See our troubleshooting guide for additional support, or check the Ollama documentation for model-specific issues.
Need a guided walkthrough?
Our team can help you connect VeloFill to your workflows, secure API keys, and roll out best practices.