Vertex AI

local LLM inference via Ollama, zero cloud bills

Why

Every Vertex AI API call costs money. Every prompt iteration, every integration test, every debug session burns credits. localgcp proxies your Vertex AI calls to local models running via Ollama. The official google.golang.org/genai SDK works unchanged -- just set the BaseURL to localgcp. No API keys, no quotas, no bills.

"But I can already use Ollama directly?"

You can. Ollama is great. But then your local code uses Ollama's API, and your production code uses the Vertex AI SDK. Two different APIs, two different code paths, two different sets of bugs to find in production.

With localgcp, your production code is your test code:

// Production: hits Vertex AI on Google Cloud
client, _ := genai.NewClient(ctx, prodConfig)
resp, _ := client.Models.GenerateContent(ctx, "gemini-2.5-flash", prompt, nil)

// Local dev: hits localgcp -> Ollama -> Gemma running on your laptop
client, _ := genai.NewClient(ctx, localConfig)  // only BaseURL differs
resp, _ := client.Models.GenerateContent(ctx, "gemini-2.5-flash", prompt, nil)

Same SDK. Same model name. Same response parsing. Same error handling. The only difference is one config line: BaseURL: "http://localhost:8090".

This also means your CI/CD pipeline tests the real integration path (SDK call, response parsing, error handling) without API keys or cloud bills. Stub mode returns deterministic responses, so no Ollama is needed in CI.

Prerequisites

Install Ollama and pull a model:

$ brew install ollama     # or download from ollama.com
$ ollama serve            # start the Ollama server
$ ollama pull gemma3      # pull a model

Without Ollama running, localgcp falls back to deterministic stub responses (see Stub Mode below).

Configuration

Flag                 Default                    Description
--port-vertexai      8090                       Vertex AI emulator port
--ollama-host        http://localhost:11434     Ollama API endpoint
--vertex-model-map   (built-in defaults)        Map Vertex model names to Ollama models

$ localgcp up \
    --ollama-host=http://localhost:11434 \
    --vertex-model-map="gemini-2.5-flash=llama3.2,text-embedding-004=nomic-embed-text"

Go SDK example

This is a complete, runnable example using the official google.golang.org/genai SDK:

package main

import (
    "context"
    "fmt"
    "log"

    "google.golang.org/genai"
)

func main() {
    ctx := context.Background()

    client, err := genai.NewClient(ctx, &genai.ClientConfig{
        Project:  "my-project",
        Location: "us-central1",
        Backend:  genai.BackendVertexAI,
        HTTPOptions: genai.HTTPOptions{
            BaseURL: "http://localhost:8090",
        },
    })
    if err != nil {
        log.Fatal(err)
    }

    resp, err := client.Models.GenerateContent(ctx,
        "gemini-2.5-flash",
        genai.Text("Explain quantum computing in one sentence"),
        nil,
    )
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(resp.Candidates[0].Content.Parts[0].Text)
    // Response comes from Ollama (e.g. llama3.2) running locally
}

Key point: The only change from production code is setting BaseURL. Your model name (gemini-2.5-flash) is automatically mapped to the local Ollama model via --vertex-model-map.

Model alias table

Map Vertex AI model names to Ollama models with --vertex-model-map:

Vertex AI Model      Ollama Model        Use Case
gemini-2.5-flash     llama3.2            Fast text generation
gemini-2.5-pro       gemma3              Higher quality generation
gemini-2.0-flash     llama3.2            Legacy model alias
text-embedding-004   nomic-embed-text    Text embeddings

Override defaults with comma-separated pairs:

$ localgcp up --vertex-model-map="gemini-2.5-flash=mistral,gemini-2.5-pro=llama3.2:70b"
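The flag value is plain key=value pairs joined by commas. An illustrative parser (not localgcp's actual implementation) shows how such a value resolves to a lookup table:

```go
package main

import (
	"fmt"
	"strings"
)

// parseModelMap splits "a=b,c=d" into {"a": "b", "c": "d"} -- the same
// pair syntax --vertex-model-map accepts. This is a sketch for
// illustration, not localgcp's own code.
func parseModelMap(s string) map[string]string {
	m := make(map[string]string)
	for _, pair := range strings.Split(s, ",") {
		if k, v, ok := strings.Cut(pair, "="); ok {
			m[k] = v
		}
	}
	return m
}

func main() {
	m := parseModelMap("gemini-2.5-flash=mistral,gemini-2.5-pro=llama3.2:70b")
	fmt.Println(m["gemini-2.5-flash"], m["gemini-2.5-pro"]) // prints mistral llama3.2:70b
}
```

Any Vertex model name not listed in the flag falls back to the built-in defaults shown in the table above.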

Embeddings example

package main

import (
    "context"
    "fmt"
    "log"

    "google.golang.org/genai"
)

func main() {
    ctx := context.Background()

    client, err := genai.NewClient(ctx, &genai.ClientConfig{
        Project:  "my-project",
        Location: "us-central1",
        Backend:  genai.BackendVertexAI,
        HTTPOptions: genai.HTTPOptions{
            BaseURL: "http://localhost:8090",
        },
    })
    if err != nil {
        log.Fatal(err)
    }

    resp, err := client.Models.EmbedContent(ctx,
        "text-embedding-004",
        genai.Text("What is the meaning of life?"),
        nil,
    )
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Embedding dimensions: %d\n", len(resp.Embeddings[0].Values))
    // Proxied to nomic-embed-text via Ollama
}

Stub mode for CI/CD

When Ollama is not running, localgcp automatically returns deterministic stub responses. This is ideal for CI/CD pipelines that need to test Vertex AI integration code without running a model:

# CI/CD: just start localgcp, no Ollama needed
$ localgcp up &
$ go test ./...   # Vertex AI calls return stub responses

Limitations

See the roadmap for upcoming features.