Getting parseable output back from an LLM has two halves: shaping the request, then parsing the response. Even with Ollama's format: parameter set, models leak conversational preamble like "Sure! Here is the JSON:" and invent fields, so parse-and-repair is mandatory.
Input
Message types
Ollama's chat endpoint takes a messages array with four roles: system, user, assistant and tool. The following message pattern works well for us:
- one system message with instructions
- optional alternating user and assistant messages for the chat history
- a final user message with the optional retrieved RAG context and the actual question or task
Putting the actual task last primes the model to focus on the action, not the surrounding context. As models are trained on conversational patterns, we avoid consecutive user messages for the RAG context and user question.
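A minimal sketch of that layout, assuming an Ollama-style messages array (the method and variable names are illustrative, not part of any API):
def build_messages(chat_history:, rag_context:, user_query:)
  [
    # Instructions first, optional history in the middle,
    # RAG context and question bundled into the final user message.
    { role: 'system', content: 'Answer the question using only the provided context.' },
    *chat_history, # alternating user/assistant messages, possibly empty
    # See "Choosing an input format" below for ways to bundle context and question.
    { role: 'user', content: "#{rag_context}\n\n#{user_query}" },
  ]
end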
Choosing an input format
The final user message has to bundle the retrieved RAG context with the actual question. How you encode that bundle matters: it sets how cleanly the model can separate background from the task. A few options:
Plain text
The easiest format. Raw text works for simple prompts without nested structure:
Here is the relevant context to help answer my next question:
[Insert RAG Context Here]
Based on the context above, please answer this question: [Insert Final Question Here]
An easy iteration on this approach is to use Markdown: because LLMs are heavily trained on it, headers give the model stronger semantic boundaries.
# RAG Context
[Insert RAG Context Here]
# Question
[Insert Final Question Here]
However, neither of those solutions offers a way to fence user content off from your own instructions. Switch to JSON or XML once that starts to bite.
JSON
Battle-tested and familiar to every model, if a bit verbose. Wrap the retrieved context and the user's question into a single JSON object; serializing with JSON.generate escapes unbalanced quotes or braces in the dynamic content:
documents = search_results.map.with_index(1) do |result, index|
  {
    index:,
    source: result.file_name,
    matches: result.matching_snippets,
  }
end
user_message = JSON.generate({
  instructions: 'Use the documents as context to answer the question.',
  documents:,
  question: user_query,
})
XML
Each part of the structure is named by its tag, giving the model stronger semantic anchors than JSON's keys. That extra clarity is part of why XML sometimes works better than JSON, especially with smaller models. It can also be more compact for nested data. CDATA is used to escape dynamic content, NO_DECLARATION to save some tokens:
builder = Nokogiri::XML::Builder.new(encoding: 'UTF-8') do |xml|
  xml.prompt do
    xml.instructions 'Use the documents as context to answer the question.'
    xml.documents do
      search_results.each.with_index(1) do |result, index|
        xml.document(index:) do
          xml.source result.file_name
          Array(result.matching_snippets).each do |snippet|
            xml.match { xml.cdata snippet }
          end
        end
      end
    end
    xml.question { xml.cdata user_query }
  end
end

user_message = builder.doc.root.to_xml(
  save_with: Nokogiri::XML::Node::SaveOptions::NO_DECLARATION,
)
TOON
TOON is another compact format designed to save tokens. It is worth a look on the input side when you run into long generation times due to inherently large context sizes. Do not ask the model to produce TOON, though: LLMs are not trained on it, and unfamiliar output formats degrade quality more than the token savings buy you.
Input Recommendations
- Match the input format to the output format. Do not make the model translate between two structures in one shot.
- Always escape user content that could break structure.
- Try stripping whitespace from the input. Output tokens dominate latency since they are generated sequentially, so compacting the response shape pays off most, but compact input still saves context budget. Note however that sometimes whitespace carries semantics (like indentation), so be ready to revert this change if you notice a quality regression.
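For example, retrieved snippets can be compacted before they are interpolated into the prompt. A small sketch (plain Ruby; skip it for content where indentation is meaningful):
# Collapse runs of whitespace in each snippet to save input tokens.
matches = result.matching_snippets.map { |snippet| snippet.gsub(/\s+/, ' ').strip }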
Output
Choosing an output format
Both Ollama's and OpenAI's APIs only officially support JSON as structured output. Start with JSON; build an XML response parsing harness only if the JSON results are not good enough.
JSON
Ollama's API takes a format: parameter. Two options:
- set it to 'json' for loose JSON mode. The model will probably respond with JSON, but it might not always follow your instructed schema.
- pass a JSON schema instead. This constrains generation token-by-token and is the recommended path for production use.
See https://docs.ollama.com/capabilities/structured-outputs#generating-structured-json-with-a-schema.
schema = {
  type: 'object',
  required: %w[reasoning text],
  additionalProperties: false,
  properties: {
    reasoning: { type: 'string' },
    text: { type: 'string' },
  },
}
Ollama.chat(
  model: 'qwen3.5:35b',
  format: schema,
  messages: [
    { role: 'system', content: 'Respond with JSON: { "reasoning": "...", "text": "..." }' },
    { role: 'user', content: prompt },
  ],
)
The JSON Schema's description is not passed to the model! Ollama uses the schema for grammar-constrained decoding, but the property description is currently silently ignored - the model never sees it. If a description matters, also describe the field in your prompt prose or via a worked example.
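For the schema above, that could look like the following system prompt (the field descriptions are illustrative):
system_prompt = <<~PROMPT
  Respond with JSON: { "reasoning": "...", "text": "..." }
  - "reasoning": one or two sentences explaining how the documents support the answer
  - "text": the final answer as plain text, no markdown
PROMPT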
As of April 2026, Ollama has an open issue about JSON schema adherence in the Qwen 3.5 and 3.6 series. If you use them, you might be better off inlining the schema definition in the prompt instead of passing it via format:.
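A sketch of that workaround, reusing the schema hash from above and dropping format: entirely. Without format: there is no token-level constraint, so the sanitize and validate steps below become even more important:
Ollama.chat(
  model: 'qwen3.5:35b',
  messages: [
    # Embed the schema in the prompt instead of passing it via format:.
    { role: 'system', content: "Respond with a single JSON object matching this JSON Schema:\n#{JSON.pretty_generate(schema)}" },
    { role: 'user', content: prompt },
  ],
)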
XML
Ollama's format: parameter only supports JSON, so request XML via prose in the prompt and parse with Nokogiri afterwards. The sanitize and validate steps below apply the same way; only the parser changes.
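A minimal parsing sketch, assuming you instructed the model to answer with a response element containing reasoning and text children (those element names are an assumption for this example, not part of Ollama's API):
def parse_xml_content(content)
  # Strip markdown fences, just like in the JSON helper below.
  clean = content.strip.sub(/\A```(?:xml)?\s*/i, '').sub(/\s*```\z/, '')
  doc = Nokogiri::XML(clean) { |config| config.strict }
  {
    'reasoning' => doc.at_xpath('/response/reasoning')&.text,
    'text' => doc.at_xpath('/response/text')&.text,
  }
rescue Nokogiri::XML::SyntaxError => error
  raise "LLM response was not valid XML: #{error.message}"
end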
Define a few valid examples
Two or three example responses anchor the desired output shape. Place them in a dedicated "examples" block before the user's query, and keep them abstract: concrete-looking values pull the model's attention toward surface details of the example rather than the structural shape you want. The nanonets few-shot guide has more on this topic.
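A sketch of such a block, kept deliberately abstract (the placeholder wording is illustrative):
# Abstract example responses, placed before the final user message.
examples = <<~EXAMPLES
  Examples of valid responses (shape only, not real content):
  {"reasoning": "<one sentence citing the relevant document>", "text": "<the answer>"}
  {"reasoning": "<why the documents do not answer the question>", "text": "<a brief refusal>"}
EXAMPLES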
Sanitize and parse the response
Parse-and-repair is necessary even when format: is set. You may use this simple helper:
def parse_json_content(content)
  return {} if content.blank?

  clean = content.strip
  clean = clean.sub(/\A```(?:json)?\s*/i, '').sub(/\s*```\z/, '')
  JSON.parse(clean, decimal_class: BigDecimal)
end
Real responses may contain ```json fences, a leading "Sure!" line, or a trailing comment. Strip the fence and consider retrying one to three times if the response is still not valid JSON (see the retry sketch further down).
Validate the schema
As we already define a JSON Schema for the LLM, we can re-use it for response validation with the json-schema gem. Pass the same schema from above and the parsed response:
require 'json-schema'
parsed = parse_json_content(response.message.content)
errors = JSON::Validator.fully_validate(schema, parsed)
raise "LLM response did not match schema: #{errors.join(', ')}" if errors.any?
As with parse failures, you can feed the validation errors back to the LLM and retry.
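A sketch of that retry loop, reusing parse_json_content and the schema from above and assuming the Ollama client wrapper used in the earlier snippets (the attempt limit and wording are arbitrary):
def chat_with_repair(messages, schema:, attempts: 3)
  attempts.times do
    response = Ollama.chat(model: 'qwen3.5:35b', format: schema, messages:)
    content = response.message.content

    begin
      parsed = parse_json_content(content)
      errors = JSON::Validator.fully_validate(schema, parsed)
      return parsed if errors.empty?

      feedback = "Your response did not match the required schema: #{errors.join(', ')}"
    rescue JSON::ParserError => error
      feedback = "Your response was not valid JSON: #{error.message}"
    end

    # Feed the failure back to the model and let it try again.
    messages += [
      { role: 'assistant', content: content },
      { role: 'user', content: "#{feedback}. Please respond again with a single valid JSON object." },
    ]
  end

  raise 'LLM did not return a usable response after several attempts'
end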
Output Recommendations
- Pass a JSON schema to format:, not just 'json'. Token-by-token constraints are much stronger than loose mode.
- Repeat the "respond with JSON" or "respond with XML" instruction in prompt prose. format: alone is unreliable on smaller models.
- Sanitize before parsing. Strip markdown fences and prose preambles even when format: is set.
- Validate parsed responses against your schema.
- Fail loudly if anything is wrong.