In the previous post, we explored how guided generation with language models works and how it can be used to extract data from unstructured text. However, those examples relied on Python frameworks, and these days it may seem that other programming languages are being left behind by all the AI hype happening around Python.
Here I’ll reuse the structured data extraction example from the previous post, and we’ll see how to achieve the same results with llama.cpp and Ruby.
llama.cpp
Last time I already touched upon llama.cpp, which we used as a backend for lm-format-enforcer, though only through a Python wrapper.
I have already praised llama.cpp for democratizing access to LLMs: it makes it possible to run them almost anywhere.
Another cool aspect of llama.cpp is that it’s not just a huge framework for running LLMs whose API lets you control and tweak many aspects of text inference; it also ships a bunch of tools that can be used regardless of the programming language you work in.
Setting up
There are several ways to install llama.cpp: using a package manager such as brew, building it yourself, or running it via a container. The documentation is quite comprehensive, so if you need any additional guidance on installation, I recommend checking it out.
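For reference, this is roughly what the common options look like; treat the exact commands as a sketch (formula names and build steps may change over time) and defer to the official installation docs for your platform:

```sh
# Via Homebrew on macOS/Linux (formula name at the time of writing)
brew install llama.cpp

# Or build from source, assuming a standard CMake toolchain
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release
```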
Tools
Once you have it installed, you can see that the package contains a variety of tools. At the time of writing, it includes the following:
convert_hf_to_gguf.py, llama-baby-llama, llama-batched, llama-batched-bench, llama-bench, llama-bench-matmult, llama-cli, llama-convert-llama2c-to-ggml, llama-cvector-generator, llama-embedding, llama-eval-callback, llama-export-lora, llama-gbnf-validator, llama-gguf, llama-gguf-hash, llama-gguf-split, llama-gritlm, llama-imatrix, llama-infill, llama-llava-cli, llama-lookahead, llama-lookup, llama-lookup-create, llama-lookup-merge, llama-lookup-stats, llama-minicpmv-cli, llama-parallel, llama-passkey, llama-perplexity, llama-quantize, llama-quantize-stats, llama-retrieval, llama-save-load-state, llama-server, llama-simple, llama-speculative, llama-tokenize
llama-cli provides LLM text inference as a command-line tool and is useful for experimentation and debugging; check out the documentation to see all the options you can control. However, in this tutorial we’ll focus on llama-server.
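For example, a quick sanity check from the terminal might look like the following, using the same model file we’ll use later in this post (see llama-cli --help for the full list of options):

```sh
llama-cli -m Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -p "Give me a one-sentence summary of Toy Story." -n 128
```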
Constraining output
llama.cpp has built-in support for guided generation using grammars. If you checked the documentation, you might have noticed that both llama-cli and llama-server support the following parameters:
- --json-schema SCHEMA to constrain the model output to a given schema.
- --grammar GRAMMAR, --grammar-file FILE to force the model output to follow a specific format described by a BNF type of grammar; we’ll talk about it later.
JSON schema
The JSON schema is the easiest way to constrain model output. It’s convenient because JSON is widely supported across many programming languages and frameworks.
And since I mentioned Ruby before, let’s see how we can generate one using Ruby. I’ll use dry-schema from dry-rb for this.
require 'dry/schema'

# Enable the JSON Schema export extension
Dry::Schema.load_extensions(:json_schema)

MovieSchema = Dry::Schema.JSON do
  required(:title).filled(:string)
  required(:director).filled(:string)
  required(:cast).array(:string)
end

MovieSchema.json_schema.to_json
Which will give us a schema like:
{
"type": "object",
"properties": {
"title": {
"type": "string",
"minLength": 1
},
"director": {
"type": "string",
"minLength": 1
},
"cast": {
"type": "array",
"items": {
"type": "string"
}
}
},
"required": [
"title",
"director",
"cast"
]
}
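If you go the JSON schema route, you can feed this generated schema straight to the server via the --json-schema option mentioned above. A minimal sketch, assuming a small hypothetical generate_schema.rb script that simply prints MovieSchema.json_schema.to_json:

```sh
# generate_schema.rb is a hypothetical helper that prints the schema JSON
llama-server -m Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --json-schema "$(ruby generate_schema.rb)"
```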
BNF grammar
Another option, as I mentioned, is to provide a BNF grammar.
BNF stands for Backus–Naur form, a notation from formal language theory used to define rules that describe the allowed sequences of characters. Going deep into BNF details is beyond the scope of this post, so let’s stay practical. For more details, you can refer to the grammar guide in the repository.
A BNF grammar is particularly useful when we want the output to follow a structure other than JSON, such as YAML. It’s also helpful if we want a custom format for function/tool calling, or if we want the output to follow a certain reasoning scheme, such as Chain of Thought (CoT) or Reasoning and Acting (ReAct).
To keep things consistent with the previous post, we’ll focus on generating JSON. However, for this demo, instead of using the JSON schema option, we’ll use the grammar option to keep the demo generic.
BNF grammar for JSON
Let’s have a look at the BNF grammar for a movie JSON object.
movie_json.bnf
root ::= "{" space title-kv "," space director-kv "," space cast-kv "}"
title-kv ::= "\"title\"" space ":" space string
director-kv ::= "\"director\"" space ":" space string
cast-kv ::= "\"cast\"" space ":" space cast-array
cast-array ::= "[" space (string ("," space string)*)? "]" space
string ::= "\"" char* "\"" space
char ::= [^"\\\x7F\x00-\x1F] | [\\] (["\\bfnrt] | "u" [0-9a-fA-F]{4})
space ::= | " " | "\n"
Let’s break it down.
We start with the root element (on line 1), which is required by llama.cpp; it defines the entire output. The output should start with an opening brace { and end with a closing brace }, and in between we expect the following fields: title, director, and cast, separated by commas. We also allow some space between elements, defined on line 8. The space can be empty, a single space, or a newline, which aligns with the JSON format.
Next, let’s look at the title element (line 2). This rule defines the field named "title", followed by a colon : and then a string value. The string is defined (on line 6) as any sequence of characters enclosed in quotes, and the characters allowed in a string are defined by the char rule (on line 7), which allows almost every character except special ones.
The director element (see line 3) follows the same structure as the title. The field name is "director", enclosed in quotes, followed by :, and then an arbitrary string value in quotes, similar to how the title rule works.
And finally, the cast element (line 4) is a bit more complex, since it’s an array of strings. Again we have the field name "cast" in quotes, followed by :, and then the cast-array (defined on line 5). The array starts with an opening square bracket [ and ends with a closing square bracket ]. Within the array, we can have zero or more strings, and if there’s more than one, they should be separated by commas. The rule also allows the array to be empty, meaning it’s valid to have just the brackets with no strings inside.
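To give a feel for how the grammar can evolve, here is a small sketch of what adding a numeric year field might look like; only the changed root rule and the new rules are shown, everything else in movie_json.bnf stays the same (this is an illustration, not something used later in the demo):

```
root    ::= "{" space title-kv "," space director-kv "," space cast-kv "," space year-kv "}"
year-kv ::= "\"year\"" space ":" space number
number  ::= [0-9]+ space
```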
Note: llama.cpp has a tool in its source code for converting a JSON schema to a BNF grammar, see here. That can be a useful starting point: convert your schema to a grammar and then extend it by hand, in case the JSON schema functionality is not enough.
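If you take that path, the converter is a small Python script shipped in the llama.cpp source tree; assuming it still lives under examples/ (the location and interface may differ between versions), converting our schema could look roughly like this:

```sh
python examples/json_schema_to_grammar.py movie_schema.json > movie_json.bnf
```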
Putting it all together
Running llama-server
The tool that allows us to use LLMs from almost any language is llama-server (where “almost any” means any language with HTTP support). It is basically an HTTP server built on top of llama.cpp functionality, providing LLM inference via a set of endpoints.
Now, let’s run llama-server with the grammar above and see how we can use it to extract the data.
llama-server -m Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --grammar-file movie_json.bnf --ctx-size 32768
Here we used the following options:
- -m sets the model filename; here we use the same Llama 3.1 model as in the previous post.
- --grammar-file specifies the BNF grammar file. Note that it sets the grammar globally, so we don’t need to provide it with every request.
- --ctx-size 32768 reduces the context size to 32k tokens for the demo, down from the model’s default 128k.
The server will start on the default port 8080.
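Once it’s running, you can quickly check that it’s reachable; recent versions of llama-server expose a simple health endpoint:

```sh
curl http://localhost:8080/health
```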
Querying the server
We’re going to use the completion endpoint of llama-server. It accepts a variety of parameters that let you control text generation, but the most important one is the prompt; the name speaks for itself.
With a few lines of pure Ruby code, we can extract movie information from the HTML data.
require 'net/http'
require 'json'
require 'uri'

def build_prompt(html)
  <<~PROMPT
    Parse the following HTML and give me the result as a JSON object containing the title, director, and cast (as an array of actors).
    Here is the HTML:
    ```html
    #{html}
    ```
    Give me JSON only.
  PROMPT
end

# POST the prompt to llama-server's completion endpoint
def send_request(prompt)
  uri = URI('http://localhost:8080/completion')
  http = Net::HTTP.new(uri.host, uri.port)

  request = Net::HTTP::Post.new(uri.path, { 'Content-Type' => 'application/json' })
  request.body = { prompt: }.to_json # shorthand hash literal, Ruby 3.1+

  http.request(request)
end

# The server wraps its output in a JSON envelope; the generated text lives in "content"
def parse_response(response)
  parsed_response = JSON.parse(response.body)
  JSON.parse(parsed_response['content'])
end

data = File.read('data.html').strip
prompt = build_prompt(data)

response = send_request(prompt)
movie = parse_response(response)

puts movie
The code is pretty straightforward: it makes a POST request to the completion endpoint of llama-server, which is running locally, passing the request parameters as JSON. I used the same prompt and the same data as in the previous post. The endpoint returns a lot of information in its response, including the prompt, the grammar and, most importantly, the generated output from the model in the content field. We call JSON.parse twice: first to parse the endpoint response, and second to parse the generated text, since we forced the model to produce JSON.
And if you run it, you should see a result like:
{
"title" => "Toy Story",
"director" => "John Lasseter",
"cast" => [
"Tom Hanks",
"Tim Allen",
"Annie Potts",
"John Ratzenberger",
"Don Rickles",
"Wallace Shawn",
"Jim Varney"
]
}
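The completion endpoint accepts more than just the prompt. As a sketch, assuming the parameter names documented in the llama-server README at the time of writing (n_predict and temperature), you could cap the output length and make the extraction more deterministic by extending the body built in send_request:

```ruby
request.body = {
  prompt: prompt,
  n_predict: 256,    # cap the number of generated tokens
  temperature: 0.2   # lower temperature for more deterministic extraction
}.to_json
```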
Bonus: Pure Shell Solution
As you may have noticed, we didn’t actually use many Ruby features; we just used Ruby’s standard net/http to perform the request and json to build the request body and parse the response.
So we can do the same with pure shell tools: curl and jq.
curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data @<(jq -n --arg prompt "$(cat prompt.txt)" '{"prompt": $prompt}') | jq '.content | fromjson'
Final notes
And this is how you can use local LLMs from Ruby (or almost any programming language) with the help of llama.cpp.
One practical note to keep in mind: running LLM inference as a separate server process might be a better option than running it in-process. This way, if something goes wrong, it won’t crash your main service.
Also, check out this video for a more hands-on demo.