Structured data extraction using LLMs

In this post you’ll learn how to extract structured data from HTML (or any other text) using LLMs.
LLM
llama.cpp
data extraction
json
Published

August 16, 2024

Update 2024-09-12. Added video

Anyone who has ever tried to write a parser to extract data from web pages or emails knows the pain. The way data is rendered on different pages can vary, and the structure may change over time, making it hard to maintain such parsers.

With the rise of Large Language Models (LLMs), it has become possible to feed HTML directly to an LLM and ask for the required data. However, this introduces a new problem: parsing the model’s response. Models can hallucinate, and the output structure can vary from run to run, which makes the response hard to parse reliably.
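To make the parsing problem concrete, here is a minimal sketch of the kind of best-effort helper people end up writing for unconstrained output. The replies are made-up examples rather than real model output, and extract_json_object is a hypothetical helper name.

```python
import json


def extract_json_object(reply):
    """Best-effort: parse the text between the first '{' and the last '}'."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None


# Made-up replies illustrating how unconstrained output can vary.
replies = [
    # Clean JSON: parses fine.
    '{"title": "Toy Story", "director": "John Lasseter"}',
    # JSON wrapped in chatter: needs the brace search above.
    'Sure! Here is the data:\n{"title": "Toy Story", "director": "John Lasseter"}\nAnything else?',
    # Truncated output: cannot be parsed at all.
    '{"title": "Toy Story", "director": "John Lasseter"',
    # Valid JSON, but with keys we did not ask for.
    '{"movie": {"name": "Toy Story"}, "directed_by": "John Lasseter"}',
]

parsed = [extract_json_object(r) for r in replies]
for item in parsed:
    print(item)
```

Note that the last reply parses without errors but still doesn’t have the fields we need, so syntactic validity alone isn’t enough.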

These issues can be partially resolved by carefully crafting prompts, e.g. by including expected formats or response examples in the prompt. Fine-tuning the model is another option, but it requires preparing a training dataset with correctly parsed examples.

Recently, OpenAI announced support for Structured Outputs in their API. While JSON response support has been available for about a year, the new API features allow stricter control over the output by enforcing a given schema.

However, such functionality is also available for open-source models, e.g. llama.cpp has been able to enforce a specific output format via a user-provided grammar for quite some time.

As a big fan of everything related to reproducible and local development, I really like the recent trend of running large language models locally, thanks to model quantization techniques and tools like llama.cpp. It allows you to quickly hack your way through wherever you are using just your laptop.

Note that in this post I’ll explore solutions that also support Apple Silicon, and can be run on recent Mac machines.

The Problem

OK, as mentioned before, we’ll focus on using LLMs for extracting data from HTML. Here I’ll use part of the HTML from the Wikipedia article about the movie Toy Story: the movie details card (infobox), usually shown on the right side of the page.

Below is an excerpt from the HTML that I’ll use in the demo:

data.html
<table class="infobox vevent"><tbody><tr><th colspan="2" class="infobox-above summary" style="font-size: 125%; font-style: italic;">Toy Story</th></tr><tr><td colspan="2" class="infobox-image"><span class="mw-default-size" typeof="mw:File/Frameless"><a href="/wiki/File:Toy_Story.jpg" class="mw-file-description" title="The poster features Woody anxiously holding onto Buzz Lightyear as he flies into Andy's room. Below them sitting on the bed are Bo Peep, Mr. Potato Head, Troll, Hamm, Slinky, Sergeant, and Rex. In the lower right center of the image is the film's title. The background shows the cloud wallpaper featured in the bedroom."><img alt="The poster features Woody anxiously holding onto Buzz Lightyear as he flies into Andy's room. Below them sitting on the bed are Bo Peep, Mr. Potato Head, Troll, Hamm, Slinky, Sergeant, and Rex. In the lower right center of the image is the film's title. The background shows the cloud wallpaper featured in the bedroom." src="//upload.wikimedia.org/wikipedia/en/thumb/1/13/Toy_Story.jpg/220px-Toy_Story.jpg" decoding="async" width="220" height="328" class="mw-file-element" srcset="//upload.wikimedia.org/wikipedia/en/1/13/Toy_Story.jpg 1.5x" data-file-width="250" data-file-height="373"></a></span><div class="infobox-caption">Theatrical release poster</div></td></tr><tr><th scope="row" class="infobox-label" style="white-space: nowrap; padding-right: 0.65em;">Directed by</th><td class="infobox-data"><a href="/wiki/John_Lasseter" title="John Lasseter">John Lasseter</a></td></tr><tr><th scope="row" class="infobox-label" style="white-space: nowrap; padding-right: 0.65em;">Screenplay by</th><td class="infobox-data"><style data-mw-deduplicate="TemplateStyles:r1126788409">.mw-parser-output .plainlist ol,.mw-parser-output .plainlist ul{line-height:inherit;list-style:none;margin:0;padding:0}.mw-parser-output .plainlist ol li,.mw-parser-output .plainlist ul li{margin-bottom:0}</style><div class="plainlist">
<ul><li><a href="/wiki/Joss_Whedon" title="Joss Whedon">Joss Whedon</a></li>
<li><a href="/wiki/Andrew_Stanton" title="Andrew Stanton">Andrew Stanton</a></li>
<li><a href="/wiki/Joel_Cohen_(writer)" title="Joel Cohen (writer)">Joel Cohen</a></li>
<li><a href="/wiki/Alec_Sokolow" title="Alec Sokolow">Alec Sokolow</a></li></ul>
</div></td></tr><tr><th scope="row" class="infobox-label" style="white-space: nowrap; padding-right: 0.65em;">Story by</th><td class="infobox-data"><link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1126788409"><div class="plainlist">
<ul><li>John Lasseter</li>
<li><a href="/wiki/Pete_Docter" title="Pete Docter">Pete Docter</a></li>
<li>Andrew Stanton</li>
<li><a href="/wiki/Joe_Ranft" title="Joe Ranft">Joe Ranft</a></li></ul>
</div></td></tr><tr><th scope="row" class="infobox-label" style="white-space: nowrap; padding-right: 0.65em;">Produced by</th><td class="infobox-data"><link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1126788409"><div class="plainlist">
<ul><li><a href="/wiki/Bonnie_Arnold" title="Bonnie Arnold">Bonnie Arnold</a></li>
<li><a href="/wiki/Ralph_Guggenheim" title="Ralph Guggenheim">Ralph Guggenheim</a></li></ul>
</div></td></tr><tr><th scope="row" class="infobox-label" style="white-space: nowrap; padding-right: 0.65em;">Starring</th><td class="infobox-data"><link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1126788409"><div class="plainlist">
<ul><li><a href="/wiki/Tom_Hanks" title="Tom Hanks">Tom Hanks</a></li>
<li><a href="/wiki/Tim_Allen" title="Tim Allen">Tim Allen</a></li>
<li><a href="/wiki/Annie_Potts" title="Annie Potts">Annie Potts</a></li>
<li><a href="/wiki/John_Ratzenberger" title="John Ratzenberger">John Ratzenberger</a></li>
<li><a href="/wiki/Don_Rickles" title="Don Rickles">Don Rickles</a></li>
<li><a href="/wiki/Wallace_Shawn" title="Wallace Shawn">Wallace Shawn</a></li>
<li><a href="/wiki/Jim_Varney" title="Jim Varney">Jim Varney</a></li></ul>
</div></td></tr><tr><th scope="row" class="infobox-label" style="white-space: nowrap; padding-right: 0.65em;">Edited by</th><td class="infobox-data"><link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1126788409"><div class="plainlist">
<ul><li>Robert Gordon</li>
<li><a href="/wiki/Lee_Unkrich" title="Lee Unkrich">Lee Unkrich</a></li></ul>
</div></td></tr><tr><th scope="row" class="infobox-label" style="white-space: nowrap; padding-right: 0.65em;">Music by</th><td class="infobox-data"><a href="/wiki/Randy_Newman" title="Randy Newman">Randy Newman</a></td></tr><tr><th scope="row" class="infobox-label" style="white-space: nowrap; padding-right: 0.65em;"><div style="display: inline-block; line-height: 1.2em; padding: .1em 0;">Production<br>company</div></th><td class="infobox-data"><div style="vertical-align: middle;"><a href="/wiki/Pixar_Animation_Studios" class="mw-redirect" title="Pixar Animation Studios">Pixar Animation Studios</a></div></td></tr><tr><th scope="row" class="infobox-label" style="white-space: nowrap; padding-right: 0.65em;">Distributed by</th><td class="infobox-data"><a href="/wiki/Buena_Vista_Pictures_Distribution" class="mw-redirect" title="Buena Vista Pictures Distribution">Buena Vista Pictures Distribution</a><sup id="cite_ref-Disney_1-0" class="reference"><a href="#cite_note-Disney-1"><span class="cite-bracket">[</span>a<span class="cite-bracket">]</span></a></sup></td></tr><tr><th scope="row" class="infobox-label" style="white-space: nowrap; padding-right: 0.65em;"><div style="display: inline-block; line-height: 1.2em; padding: .1em 0; white-space: normal;">Release dates</div></th><td class="infobox-data"><link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1126788409"><div class="plainlist film-date">
<ul><li>November&nbsp;19,&nbsp;1995<span style="display:none">&nbsp;(<span class="bday dtstart published updated itvstart">1995-11-19</span>)</span> (<a href="/wiki/El_Capitan_Theatre" title="El Capitan Theatre">El Capitan Theatre</a>)</li>
<li>November&nbsp;22,&nbsp;1995<span style="display:none">&nbsp;(<span class="bday dtstart published updated itvstart">1995-11-22</span>)</span> (United States)</li></ul>
</div></td></tr><tr><th scope="row" class="infobox-label" style="white-space: nowrap; padding-right: 0.65em;"><div style="display: inline-block; line-height: 1.2em; padding: .1em 0; white-space: normal;">Running time</div></th><td class="infobox-data">81 minutes<sup id="cite_ref-Runtime_2-0" class="reference"><a href="#cite_note-Runtime-2"><span class="cite-bracket">[</span>1<span class="cite-bracket">]</span></a></sup></td></tr><tr><th scope="row" class="infobox-label" style="white-space: nowrap; padding-right: 0.65em;">Country</th><td class="infobox-data">United States</td></tr><tr><th scope="row" class="infobox-label" style="white-space: nowrap; padding-right: 0.65em;">Language</th><td class="infobox-data">English</td></tr><tr><th scope="row" class="infobox-label" style="white-space: nowrap; padding-right: 0.65em;">Budget</th><td class="infobox-data">$30&nbsp;million<sup id="cite_ref-Numbers_3-0" class="reference"><a href="#cite_note-Numbers-3"><span class="cite-bracket">[</span>2<span class="cite-bracket">]</span></a></sup></td></tr><tr><th scope="row" class="infobox-label" style="white-space: nowrap; padding-right: 0.65em;">Box office</th><td class="infobox-data">$394.4&nbsp;million<sup id="cite_ref-BOXMOJO_4-0" class="reference"><a href="#cite_note-BOXMOJO-4"><span class="cite-bracket">[</span>3<span class="cite-bracket">]</span></a></sup></td></tr></tbody></table>

As you can see, this HTML is a bit messy: it contains inline styles, a mix of table and list formatting, and a lot of unnecessary data, like the alt text for the poster.
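For comparison, a hand-written parser for just these three fields might look like the sketch below. It uses only the standard library and runs on a trimmed-down copy of the infobox (SNIPPET is an abbreviated stand-in, not the full data.html); InfoboxParser is a hypothetical name.

```python
from html.parser import HTMLParser

# Abbreviated stand-in for data.html: same structure, fewer rows and attributes.
SNIPPET = """
<table class="infobox vevent"><tbody>
<tr><th colspan="2" class="infobox-above summary">Toy Story</th></tr>
<tr><th scope="row" class="infobox-label">Directed by</th>
<td class="infobox-data"><a href="/wiki/John_Lasseter">John Lasseter</a></td></tr>
<tr><th scope="row" class="infobox-label">Starring</th>
<td class="infobox-data"><div class="plainlist">
<ul><li><a href="/wiki/Tom_Hanks">Tom Hanks</a></li>
<li><a href="/wiki/Tim_Allen">Tim Allen</a></li></ul>
</div></td></tr>
</tbody></table>
"""


class InfoboxParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = None
        self.rows = {}          # row label -> list of text fragments
        self._cell = None       # "th" or "td" while inside a cell
        self._is_title = False  # True while inside the title header cell
        self._label = None      # label of the row being parsed
        self._buf = []          # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._cell = tag
            self._buf = []
            classes = dict(attrs).get("class") or ""
            # Hard-coded class name: breaks if the page markup changes.
            self._is_title = "infobox-above" in classes

    def handle_data(self, data):
        if self._cell and data.strip():
            self._buf.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "th" and self._cell == "th":
            text = " ".join(self._buf)
            if self._is_title:
                self.title = text
            else:
                self._label = text
            self._cell = None
        elif tag == "td" and self._cell == "td":
            if self._label:
                self.rows[self._label] = self._buf
            self._cell = None


parser = InfoboxParser()
parser.feed(SNIPPET)
parser.close()

result = {
    "title": parser.title,
    # Hard-coded labels: also break if the wording on the page changes.
    "director": parser.rows["Directed by"][0],
    "cast": parser.rows["Starring"],
}
print(result)
```

Even this small version depends on specific class names and row labels. On the full page it would need more special cases, e.g. the inline style element inside the "Screenplay by" cell, whose CSS would be collected as text by this parser.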

Let’s use LLMs to extract title, director and cast members.

Tools

There are a number of options for structured and guided generation, e.g. the already mentioned built-in support for grammars in llama.cpp.

In the Python ecosystem, we’ll have a look at two packages: lm-format-enforcer (LMFE) and outlines.

Before starting

Before we start with those libraries, let’s define some common elements that we’ll use in both cases.

First let’s define a simple prompt, which we’ll use in all examples:

data = open("data.html", "r").read()

prompt = """
Parse following html and give me result as json object containing title, director, cast (as an array of actors).
here is html:
```html
{data}
```
give me json only
""".format(data=data).strip()

Before moving to structured generation, go ahead and try running this prompt as is with any model, and check the results.

Next, let’s define our data model using pydantic, which makes it easy to define a schema declaratively.

from typing import List
from pydantic import BaseModel

class MovieData(BaseModel):
    title: str
    director: str
    cast: List[str]

Once the data class is defined, we can get a JSON schema by calling MovieData.model_json_schema(), which returns a dict like:

{
  'properties': {
    'title': {'title': 'Title', 'type': 'string'},
    'director': {'title': 'Director', 'type': 'string'},
    'cast': {
      'items': {'type': 'string'},
      'title': 'Cast', 'type': 'array'
    }
  },
  'required': ['title', 'director', 'cast'],
  'title': 'MovieData',
  'type': 'object'
}
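To show what this schema buys us, here is a minimal standard-library sketch that checks a model reply against the dict above. It only understands the two types used in this schema (strings and arrays of strings), and check_against_schema is a hypothetical helper; in practice you would rather let pydantic do the work, e.g. with MovieData.model_validate_json(reply).

```python
import json

# Copied from the MovieData.model_json_schema() output above.
SCHEMA = {
    "properties": {
        "title": {"title": "Title", "type": "string"},
        "director": {"title": "Director", "type": "string"},
        "cast": {"items": {"type": "string"}, "title": "Cast", "type": "array"},
    },
    "required": ["title", "director", "cast"],
    "title": "MovieData",
    "type": "object",
}


def check_against_schema(text, schema):
    """Return a list of problems; an empty list means the reply fits the schema.

    Simplified on purpose: handles only 'string' and 'array of strings'.
    """
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(obj, dict):
        return ["top-level value is not an object"]

    problems = []
    for key in schema["required"]:
        if key not in obj:
            problems.append(f"missing field: {key}")
    for key, spec in schema["properties"].items():
        if key not in obj:
            continue
        value = obj[key]
        if spec["type"] == "string" and not isinstance(value, str):
            problems.append(f"{key} should be a string")
        elif spec["type"] == "array":
            if not isinstance(value, list):
                problems.append(f"{key} should be an array")
            elif not all(isinstance(v, str) for v in value):
                problems.append(f"{key} should contain only strings")
    return problems


# Made-up replies for illustration.
good = '{"title": "Toy Story", "director": "John Lasseter", "cast": ["Tom Hanks"]}'
bad = '{"title": "Toy Story", "cast": "Tom Hanks"}'

problems_good = check_against_schema(good, SCHEMA)
problems_bad = check_against_schema(bad, SCHEMA)
print(problems_good)
print(problems_bad)
```

A check like this only detects a bad reply after the fact. The libraries below go one step further and prevent the model from producing a reply that violates the schema in the first place.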

lm-format-enforcer

Let’s look at lm-format-enforcer first. It supports multiple inference backends, including llama.cpp via llama-cpp-python, which is the one we’ll use here.

Installation

If you’re using a Mac with Apple Silicon, you might want to enable Metal support for llama-cpp-python to speed up inference. To do so, install it with the following command:

CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

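For details, see the documentation. Note that newer llama.cpp releases name this flag GGML_METAL, so check which one your version expects.

For lm-format-enforcer there are no special tricks; simply run: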

pip install lm-format-enforcer

Usage

First let’s create a model:

from llama_cpp import Llama

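llm = Llama(
    model_path='/path/to/model.gguf',  # path to a local GGUF model file
    n_ctx=32768,  # context window in tokens; the HTML snippet is long
    n_gpu_layers=-1,  # offload all layers to the GPU
)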

Here I set the path to the model directly, since I already have it downloaded locally. I used this 8-bit Llama 3.1 8B Instruct for the experiments.

Next, let’s create a logits processor to filter the tokens generated by the model. That’s basically the key piece of structured generation.

from llama_cpp import LogitsProcessorList
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.llamacpp import build_token_enforcer_tokenizer_data, build_llamacpp_logits_processor

tokenizer_data = build_token_enforcer_tokenizer_data(llm)
character_level_parser = JsonSchemaParser(MovieData.model_json_schema())

logits_processors = LogitsProcessorList([build_llamacpp_logits_processor(tokenizer_data, character_level_parser)])

And now we can use it to extract the data:

output = llm(prompt, logits_processor=logits_processors, max_tokens=2048)
result = output['choices'][0]['text']
print(result)

You should get something like this (the output is formatted for readability):

{
  "title": "Toy Story",
  "director": "John Lasseter",
  "cast": [
    "Tom Hanks",
    "Tim Allen",
    "Annie Potts",
    "John Ratzenberger",
    "Don Rickles",
    "Wallace Shawn",
    "Jim Varney"
  ]
}

Outlines

Now let’s check out outlines. While outlines supports llama.cpp too, I encountered a bug that causes runtime errors, so I couldn’t get it to work at the time of writing.

The good news is that outlines supports mlx as a backend, so we’ll use it instead.

Don’t forget to install it first (the mlx integration in outlines also relies on the mlx-lm package):

pip install mlx mlx-lm

Outlines integrates nicely with pydantic, so we can use the same MovieData class definition. We’ll also use the same 8-bit Llama 3.1 8B Instruct model (with the same prompt) for consistency.

from outlines import models
from outlines import generate

model = models.mlxlm("mlx-community/Meta-Llama-3.1-8B-Instruct-8bit")

generator = generate.json(model, MovieData)
result = generator(prompt)

The result will already be a MovieData object:

MovieData(title='Toy Story', director='John Lasseter', cast=['Tom Hanks', 'Tim Allen', 'Annie Potts', 'John Ratzenberger', 'Don Rickles', 'Wallace Shawn', 'Jim Varney'])

The only downside is that the generator object is not reusable: if you call generator(prompt) twice in a row, it raises an error. The good news is that outlines caches its state machines internally, so creating a new generator for the same schema each time doesn’t add much overhead.

Ok, but what’s inside?

Guided generation is an interesting topic on its own, and probably deserves a dedicated post.

In short, at each generation step we can adjust the output token probabilities before sampling a token: we keep only the tokens that the grammar allows at this step and set the probabilities of all other tokens to zero.
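A toy sketch of a single such step is shown below. The vocabulary, the logits and the set of allowed tokens are all made up for illustration; real libraries derive the allowed set from the schema and work with the model’s actual tokenizer. Setting a logit to minus infinity is equivalent to giving the token zero probability after the softmax.

```python
import math


def mask_logits(logits, allowed_ids):
    """Keep logits of allowed tokens, set all others to -inf."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def softmax(logits):
    """Convert logits to probabilities; assumes at least one finite logit."""
    finite_max = max(x for x in logits if x != -math.inf)
    exps = [0.0 if x == -math.inf else math.exp(x - finite_max) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary and made-up logits for the first generated token.
vocab = ["{", " ", "Sure", "title", "}"]
logits = [1.0, 0.5, 3.0, 0.2, -1.0]

# Without constraints the model would start with "Sure".
unconstrained = vocab[logits.index(max(logits))]

# Assumption for this toy grammar: a JSON object may only start
# with "{" or with whitespace.
allowed_ids = [0, 1]

probs = softmax(mask_logits(logits, allowed_ids))
constrained = vocab[probs.index(max(probs))]  # greedy choice among allowed tokens

print(unconstrained, constrained)
print(probs)
```

After each chosen token the allowed set is recomputed from the grammar state, and the same masking is applied again at the next step.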

Final Notes

In conclusion, I’d like to point out that both methods worked well in the tests I ran. In terms of performance, I didn’t notice much difference between the two approaches based on basic wall-clock benchmarking.

However, LMFE seems to be more feature-rich, which makes it the better choice if you don’t plan to run generation on Apple Silicon (where outlines can use mlx), given the issues with llama.cpp in outlines.

If you’re interested in exploring this topic further, run your own experiments with more complex data, e.g. how well does it work with long documents?

One topic that we left out is the native llama.cpp grammar support. It might be a viable approach for data extraction, and worth checking out too, but it could be a story for another time.

One more note on the use of JSON. Though it’s the default format for internet communication, its suitability for LLM-generated structured data is questionable. E.g. when the GPT-3 API became available, it was better at generating YAML documents than JSON. This isn’t surprising, since YAML is more readable and text-oriented. On average, YAML-formatted data contains fewer tokens than the same data in JSON, so generating it takes less compute. Although JSON can be minified into a single line by removing all indentation and whitespace, it still doesn’t always match YAML’s efficiency.
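As a small illustration of the minification mentioned above, the sketch below serializes the same data with and without whitespace using only the standard library. It compares character counts, which are only a rough proxy for token counts; the actual numbers depend on the tokenizer of the model you use.

```python
import json

movie = {
    "title": "Toy Story",
    "director": "John Lasseter",
    "cast": ["Tom Hanks", "Tim Allen"],
}

# Human-readable form with indentation.
pretty = json.dumps(movie, indent=2)

# Minified form: no indentation, no spaces after ',' and ':'.
compact = json.dumps(movie, separators=(",", ":"))

print(compact)
print(len(pretty), len(compact))
```

Both strings decode to the same object, so minification is purely a matter of how many characters (and therefore tokens) the model has to generate.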

Bonus — Video

Almost the same, but as a video.