Some people keep insisting that AI (a term I try to avoid; here I'm calling them VLMs) cannot be a good source of alt text. My belief is that it can: even if machine-written alt text isn't as good as human-written alt text, it can be made better than alt text that never gets written at all, or form a good base for someone to edit. Both of those would be a net positive. I personally want to write more alt text but often don't, and felt that something like this could give me a base to just read and confirm, or edit and push forward.
I first tested this on witchsky.app, where I wasn't impressed: the results were too distracted and too verbose. I suspected this was a prompt issue, so I added a feature to it (not upstreamed, but there's an open issue) to change the prompt.
My belief was that VLM-generated alt text has so far been bad not because it has to be bad, but because not enough thought has been put into it. So I did a wee bit of experimentation with vision-language models, some iterative tweaking, and (spoiler alert) I think it came out pretty good for an hour's worth of effort. This isn't meant as a "hey, it's perfect", but rather as a "hey, there's promise here as accessibility tech".
(I also aimed to use open weight models for this experiment, both for cost and because I believe this can be done fully offline in the future.)
FYI
On openrouter I have excluded providers that may train on inputs (you can set that here), and my training settings are off on claude as well (you can set that here). So the sample data is not going to be trained on just because of this experiment.
ground truth collection
I had claude code write a script to collect images with alt text from the bluesky firehose (a websocket that just gives you all the events, basically), filtering out very short alt text, nsfw posts, and selfies, and keeping only English-language posts.
I let that script collect ~120 pictures, deleted ~60 of them that had bad alt text, and was left with 68 pictures. Some that I removed were hyper-specific (they required knowing character names), most simply had bad alt text (spam or just "a post of x" type things), and one was untagged porn.
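As a rough sketch of the kind of filtering the collection script does (the field paths follow the shape of a bluesky `app.bsky.feed.post` record as delivered over the firehose; the exact length cutoff here is my assumption, not the script's):

```python
# Hypothetical filter mirroring the collection criteria described above:
# English posts with image embeds whose alt text is long enough, skipping
# self-labelled (e.g. adult content) posts.
MIN_ALT_LEN = 40  # assumption: the "very short alt text" cutoff

def wanted_images(record: dict) -> list[dict]:
    """Return the image embeds worth keeping from one post record."""
    if "en" not in record.get("langs", []):
        return []
    if record.get("labels"):  # self-labelled posts (nsfw etc.)
        return []
    embed = record.get("embed", {})
    if embed.get("$type") != "app.bsky.embed.images":
        return []
    return [img for img in embed.get("images", [])
            if len(img.get("alt", "").strip()) >= MIN_ALT_LEN]
```

The selfie filter would need a vision pass and isn't shown here.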
prompt and model tweaking
I gave claude the list of most popular image-input models on openrouter, told it to exclude free models and those by openai/anthropic/google/xai.
It tested mistralai/mistral-small-2603 (mistral small 4), qwen/qwen3.5-122b-a10b, qwen/qwen3.5-27b, moonshotai/kimi-k2.5, bytedance-seed/seed-2.0-lite and qwen/qwen3.5-35b-a3b. I also had it test gemini-2.5-flash-lite as that’s what witchsky.app uses.
I contemplated using an LLM judge versus more traditional metrics, but since we're working with fuzzy values, I stuck with an LLM judge.
For the LLM judge I picked mistralai/mistral-small-2603, for no reason other than "I like mistral". The metrics to rate were Accuracy, Completeness, Conciseness, and Accessibility, each out of 5.
The full prompt for the judge
System prompt:
You are an expert accessibility evaluator. You will be given a ground truth alt text (written by a human) and a machine-generated alt text for the same image. Score the generated alt text on these criteria, each from 1 (poor) to 5 (excellent):
- Accuracy - Does the generated text correctly describe the image content? Use the ground truth as reference for what the image contains. Penalize hallucinated or incorrect details.
- Completeness - Does it capture the important elements present in the ground truth? It doesn’t need to match exactly, but should cover the key content.
- Conciseness - Is it an appropriate length? Alt text should be informative but not excessively verbose. 1-2 sentences for simple images, up to a short paragraph for complex ones.
- Accessibility - Would this be genuinely useful to a screen reader user? Does it convey meaning and context, not just list objects?
Respond with ONLY this JSON, no other text:
{"accuracy": <1-5>, "completeness": <1-5>, "conciseness": <1-5>, "accessibility": <1-5>, "reasoning": "<1-2 sentences>"}

User message:

Ground truth alt text:
{ground_truth}

Generated alt text:
{generated}

Score the generated alt text.
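A minimal sketch of how a judge reply can be parsed and the per-criterion scores averaged across the test set (`parse_judge` and `aggregate` are hypothetical helper names, not from the actual test harness):

```python
import json
import statistics

CRITERIA = ("accuracy", "completeness", "conciseness", "accessibility")

def parse_judge(reply: str) -> dict:
    """Parse the judge's JSON-only reply, tolerating stray text or fences."""
    start, end = reply.find("{"), reply.rfind("}")
    scores = json.loads(reply[start:end + 1])
    if not all(1 <= scores[c] <= 5 for c in CRITERIA):
        raise ValueError(f"score out of range: {scores}")
    return scores

def aggregate(all_scores: list[dict]) -> dict:
    """Mean per criterion, plus an overall mean across the four criteria."""
    means = {c: statistics.mean(s[c] for s in all_scores) for c in CRITERIA}
    means["overall"] = statistics.mean(means[c] for c in CRITERIA)
    return means
```

Even with a "JSON only" instruction, models occasionally wrap the object in prose or code fences, hence the defensive brace-slicing.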
Overall, qwen-122b and 27b were pretty close, but 122b did better and was almost identical in cost on openrouter, so we proceeded with that (I suspect the same prompt on a locally hosted 27b would also be pretty good). We then tweaked the prompt to improve its weakest scores, which got us to a prompt that strikes a very good balance of descriptiveness, minimal hallucinations, and length. The aim was to get good results both for regular pictures and for pictures of text.
My current recommendation is qwen/qwen3.5-122b-a10b with a prompt of “Write alt text for this image. Be concise — 1-2 sentences for simple images. If the image contains readable text, transcribe it rather than describing it. Only describe what you can clearly see; do not guess at names or details.”.
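For illustration, a call through openrouter's OpenAI-compatible chat completions endpoint might look like this (a sketch, not the code I actually ran; error handling omitted):

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
PROMPT = ("Write alt text for this image. Be concise — 1-2 sentences for "
          "simple images. If the image contains readable text, transcribe it "
          "rather than describing it. Only describe what you can clearly "
          "see; do not guess at names or details.")

def alt_text_request(image_url: str,
                     model: str = "qwen/qwen3.5-122b-a10b") -> dict:
    """Build the chat-completions payload for one image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

def generate_alt_text(image_url: str, api_key: str) -> str:
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(alt_text_request(image_url)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```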
That got a score of Accuracy: 4.03, Completeness: 3.99, Conciseness: 4.18, Accessibility: 3.99, for an overall score of 4.04 (out of 5). All that experimentation cost me $3.359 of API fees on openrouter (that’s 3 dollars, not 3359 dollars).
Each alt text I generate costs roughly $0.0024: $5 of credits pays for roughly 2083 alt texts. That's a lot!
Due to the nature of open weight models, you could also run this model locally on a server or a computer. As open weight models get better, we might be able to do this fully locally on the device as well, which is pretty exciting. I’m interested in trying this with gemma 4 soon once llama.cpp fixes the ROCm issues, especially with smaller models. Being able to do this on-device fully offline would be great.
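To sketch what the local path could look like: llama.cpp's llama-server exposes an OpenAI-compatible endpoint, so the same request shape can simply be pointed at localhost (the port, flags, and file names here are assumptions; check the llama.cpp docs for your build):

```python
import json
import urllib.request

# Assumes a vision-capable model served locally, e.g. roughly:
#   llama-server -m model.gguf --mmproj mmproj.gguf --port 8080
LOCAL_URL = "http://localhost:8080/v1/chat/completions"

def build_payload(image_url: str, prompt: str) -> dict:
    """Same OpenAI-style multimodal message shape as the hosted version."""
    return {"messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]}

def local_alt_text(image_url: str, prompt: str) -> str:
    req = urllib.request.Request(
        LOCAL_URL,
        data=json.dumps(build_payload(image_url, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

No API key needed, and nothing leaves the machine.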
result examples
Overall I’m pretty happy with it. I think each example I’ve seen from it so far would only require minimal tweaking to be useful. I do genuinely think that if bluesky attached these by default, it’d be significantly better than nothing.
These screenshots are from cope.works, a friend’s bluesky client where we also added support for auto-generated alt text.

An open textbook displays a lesson titled “The French Revolution Begins” on the left page, while the right page features the large heading “BUT WHAT DOES SCOTT THINK?” above a photo of a smiling young man wearing glasses.
This is pretty accurate, except it doesn't guess that it's Scott the Woz (as the prompt instructs). I'd just change "a smiling young man wearing glasses" to "Scott the Woz, a Nintendo YouTuber".

A stacked area chart displays values in Euros from January 5 to March 30, with a vertical axis ranging from 0.00EUR to 3000.00EUR. Multiple colored bands are stacked vertically, fluctuating slightly between a total value of 2000EUR and 2600EUR.
Here it primarily lacks context, so it can only describe the chart generically, but it's accurate. Well, almost: the total is between 2300-2400EUR and 2600EUR, not 2000EUR, so if I were using this I'd manually tweak at least that.

A one-gallon glass jug of Tabasco Habanero Pepper Sauce. The product listing below shows a price of 89.95EUR, a 4.5 star rating with 2,393 reviews, and delivery dates of 10-11 Apr.
Accurate! It even correctly deduced that it’s a gallon from the liter price.

The image features the text “silence,” followed by “checkmark” in a black box. Below, a white spider shoots blue lasers from its eyes at another spider, causing it to explode.
Beautiful. The spider isn't exactly below the text, but it mostly is, so I'll let that slide.

A settings screen for the cope.works app displaying toggles for hidden replies and verifiers. The ‘AI Alt Text’ section lists the current model as ‘qwen/qwen3.5-122b-a10b’ with the options to set the API key and change the prompt.
Accurate!

A lone person in a white shirt stands in the middle of a wide road, facing a column of four military tanks lined up one behind the other. In the foreground, a street lamp with multiple white globes is visible.
As this is Qwen (a Chinese model), which tends to get uppity when you ask questions that aren't quite in the interest of the CCP, I tested it with tank man. Happy to say it did great! I'd maybe add "tank man" to the caption, and remove the part about the street lamp (I never even noticed it was there).