Mastering llms.txt: Advanced Next.js 15 Implementation
Unlock richer metadata and automation for your Next.js and Sanity CMS projects with llms.txt.

Last month I shared why llms.txt matters in llms.txt Blueprint: Give AI Crawlers Instant Access and walked through the baseline pipeline in Implementing llms.txt in Next.js 15 with Sanity CMS. That foundation gave every article a Markdown twin, a discoverable llms.txt manifest, and sitemap/robots alignment. After shipping it in production I still ran into a familiar problem: LLM crawlers could find my Markdown, but they had no structured metadata, no machine index, and no automated way to stay fresh. This follow-up is the second phase—everything I added to make the pipeline resilient, self-updating, and consumable by RAG systems without manual intervention.
I’ll assume you already have the first phase in place (Sanity posts with markdownContent, cached helpers, /blog/md/[slug], and /llms.txt). Now we’re going deeper: structured LLM metadata in Sanity, richer Markdown exports, machine-readable manifests, and tooling that validates or backfills the data automatically.
Extend Sanity with LLM-Focused Metadata
The first upgrade lives in the CMS. I needed every post to declare its intent, goal, difficulty, prereqs, outputs, and machine-friendly summaries. Instead of hard-coding that later, I pushed it into the schema so authors can manage it where the content lives.
// File: src/lib/sanity/schemaTypes/postType.ts
// These defineField calls sit inside the post schema's fields array.
defineField({
  name: 'llmIntent',
  title: 'Primary LLM Intent',
  type: 'string',
  options: {
    list: [
      { title: 'How-To', value: 'how-to' },
      { title: 'Reference', value: 'reference' },
      { title: 'Case Study', value: 'case-study' },
      { title: 'Strategy', value: 'strategy' },
      { title: 'Release Notes', value: 'release-notes' },
      { title: 'Troubleshooting', value: 'troubleshooting' },
    ],
  },
  description: 'Classify the article so LLM agents understand the content shape.',
}),
defineField({
  name: 'llmSummaryTriples',
  title: 'LLM Summary Triples',
  type: 'array',
  of: [{
    type: 'object',
    fields: [
      defineField({ name: 'subject', type: 'string', validation: (Rule) => Rule.required() }),
      defineField({ name: 'predicate', type: 'string', validation: (Rule) => Rule.required() }),
      defineField({ name: 'object', type: 'string', validation: (Rule) => Rule.required() }),
    ],
  }],
  description: 'Structured key facts in (subject, predicate, object) form for deterministic extraction.',
}),
defineField({
  name: 'llmApiPrompts',
  title: 'LLM API Prompts',
  type: 'array',
  of: [{
    type: 'object',
    fields: [
      defineField({ name: 'question', type: 'text', rows: 2, validation: (Rule) => Rule.required() }),
      defineField({ name: 'answer', type: 'text', rows: 4, validation: (Rule) => Rule.required() }),
      defineField({ name: 'confidence', type: 'number', validation: (Rule) => Rule.min(0).max(1) }),
    ],
  }],
  description: 'Pre-baked Q/A snippets agents can return when queries align with this post.',
}),
I repeated the pattern for audienceLevel, frameworkVersions, contentStatus, validatedAt, llmGoal, llmPrerequisites, and llmOutputs. Each field has validation rules and editor-facing descriptions so authors know how to fill them out. After running pnpm generate:types, every query is type-safe and ready for consumption.
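For reference, the repeated pattern looks roughly like this for two of those fields. This is a sketch: the field names match the list above, but the titles, option lists, and descriptions are approximations rather than the exact schema.
// File: src/lib/sanity/schemaTypes/postType.ts (sketch of the repeated pattern; titles and options are approximations)
defineField({
  name: 'audienceLevel',
  title: 'Audience Level',
  type: 'string',
  options: {
    list: [
      { title: 'Beginner', value: 'beginner' },
      { title: 'Intermediate', value: 'intermediate' },
      { title: 'Advanced', value: 'advanced' },
    ],
  },
  description: 'Who the article is written for, so agents can match answers to reader skill.',
}),
defineField({
  name: 'frameworkVersions',
  title: 'Framework Versions',
  type: 'array',
  of: [{ type: 'string' }],
  description: 'Framework or runtime versions the article was validated against (e.g. "Next.js 15").',
}),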
Push Metadata Through the Cache Layer
With new schema fields in place, the Sanity cache helpers needed to expose them. Expanding the existing query keeps downstream routes and manifests in sync with a single fetch.
// File: src/lib/sanity/queries/queries.ts
export const MARKDOWN_POSTS_QUERY = defineQuery(`*[_type == "post" && defined(slug.current)] | order(publishedAt desc) {
  _id,
  title,
  slug,
  publishedAt,
  dateModified,
  _updatedAt,
  excerpt,
  keywords,
  audienceLevel,
  frameworkVersions,
  contentStatus,
  validatedAt,
  llmIntent,
  llmGoal,
  llmPrerequisites,
  llmOutputs,
  llmSummaryTriples[]{
    subject,
    predicate,
    object
  },
  llmApiPrompts[]{
    question,
    answer,
    confidence
  },
  "hasMarkdown": defined(markdownContent) && markdownContent != "",
  "categories": categories[]->{
    title,
    slug,
    llmLabel,
    llmDescription
  },
  "primaryCategory": categories[0]->{
    title,
    slug,
    llmLabel,
    llmDescription
  },
  steps[]{
    name,
    text
  }
}`)
The cached helper in src/lib/sanity/post-cache.ts now returns MarkdownPostSummary objects with every LLM field, so any route can call getMarkdownPosts() and get the enriched dataset from cache instead of hitting Sanity repeatedly.
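If you are rebuilding this from scratch, a minimal version of that helper can be as small as the following. This is a sketch: the import paths, the generated type location, and the revalidation options are assumptions, and the real helper may add tags or error handling.
// File: src/lib/sanity/post-cache.ts (minimal sketch; import paths and options are assumptions)
import { unstable_cache } from 'next/cache'
import { client } from '@/lib/sanity/client'
import { MARKDOWN_POSTS_QUERY } from '@/lib/sanity/queries/queries'
import type { MarkdownPostSummary } from '@/lib/sanity/types'

export const getMarkdownPosts = unstable_cache(
  async (): Promise<MarkdownPostSummary[]> =>
    client.fetch<MarkdownPostSummary[]>(MARKDOWN_POSTS_QUERY),
  ['markdown-posts'],
  { revalidate: 3600, tags: ['post'] }, // hourly revalidation; the tag lets a webhook bust the cache
)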
Emit Rich Front Matter and Machine Tags in Markdown
Phase one simply echoed the markdown body. The upgraded builder creates a complete document for agents—YAML front matter, summary triples, goal/prereq sections, and a JSON payload for Q/A snippets.
// File: src/lib/llm/markdown.ts
export function buildMarkdownDocument(post: MarkdownReadyPost): string {
  // title, slug, the date strings, lines, summaryTriples, stepLines, machineTags, and the
  // other locals used below are all derived from `post` earlier in the function (elided here).
  const frontMatter = buildYamlFrontMatter({
    title,
    slug,
    published: publishedDate,
    updated: updatedDate,
    validated: validatedDate,
    categories: categoryLabels,
    tags: keywords,
    'llm-intent': intent,
    'audience-level': audienceLevel,
    'framework-versions': frameworkVersions,
    status: post.contentStatus,
    'llm-purpose': goalStatement,
    'llm-prereqs': prerequisites,
    'llm-outputs': llmOutputs,
  })
  lines.push(frontMatter, '')
  lines.push('**Summary Triples**')
  summaryTriples.forEach((triple) => lines.push(triple))
  lines.push('', '### {GOAL}', effectiveGoal, '', '### {PREREQS}')
  prerequisites.length
    ? prerequisites.forEach((item) => lines.push(`- ${sanitizeSingleLine(item)}`))
    : lines.push('- Familiarity with the concepts discussed in this article.')
  lines.push('', '### {STEPS}')
  stepLines.forEach((entry) => lines.push(entry))
  lines.push('')
  machineTags.forEach((tag) => lines.push(tag))
  lines.push('', `# ${title}`)
  // …rest of body and LLM response snippet…
}
The route at src/app/(non-intl)/blog/md/[slug]/route.ts now imports this builder and returns a fully annotated markdown file. Crawlers get context, triples, machine tags, and a JSON snippet without post-processing, and human readers still see the original Markdown body.
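For orientation, the overall shape of that route handler is roughly the following. This is a sketch: the helper and type import locations, the cache headers, and the 404 handling are assumptions, not the exact production route.
// File: src/app/(non-intl)/blog/md/[slug]/route.ts (sketch; imports, headers, and error handling are assumptions)
import { getPostBySlug } from '@/lib/sanity/post-cache'
import { buildMarkdownDocument, type MarkdownReadyPost } from '@/lib/llm/markdown'

export const revalidate = 3600

export async function GET(
  _request: Request,
  { params }: { params: Promise<{ slug: string }> }, // params is a Promise in Next.js 15 route handlers
) {
  const { slug } = await params
  const post = await getPostBySlug(slug)
  if (!post?.markdownContent) {
    return new Response('Not found', { status: 404 })
  }
  return new Response(buildMarkdownDocument(post as MarkdownReadyPost), {
    headers: {
      'Content-Type': 'text/markdown; charset=utf-8',
      'Cache-Control': `public, max-age=0, s-maxage=${revalidate}`,
    },
  })
}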
Publish Machine-Readable Manifests and Corpus Dumps
LLM ingest pipelines love having a single place to pull structured entries. I added two new routes that piggyback on the same cache helper.
// File: src/app/(non-intl)/blog/md/index.json/route.ts
export async function GET() {
  const posts = await getMarkdownPosts()
  const manifest = posts.map((post) => ({
    // slug, title, dates, categories, tags, and the other shorthand values below are
    // resolved from `post`; siteOrigin and revalidate are module-level constants (elided).
    slug,
    title,
    url: `${siteOrigin}/blog/md/${slug}`,
    publishedAt,
    updatedAt,
    validatedAt,
    status: post.contentStatus,
    intent: post.llmIntent,
    audienceLevel: post.audienceLevel ?? post.difficulty,
    goal: post.llmGoal,
    categories,
    tags,
    frameworkVersions,
    prerequisites,
    outputs,
  }))
  return Response.json(manifest, {
    headers: { 'Cache-Control': `public, max-age=0, s-maxage=${revalidate}` },
  })
}
// File: src/app/(non-intl)/llm/corpus.ndjson/route.ts
export async function GET() {
  const summaries = await getMarkdownPosts()
  const records: string[] = []
  for (const summary of summaries) {
    const slug = resolveSlugValue(summary.slug)
    if (!slug) continue
    const post = await getPostBySlug(slug)
    if (!post?.markdownContent) continue
    const markdown = buildMarkdownDocument(post as MarkdownReadyPost)
    records.push(JSON.stringify({
      slug,
      title: post.title,
      url: `${siteOrigin}/blog/md/${slug}`,
      intent: post.llmIntent,
      audienceLevel: post.audienceLevel ?? post.difficulty,
      status: post.contentStatus,
      publishedAt: formatIsoDate(post.publishedAt),
      updatedAt: formatIsoDate(post.dateModified ?? post._updatedAt ?? post.publishedAt),
      validatedAt: formatIsoDate(post.validatedAt),
      // categories, tags, frameworkVersions, prerequisites, outputs, summaryTriples,
      // and prompts are normalized from `post` above (derivations elided for brevity).
      categories,
      tags,
      frameworkVersions,
      prerequisites,
      outputs,
      goal: post.llmGoal,
      summaryTriples,
      responses: prompts,
      body: markdown,
    }))
  }
  return new Response(records.join('\n'), {
    headers: {
      'Content-Type': 'application/x-ndjson; charset=utf-8',
      'Cache-Control': `public, max-age=0, s-maxage=${revalidate}`,
    },
  })
}
The first route gives you a compact JSON manifest, while the second streams the full NDJSON corpus with Markdown included. Both are statically cached, so crawlers can slurp everything with a single request.
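On the consuming side, loading that corpus into a RAG pipeline is one fetch plus a line split. Here is a hypothetical client snippet (none of this lives in the site itself):
// Hypothetical consumer: pull the corpus and parse each NDJSON line into a record
const res = await fetch('https://buildwithmatija.com/llm/corpus.ndjson')
const text = await res.text()

const documents = text
  .split('\n')
  .filter(Boolean)
  .map((line) => JSON.parse(line) as { slug: string; title: string; body: string })

console.log(`Loaded ${documents.length} posts, first: ${documents[0]?.title}`)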
Expose Structured Service Specs
Product pages needed the same treatment. Instead of hardcoding pricing or model details, I centralized them and published JSON specs alongside the human pages.
// File: src/data/service-specs.ts
export const serviceSpecs = [
  {
    slug: 'web-app-development',
    title: 'Productized Web App Development',
    url: 'https://buildwithmatija.com/services/web-app-development',
    specPath: '/services/web-app-development/spec.json',
    summary: 'Fractional CTO partnership to scope, build, and operate complex web applications.',
    pricingModel: 'retainer',
    pricingNotes: 'Monthly partnership starting at €4.5k with minimum 4-week engagement.',
    engagementModel: 'Hands-on fractional CTO and engineering lead delivering sprint-based outcomes.',
    sla: 'Weekly roadmap reviews, 1 business day response times, production hotfix within 12 hours.',
    useCases: [
      'Ship investor-ready MVPs with production-quality foundations',
      'Stabilize or refactor aging Next.js/Node stacks',
      'Automate internal workflows with custom portals and APIs',
    ],
    deliverables: [
      'Technical architecture & deployment plan',
      'Production-ready Next.js/TypeScript implementation',
      'CI/CD automation with observability hooks',
      'Operational handbook for handover',
    ],
    techStack: ['Next.js 15', 'TypeScript', 'Prisma', 'PostgreSQL', 'Vercel', 'Sanity CMS'],
    contactCta: 'https://buildwithmatija.com/contact',
  },
  // …other services…
]
Each spec route simply wraps that payload with cache headers (src/app/(non-intl)/services/[slug]/spec.json/route.ts and src/app/(non-intl)/mvp/spec.json/route.ts). The main landing page pulls from the same dataset, so the human view and machine endpoint share one source of truth.
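A spec route in that style only needs a lookup and a cache header. This is a sketch; the revalidation window and the 404 response are assumptions.
// File: src/app/(non-intl)/services/[slug]/spec.json/route.ts (sketch; revalidation and 404 handling are assumptions)
import { serviceSpecs } from '@/data/service-specs'

export const revalidate = 86400

export async function GET(
  _request: Request,
  { params }: { params: Promise<{ slug: string }> },
) {
  const { slug } = await params
  const spec = serviceSpecs.find((entry) => entry.slug === slug)
  if (!spec) {
    return Response.json({ error: 'Unknown service' }, { status: 404 })
  }
  return Response.json(spec, {
    headers: { 'Cache-Control': `public, max-age=0, s-maxage=${revalidate}` },
  })
}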
Let Crawlers Know About the New Endpoints
robots.ts now whitelists the manifest, corpus, and spec routes for both generic bots and popular LLM crawlers.
// File: src/app/robots.ts
import type { MetadataRoute } from 'next'

const llmAgents = ['GPTBot', 'ClaudeBot', 'anthropic-ai', 'PerplexityBot', 'Googlebot']

export default function robots(): MetadataRoute.Robots {
  // baseUrl resolves the site origin (defined earlier in the file; elided here)
  return {
    sitemap: `${baseUrl}/sitemap.xml`,
    rules: [
      {
        userAgent: '*',
        allow: [
          '/',
          '/llms.txt',
          '/blog/md/index.json',
          '/llm/corpus.ndjson',
          '/services/web-app-development/spec.json',
          '/services/seo-friendly-websites/spec.json',
          '/services/single-purpose-tools/spec.json',
          '/mvp/spec.json',
        ],
        disallow: ['/studio', '/api/', '/wp-admin'],
      },
      ...llmAgents.map((userAgent) => ({
        userAgent,
        allow: [
          '/',
          '/llms.txt',
          '/blog/md/index.json',
          '/llm/corpus.ndjson',
          '/services/web-app-development/spec.json',
          '/services/seo-friendly-websites/spec.json',
          '/services/single-purpose-tools/spec.json',
          '/mvp/spec.json',
        ],
      })),
    ],
  }
}
Between this and the upgraded sitemap, every machine entry point is obvious: llms.txt, the JSON manifest, the NDJSON corpus, and the service specs.
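On the sitemap side, the machine endpoints are just extra entries appended to the existing generator. Here is a rough sketch; buildPostEntries, the baseUrl handling, and the change-frequency/priority values are placeholders, not the site's actual code.
// File: src/app/sitemap.ts (sketch; buildPostEntries and the frequency/priority values are placeholders)
import type { MetadataRoute } from 'next'

const baseUrl = 'https://buildwithmatija.com'

const machineRoutes = [
  '/llms.txt',
  '/blog/md/index.json',
  '/llm/corpus.ndjson',
  '/services/web-app-development/spec.json',
  '/mvp/spec.json',
]

export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const postEntries = await buildPostEntries() // existing per-post entries (hypothetical helper, elided)
  return [
    ...postEntries,
    ...machineRoutes.map((path) => ({
      url: `${baseUrl}${path}`,
      changeFrequency: 'daily' as const,
      priority: 0.4,
    })),
  ]
}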
Automate Backfill and Validation
The last mile was operational. Structured fields only help if they’re always filled in, so I added two scripts.
The validator fails the build when a post is missing intent, goal, triples, prompts, framework versions, prerequisites, or outputs.
// File: scripts/validate-llm-content.ts
// `issues` collects { slug, message } entries; ensureString and resolveSlugValue are
// small helpers defined earlier in the script (elided here).
const records = await client.fetch<MarkdownPostSummary[]>(MARKDOWN_POSTS_QUERY)
records.forEach((post) => {
  const slug = resolveSlugValue(post.slug) ?? '(missing-slug)'
  if (!ensureString(post.llmIntent)) issues.push({ slug, message: 'llmIntent is missing.' })
  if (!ensureString(post.llmGoal)) issues.push({ slug, message: 'llmGoal is missing.' })
  if (!ensureString(post.contentStatus)) issues.push({ slug, message: 'contentStatus is missing.' })
  if (!ensureString(post.validatedAt)) issues.push({ slug, message: 'validatedAt is missing.' })
  if (!Array.isArray(post.llmPrerequisites) || post.llmPrerequisites.length === 0)
    issues.push({ slug, message: 'llmPrerequisites list is empty.' })
  if (!Array.isArray(post.frameworkVersions) || post.frameworkVersions.length === 0)
    issues.push({ slug, message: 'frameworkVersions list is empty.' })
  if (!Array.isArray(post.llmSummaryTriples) || post.llmSummaryTriples.length === 0)
    issues.push({ slug, message: 'Missing llmSummaryTriples entries.' })
  if (!Array.isArray(post.llmApiPrompts) || post.llmApiPrompts.length === 0)
    issues.push({ slug, message: 'Missing llmApiPrompts entries.' })
})
pnpm validate:llm runs it with tsx, so CI or local builds fail fast when something’s missing.
For gaps, the backfill script uses GPT-5-mini to generate the metadata. You can run it in dry mode or force regeneration when you want fresh triples.
pnpm backfill:llm --dry-run --force --slug migrate-docker-containers-between-vps
pnpm backfill:llm --limit 10
pnpm backfill:llm
For each post it sends a truncated summary and gets back intent, goal, frameworks, prerequisites, outputs, triples, and Q/A. It also auto-adds _key values so Sanity Studio can edit the arrays immediately. If the model leaves anything blank, the script falls back to existing metadata or inferred defaults (tools become frameworks, goal surfaces as fallback output, difficulty maps to audience level). Combined with the validator, this keeps the whole corpus consistent without a weekly manual audit.
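If you want to build a similar backfill yourself, the heart of the loop is a structured chat-completion call followed by a Sanity patch. Here is a heavily simplified sketch: the prompt, fallback logic, full field set, and helper names are placeholders, and only the library calls reflect real APIs.
// File: scripts/backfill-llm-content.ts (heavily simplified sketch of the per-post step)
import { randomUUID } from 'node:crypto'
import type { SanityClient } from '@sanity/client'
import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

// `sanity` must be a write-enabled client (token with editor rights), configured elsewhere.
async function backfillPost(sanity: SanityClient, postId: string, excerpt: string) {
  // Ask the model for the structured fields as a single JSON object (real prompt elided).
  const completion = await openai.chat.completions.create({
    model: 'gpt-5-mini', // model name taken from the article; swap in whatever you have access to
    response_format: { type: 'json_object' },
    messages: [
      { role: 'system', content: 'Return llmIntent, llmGoal, llmSummaryTriples, and llmApiPrompts as JSON.' },
      { role: 'user', content: excerpt },
    ],
  })
  const metadata = JSON.parse(completion.choices[0]?.message?.content ?? '{}')

  // Sanity array items need a _key before Studio can edit them.
  const triples = (metadata.llmSummaryTriples ?? []).map(
    (t: Record<string, unknown>) => ({ _key: randomUUID(), ...t }),
  )

  await sanity
    .patch(postId)
    .set({ llmIntent: metadata.llmIntent, llmGoal: metadata.llmGoal, llmSummaryTriples: triples })
    .commit()
}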
Wrap It Up with Documentation
The src/app/llms.txt/README.md file now documents the whole pipeline—Sanity schema fields, manifest routes, NDJSON corpus, backfill command, and validation checklist. Having historical memory in the repo helps new contributors understand why each piece exists and how to extend it safely.
Conclusion
The first phase gave us Markdown mirrors and llms.txt; the second phase makes the entire pipeline structured, discoverable, and self-maintaining. Sanity stores the metadata, Next.js renders richer Markdown, manifests expose machine-friendly snapshots, service specs have JSON siblings, robots/sitemaps broadcast the endpoints, and tooling backfills or validates everything automatically. By the end of this guide you can let crawlers grab a coherent corpus—Markdown front matter, triples, JSON snippets, even pricing data for services—without scraping or manual exports.
Let me know in the comments if you have questions, and subscribe for more practical development guides.
Thanks, Matija