Can robots.txt reliably block AI scrapers?

Not on its own. robots.txt only asks compliant bots, including named AI crawlers, not to crawl, and this guide shows how to disallow known AI user agents. Malicious scrapers may ignore robots rules, so combine it with server-side protections, rate limiting, and bot detection.

Why did robots.txt return 404 and how do I fix it?

If the middleware matcher covers file routes, middleware can rewrite those requests so the .txt and .xml handlers are never reached. Exclude paths with file extensions in the middleware matcher (e.g. with .*\..*) so the App Router handlers serve robots.txt and sitemap.xml directly.

How does caching affect tenant lookups and freshness?

Wrapping the tenant lookup in unstable_cache with a cache tag and a revalidate interval of 3600 seconds keeps lookups fast, but a tenant change can take up to an hour to show up. Adjust the revalidate interval and cache tags to balance freshness and performance.

Dynamic robots.txt in Next.js for Multi-Tenant Sites

I recently tackled a common challenge in multi-tenant architectures: how to serve unique robots.txt and sitemap.xml files for different domains running on the same application. A static file in /public just doesn't cut it when Tenant A needs to block AI bots while Tenant B wants full indexing.

This guide walks through the robust, cached solution I implemented using Next.js App Router and Payload CMS.

1. The Core Utility: Centralized Tenant Lookups

The first step was to stop repeating ourselves. We needed a single, reliable way to resolve the current tenant from the hostname—whether it's a custom domain (example.com) or a subdomain (tenant.app.com).

I centralized this logic in src/payload/db/index.ts using unstable_cache to keep performance high. This function is the backbone of our SEO strategy.

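// File: src/payload/db/index.ts

import { unstable_cache } from "next/cache";
// getPayloadClient, CACHE_KEY, and TAGS are project-specific helpers defined elsewhere in the codebase.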

export const getTenantByDomain = async (domain: string) => {
  return await unstable_cache(
    async () => {
      const payload = await getPayloadClient();
      const tenants = await payload.find({
        collection: "tenants",
        where: {
          or: [
            { domain: { equals: domain } },
            { slug: { equals: domain.split('.')[0] } } // Fallback to slug for subdomain patterns
          ]
        },
        limit: 1,
      });
      return tenants.docs[0] || null;
    },
    [CACHE_KEY.TENANT_BY_DOMAIN(domain)],
    {
      tags: [TAGS.TENANTS],
      revalidate: 3600, // Revalidate every hour
    }
  )();
};

Why this matters: This function handles the heavy lifting of database queries and caching. By centralizing it, we ensure that robots.txt, sitemap.xml, and humans.txt all "agree" on which tenant is active.
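
One detail the lookup depends on is the shape of the domain string it receives. The Host header can include a port (for example during local development) and its casing is not guaranteed, so a small normalization step before calling getTenantByDomain may help. The following is a minimal sketch, not part of the original implementation; normalizeHost and slugCandidate are hypothetical helper names.

```typescript
// Hypothetical helpers (not from the original codebase): clean up a raw
// Host header value before using it as a tenant lookup key.

/** Lowercases the host and strips a trailing ":port" if present. */
function normalizeHost(rawHost: string): string {
  const host = rawHost.trim().toLowerCase();
  // Only strip a numeric port suffix; leave everything else untouched.
  return host.replace(/:\d+$/, "");
}

/** First DNS label, mirroring the domain.split('.')[0] slug fallback. */
function slugCandidate(rawHost: string): string {
  return normalizeHost(rawHost).split(".")[0];
}
```

With something like this in place, the route handlers could pass normalizeHost(hostname) to getTenantByDomain so that a request for Tenant.App.com:3000 and one for tenant.app.com resolve to the same cache key.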

2. Dynamic Robots.txt with AI Protection

With the tenant lookup in place, I created a dynamic route handler for robots.txt. This isn't just a static file anymore; it's code. This allows us to inject the correct sitemap URL for the specific tenant and apply global rules, like blocking AI scrapers.

// File: src/app/robots.ts

import type { MetadataRoute } from "next";
import { headers } from "next/headers";
import { getTenantByDomain } from "@/payload/db";

export default async function robots(): Promise<MetadataRoute.Robots> {
  // Get hostname from request headers
  const hostname = (await headers()).get('host') || 'www.adart.com';
  
  // Try to find tenant by domain or subdomain
  const tenant = await getTenantByDomain(hostname);
  
  // Prefer the tenant's configured domain; otherwise fall back to the request hostname
  const baseUrl = tenant?.domain ? `https://${tenant.domain}` : `https://${hostname}`;
  
  return {
    rules: [
      // Block AI Scraping Bots
      {
        userAgent: ["GPTBot", "CCBot", "Google-Extended"],
        disallow: ["/"],
      },
      // Standard bots
      {
        userAgent: "*",
        allow: "/",
        disallow: [
          "/admin",
          "/api",
        ],
        crawlDelay: 1,
      },
    ],
    sitemap: `${baseUrl}/sitemap.xml`,
    host: baseUrl,
  };
}

Key Features:

Dynamic Host: The sitemap link automatically matches the visitor's domain.
AI Blocking: Explicit disallow rules for GPTBot, CCBot, and Google-Extended ask those crawlers to stay away from our content.
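
The handler above applies the AI block to every tenant. If some tenants want full indexing, as in the Tenant A / Tenant B scenario from the introduction, the rules could be derived from a tenant setting. This is a sketch under assumptions: the blockAiBots field and the buildRobotsRules helper are hypothetical and not part of the original schema, and the local types mirror only the subset of MetadataRoute.Robots used here.

```typescript
// Minimal local type so the sketch runs without Next.js installed.
type RobotsRule = {
  userAgent: string | string[];
  allow?: string | string[];
  disallow?: string | string[];
  crawlDelay?: number;
};

// Hypothetical tenant shape: blockAiBots is an assumed field.
type TenantSeoSettings = { blockAiBots?: boolean } | null;

const AI_USER_AGENTS = ["GPTBot", "CCBot", "Google-Extended"];

function buildRobotsRules(tenant: TenantSeoSettings): RobotsRule[] {
  const rules: RobotsRule[] = [];

  // Default to blocking when the tenant is unknown or has not opted out.
  if (tenant?.blockAiBots !== false) {
    rules.push({ userAgent: AI_USER_AGENTS, disallow: ["/"] });
  }

  // Standard bots: same rules as in the handler above.
  rules.push({
    userAgent: "*",
    allow: "/",
    disallow: ["/admin", "/api"],
    crawlDelay: 1,
  });

  return rules;
}
```

In robots.ts, the rules array would then become rules: buildRobotsRules(tenant). Blocking by default keeps the current behavior for tenants that never set the flag.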

3. Dynamic Humans.txt

To give credit where it's due, I also implemented a humans.txt endpoint. This is a nice touch that adds personality and transparency to the site, dynamically acknowledging the specific tenant.

// File: src/app/humans.txt/route.ts
// Next.js has no humans.ts metadata file convention, so this is a regular Route Handler served at /humans.txt.

import { headers } from "next/headers";
import { getTenantByDomain } from "@/payload/db";

export async function GET() {
  const hostname = (await headers()).get('host') || '';
  const tenant = await getTenantByDomain(hostname);
  const tenantName = tenant?.name || 'Ad Art';
  
  const content = `/* TEAM */
  
  Site built by: Ad Art Team
  For: ${tenantName}
  
/* SITE */
  
  Standards: HTML5, CSS3, TypeScript
  Components: Payload CMS, Next.js`;

  return new Response(content, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}

4. The Critical Fix: Middleware Matcher

This was the tricky part. Even with the files in place, robots.txt was returning a 404.

The culprit was src/middleware.ts. Its matcher regex captured every request that was not explicitly excluded, including file requests such as /robots.txt. I updated the negative lookahead to exclude any path containing a dot, which covers file extensions like .txt and .xml.

// File: src/middleware.ts

export const config = {
  matcher: [
    /*
     * Match all request paths except for:
     * ...
     * 5. Static files (e.g. /favicon.ico, /robots.txt) - Matched by .*\\..*
     */
    '/((?!api|_next|_static|_vercel|.*\\..*).*)',
  ],
};

The Lesson: If your middleware runs on file routes, it might try to rewrite them to tenant paths (e.g., /tenant-slugs/.../robots.txt), which don't exist. Excluding files from middleware ensures they hit the App Router handlers directly.
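
To see which paths the pattern lets through, the same expression can be tried as a plain JavaScript regular expression. This is only an approximation: Next.js compiles matcher strings itself, so the anchoring added here (^ and $) and the runsMiddleware helper are assumptions made for illustration.

```typescript
// Approximation of the matcher as a plain RegExp, anchored for whole-path tests.
const matcherPattern = new RegExp(
  "^/((?!api|_next|_static|_vercel|.*\\..*).*)$"
);

// Hypothetical helper: true when the path would be handled by middleware.
function runsMiddleware(pathname: string): boolean {
  return matcherPattern.test(pathname);
}

const samplePaths = [
  "/",
  "/about",
  "/robots.txt",
  "/sitemap.xml",
  "/api/users",
  "/_next/static/chunk.js",
];

// Print each sample path with whether middleware would run for it.
for (const path of samplePaths) {
  console.log(path, runsMiddleware(path));
}
```

Two side effects are worth knowing about. Any page path that contains a dot also bypasses middleware, and because the lookahead only checks the start of the path, a route such as /apiary is excluded along with /api in this approximation.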

Conclusion

By moving away from static files and leveraging Next.js Route Handlers, we've created an SEO infrastructure that provides:

Automatic Sitemaps per tenant.
Smart Indexing Rules that opt out of known AI crawlers.
Zero Maintenance when onboarding new tenants.

Let me know if you have questions!

Thanks, Matija
