Like humans, LLMs generate sloppy code over time - just faster. Learn how to use multi-model reviews and formal code analysis to ensure code quality.

Can LLMs Generate Quality Code? A 40,000-Line Experiment

Executive Summary

I spent four weeks part-time (probably 80 hours total) building a complete reactive UI framework with 40+ components, a router, and a supporting interactive website using only LLM-generated code. The conclusion: LLMs can produce quality code, but like human developers, they need the right guidance.

Key Findings

On Code Quality:

  • Well-specified tasks yield clean first-pass code
  • Poorly specified or unique requirements produce sloppy implementations
  • Code degrades over time without deliberate refactoring
  • LLMs defensively over-engineer when asked to improve reliability

On The Development Process:

  • It is hard to be “well specified” when a task is large
  • Extended reasoning ("thinking") produces better outcomes, though sometimes leads to circular or overly expansive logic
  • Multiple LLM perspectives (switching models) provides valuable architectural review and debug assistance
  • Structured framework use, e.g. Bau.js or Lightview, prevents slop better than unconstrained development
  • Formal metrics objectively identify and guide removal of code complexity

The bottom line: In many ways LLMs behave like the average of the humans who trained them—they make similar mistakes, but much faster and at greater scale. Hence, you can get six months of maintenance and enhancement “slop” 6 minutes after you generate an initial clean code base and then ask for changes.

The Challenge

Four weeks ago, I set out to answer a question that's been hotly debated in the development community: Can LLMs generate substantive, production-quality code?

Not a toy application. Not a simple CRUD app. A complete, modern reactive UI framework with lots of pre-built components, a router and a supporting website with:

  • Tens of thousands of lines of JavaScript, CSS, and HTML
  • Memory, performance, and security considerations
  • Professional UX and developer experience

I chose to build Lightview (lightview.dev)—a reactive UI framework combining the best features of Bau.js, HTMX, and Juris.js. The constraint: 100% LLM-generated code using Anthropic's Claude (Opus 4.5, Sonnet 4.5) and Google's Gemini 3 Pro (Flash was not released when I started).

Starting With Questions, Not Code

I began with Claude Opus:

Claude didn't initially dive into code. It asked dozens of clarifying questions:

  • TypeScript or vanilla JavaScript?
  • Which UI component library for styling? (It provided options with pros/cons)
  • Which HTMX features specifically?
  • Hosting preferences?
  • Routing strategy?

However, at times it started to write code before I thought it was ready, and I had to abort the response and redirect it.


After an hour of back-and-forth, Claude finally said: "No more questions. Would you like me to generate an implementation plan since there will be many steps?"

The resulting plan was comprehensive - a detailed Markdown file with checkboxes, design decisions, and considerations for:

  • Core reactive library
  • 40+ UI components
  • Routing system
  • A website … though the website got less attention - a gap I'd later address

I did not make any substantive changes to this plan except for clarifications on website items and, at the end of development, the addition of one major feature: declarative event gating.

The Build Begins

With the plan in place, I hit my token limit on Opus. No problem—I switched to Gemini 3 (High), which had full context from the conversation plus the plan file.

Within minutes, Gemini generated lightview.js—the core reactivity engine—along with two example files: a "Hello, World!" demo showing both Bau-like syntax and vDOM-like syntax.

Then I made a mistake.

"Build the website as an SPA," I said, without specifying to use Lightview itself. I left for lunch.

When I returned, there was a beautiful website running in my browser. I looked at the code and my heart sank: React with Tailwind CSS.

Worse, when I asked to rebuild it with Lightview, I forgot to say "delete the existing site first." So it processed and modified all 50+ files one by one, burning through tokens at an alarming rate.

The lesson: deleting and regenerating from scratch is often more token-efficient for large changes, while modifying existing files in place is better for targeted fixes. The LLM won't automatically choose the efficient path; you need to direct it.

The Tailwind Surprise

One issue caught me off-guard. After Claude generated the website using Lightview components, I noticed it was still full of Tailwind CSS classes. I asked Claude about this.

"Well," Claude effectively explained, "you chose DaisyUI for the UI components, and DaisyUI requires Tailwind as a dependency. I assumed you were okay with Tailwind being used throughout the site."

Fair point—but I wasn't okay with it. I prefer semantic CSS classes and wanted the site to use classic CSS approaches.


I asked Claude to rewrite the site using classic CSS and semantic classes. I liked the design and did not want to delete the files, so once again I suffered through a refactor that consumed a lot of tokens since it touched so many files. I once again ran out of tokens, tried GPT-OSS but hit syntax errors, and had to switch to another IDE to keep working.


The Iterative Dance

Over the next few weeks, as I built out the website and tested and iterated on components, I worked across multiple LLMs as token limits reset. Claude, Gemini, back to Claude. Each brought different strengths and weaknesses:

  • Claude excelled at architectural questions and generated clean website code with Lightview components
  • Gemini Pro consistently tried to use local tools and shell helper scripts to support its own work, which was valuable for speed and token efficiency. However, it sometimes failed catastrophically, zeroing out or corrupting many files and leaving no option but to roll back.
  • Switching perspectives proved powerful: "You are a different LLM. What are your thoughts?" often yielded breakthrough insights or rapid fixes to bugs on which one LLM had been spinning.
  • I found the real winner to be Gemini Flash. It did an amazing job of refactoring code without introducing syntax errors and needed minimal guidance on what code to put where. Sometimes I was skeptical of a change and would say so. Sometimes Flash would agree and adjust; other times it would make a rational justification for its choice. And, talk about fast … wow!

The Router Evolution

The router also needed work. Claude initially implemented a hash-based router (#/about, #/docs, etc.). This is entirely appropriate for an SPA—it's simple, reliable, and doesn't require server configuration.

But I had additional requirements I hadn't clearly stated: I wanted conventional paths (/about, /docs) for deep linking and SEO. Search engines can handle hash routes now, but path-based routing is still cleaner for indexing and sharing.
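
To make the distinction concrete, here is a minimal, generic sketch of the two strategies using plain browser APIs. This is an illustration only, not Lightview's actual router API.

```javascript
// Generic illustration of the two routing strategies - not Lightview's router API.

// Hash-based: the fragment never reaches the server, so no server config is needed.
window.addEventListener("hashchange", () => {
  const route = window.location.hash.slice(1) || "/"; // "#/about" -> "/about"
  render(route);
});

// Path-based: the History API makes /about a real path, cleaner for deep links and SEO.
// Caveat: the server (or a 404 fallback) must serve the SPA shell for every route.
function navigate(path) {
  history.pushState({}, "", path); // update the address bar without a reload
  render(path);
}
window.addEventListener("popstate", () => render(window.location.pathname));

function render(route) {
  document.querySelector("#app").textContent = `Current route: ${route}`;
}
```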


When I told Claude I needed conventional paths for SEO and deep linking, it very rapidly rewrote the router and came up with what I consider a clever solution: a hybrid approach that makes the SPA pages both deep-linkable and SEO-indexable without the complexity of server-side rendering. However, it left some of the original hash-routing code in place, which obscured what was going on and was entirely unnecessary; I had to tell it to remove these vestiges. This kind of code retention is exactly what leads to slop. Many people would blame the LLM, but if I had been clear from the start and also said “completely rewrite”, my guess is the vestiges would not have existed.


Confronting The Numbers

The Final Tally

Project Size:

  • 60 JavaScript files, 78 HTML files, 5 CSS files
  • 41,405 total lines of code (including comments and blanks)
  • Over 40 custom UI components
  • 70+ website files

At this point, the files seemed reasonable - not overly complex. But intuition and my biased feelings about code after more than 40 years of software development aren't enough. I decided to run formal metrics on the core files.

Core Libraries:

| File | Lines | Minified Size |
|----|----|----|
| lightview.js | 603 | 7.75K |
| lightview-x.js | 1,251 | 20.2K |
| lightview-router.js | 182 | 3K |

The website component gallery scored well on Lighthouse for performance, without any focused optimization effort.

But then came the complexity metrics.

The Slop Revealed

I asked Gemini Flash to evaluate the code using three formal metrics:

1. Maintainability Index (MI): A combined metric where 0 is unmaintainable and 100 is perfectly documented/clean code. The calculation considers:

  • Halstead Volume (measure of code size and complexity)
  • Cyclomatic Complexity
  • Lines of code
  • Comment density

Scores above 65 are considered healthy for library code. This metric gives you a single number to track code health over time.
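
For reference, one common formulation (the classic formula with a comment-weight term, normalized to 0-100) can be computed as below. Whether the project's analysis scripts used this exact variant is an assumption on my part.

```javascript
// One common Maintainability Index formulation; whether the project's
// analyzer used this exact variant is an assumption on my part.
function maintainabilityIndex(halsteadVolume, cyclomaticComplexity, linesOfCode, commentRatio) {
  const raw =
    171 -
    5.2 * Math.log(halsteadVolume) -              // Halstead Volume term
    0.23 * cyclomaticComplexity -                 // Cyclomatic Complexity term
    16.2 * Math.log(linesOfCode) +                // size term
    50 * Math.sin(Math.sqrt(2.4 * commentRatio)); // comment-density bonus

  // Normalize to the 0-100 scale used in this article.
  return Math.min(100, Math.max(0, (raw * 100) / 171));
}

// Example: a mid-sized function (60 lines, volume 1200, complexity 8, 15% comments).
console.log(maintainabilityIndex(1200, 8, 60, 0.15).toFixed(1)); // ~55
```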

2. Cyclomatic Complexity: An older but still valuable metric that measures the number of linearly independent paths through code. High cyclomatic complexity means:

  • More potential bugs
  • Harder to test thoroughly (the metric can actually tell you how many test cases you might need to write)
  • More cognitive load to understand

3. Cognitive Complexity: A modern metric that measures the mental effort a human needs to understand code. Unlike cyclomatic complexity (which treats all control flow equally), cognitive complexity penalizes:

  • Nested conditionals and loops (deeper nesting = higher penalty)
  • Boolean operator chains
  • Recursion
  • Breaks in linear flow

The thresholds:

  • 0-15: Clean Code - easy to understand and maintain
  • 16-25: High Friction - refactoring suggested to reduce technical debt
  • 26+: Critical - immediate attention needed, maintenance nightmare
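
To make the penalties concrete, here is a small hypothetical before/after pair; the scores in the comments are approximate, following the usual counting rules (+1 per break in linear flow, extra +1 per level of nesting, +1 per boolean operator sequence).

```javascript
// BEFORE: nesting and boolean chains pile up penalties quickly.
function describeUserBefore(user) {
  let label = "unknown";
  if (user) {                                       // +1
    if (user.active && user.verified) {             // +2 (nested) +1 (&& chain)
      for (const role of user.roles) {              // +3 (nested twice)
        if (role === "admin" || role === "owner") { // +4 (nested three deep) +1 (|| chain)
          label = "privileged";
        }
      }
    }
  }
  return label;                                     // roughly 12 overall
}

// AFTER: early returns and an extracted helper flatten the flow.
function isPrivilegedRole(role) {
  return role === "admin" || role === "owner";      // +1
}

function describeUserAfter(user) {
  if (!user || !user.active || !user.verified) return "unknown";     // +1, +1 for the || chain
  return user.roles.some(isPrivilegedRole) ? "privileged" : "unknown"; // +1
}                                                   // roughly 3-4 overall
```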

The Verdict

Overall health looked good:

| File | Functions | Avg Maintainability | Avg Cognitive | Status |
|----|----|----|----|----|
| lightview.js | 58 | 65.5 | 3.3 | ⚖️ Good |
| lightview-x.js | 93 | 66.5 | 3.6 | ⚖️ Good |
| lightview-router.js | 27 | 68.6 | 2.1 | ⚖️ Good |

But drilling into individual functions told a different story. Two functions hit "Critical" status:

handleSrcAttribute (lightview-x.js):

  • Cognitive Complexity: 35 🛑
  • Cyclomatic Complexity: 22 🛑
  • Maintainability Index: 33.9

Anonymous Template Processing (lightview-x.js):

  • Cognitive Complexity: 31 🛑
  • Cyclomatic Complexity: 13

This was slop. Technical debt waiting to become maintenance nightmares.

Can AI Fix Its Own Slop?

Here's where it gets interesting. The code was generated by Claude Opus, Claude Sonnet, and Gemini 3 Pro several weeks earlier. Could the newly released Gemini 3 Flash clean it up?

I asked Flash to refactor handleSrcAttribute to address its complexity. This seemed to take a little longer than necessary. So I aborted and spent some time reviewing its thinking process. There were obvious places it got side-tracked or even went in circles, but I told it to continue. After it completed, I manually inspected the code and thoroughly tested all website areas that use this feature. No bugs found.

After the fixes to handleSrcAttribute, I asked for revised statistics to see the improvement.

Flash's Disappearing Act

Unfortunately, Gemini Flash had deleted its metrics-analysis.js file! It had to recreate the entire analyzer.


The Dev Dependencies Problem

When I told Gemini to keep the metrics scripts permanently, another issue surfaced: it had never properly installed dev dependencies like acorn (the JavaScript parser).

Flash simply assumed that because it found packages in node_modules, it could safely use them. The only reason acorn was available was because I'd already installed a Markdown parser that depended on it.
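
For illustration, here is a minimal sketch of the kind of script involved, assuming acorn and acorn-walk are installed explicitly as dev dependencies (npm install --save-dev acorn acorn-walk). It is my own reconstruction of the idea and computes only cyclomatic complexity; the project's actual metrics-analysis.js was more elaborate.

```javascript
// metrics-sketch.mjs - a minimal cyclomatic-complexity counter.
// Hypothetical reconstruction, not the project's actual metrics-analysis.js.
// Run with: node metrics-sketch.mjs path/to/file.js
import { readFileSync } from "node:fs";
import * as acorn from "acorn";
import * as walk from "acorn-walk";

// Count decision points inside one function node.
// Simplification: nested functions are also counted inside their parent.
function cyclomaticComplexity(fnNode) {
  let complexity = 1; // a function always has at least one path
  walk.simple(fnNode, {
    IfStatement() { complexity++; },
    ForStatement() { complexity++; },
    ForInStatement() { complexity++; },
    ForOfStatement() { complexity++; },
    WhileStatement() { complexity++; },
    DoWhileStatement() { complexity++; },
    ConditionalExpression() { complexity++; },
    CatchClause() { complexity++; },
    SwitchCase(node) { if (node.test) complexity++; }, // skip "default"
    LogicalExpression(node) {
      if (node.operator === "&&" || node.operator === "||") complexity++;
    },
  });
  return complexity;
}

const source = readFileSync(process.argv[2], "utf8");
const ast = acorn.parse(source, { ecmaVersion: "latest", sourceType: "module" });

const report = (name, node) => console.log(`${name}: ${cyclomaticComplexity(node)}`);
walk.simple(ast, {
  FunctionDeclaration(node) { report(node.id?.name ?? "<anonymous>", node); },
  FunctionExpression(node) { report(node.id?.name ?? "<anonymous>", node); },
  ArrowFunctionExpression(node) { report("<arrow>", node); },
});
```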


The Refactoring Results

With the analyzer recreated, Flash showed how it had decomposed the monolithic function into focused helpers:


  • fetchContent (cognitive: 5)
  • parseElements (cognitive: 5)
  • updateTargetContent (cognitive: 7)
  • elementsFromSelector (cognitive: 2)
  • handleSrcAttribute orchestrator (cognitive: 10)
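
To show the shape of the result: the helper names below come from the report above, but the bodies (and the target attribute) are my own hypothetical reconstruction rather than Lightview's actual code. The point is that the orchestrator becomes a linear sequence of single-purpose steps.

```javascript
// Hypothetical sketch only: the names mirror the report above,
// but the bodies are illustrative, not Lightview's actual code.
async function handleSrcAttribute(el) {
  // Each concern lives in exactly one low-complexity helper.
  const html = await fetchContent(el.getAttribute("src"));
  const elements = parseElements(html);
  const targets = elementsFromSelector(el.getAttribute("target"), el);
  updateTargetContent(targets, elements);
}

async function fetchContent(url) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`Failed to load ${url}: ${response.status}`);
  return response.text();
}

function parseElements(html) {
  // Parse into detached nodes so the caller decides where they go.
  return [...new DOMParser().parseFromString(html, "text/html").body.childNodes];
}

function elementsFromSelector(selector, fallback) {
  return selector ? [...document.querySelectorAll(selector)] : [fallback];
}

function updateTargetContent(targets, elements) {
  for (const target of targets) {
    target.replaceChildren(...elements.map((node) => node.cloneNode(true)));
  }
}
```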

The Results

| Metric | Before | After | Improvement |
|----|----|----|----|
| Cognitive Complexity | 35 🛑 | 10 ✅ | -71% |
| Cyclomatic Complexity | 22 | 7 | -68% |
| Status | Critical Slop | Clean Code | — |

Manual inspection and thorough website testing revealed zero bugs. The cost? A 0.5K increase in file size - negligible.

Emboldened, I tackled the template processing logic. Since it spanned multiple functions, this required more extensive refactoring:

Extracted Functions:

  • collectNodesFromMutations - iteration logic
  • processAddedNode - scanning logic
  • transformTextNode - template interpolation for text
  • transformElementNode - attribute interpolation and recursion

Results:

| Function Group | Previous Max | New Max | Status |
|----|----|----|----|
| MutationObserver Logic | 31 🛑 | 6 ✅ | Clean |
| domToElements Logic | 12 ⚠️ | 6 ✅ | Clean |

Final Library Metrics

After refactoring, lightview-x.js improved significantly:

  • Functions: 93 → 103 (better decomposition)
  • Avg Maintainability: 66.5 → 66.8
  • Avg Cognitive: 3.6 → 3.2

All critical slop eliminated. The increased function count reflects healthier modularity - complex logic delegated to specialized, low-complexity helpers. In fact, it is as good or better than established frameworks from a metrics perspective:

| File | Functions | Maintainability (min/avg/max) | Cognitive (min/avg/max) | Status |
|----|----|----|----|----|
| lightview.js | 58 | 7.2 / 65.5 / 92.9 | 0 / 3.4 / 25 | ⚖️ Good |
| lightview-x.js | 103 | 0.0 / 66.8 / 93.5 | 0 / 3.2 / 23 | ⚖️ Good |
| lightview-router.js | 27 | 24.8 / 68.6 / 93.5 | 0 / 2.1 / 19 | ⚖️ Good |
| react.development.js | 109 | 0.0 / 65.2 / 91.5 | 0 / 2.2 / 33 | ⚖️ Good |
| bau.js | 79 | 11.2 / 71.3 / 92.9 | 0 / 1.5 / 20 | ⚖️ Good |
| htmx.js | 335 | 0.0 / 65.3 / 92.9 | 0 / 3.4 / 116 | ⚖️ Good |
| juris.js | 360 | 21.2 / 70.1 / 96.5 | 0 / 2.6 / 51 | ⚖️ Good |

Lessons Learned

1. LLMs Mirror Human Behavior—For Better and Worse

LLMs exhibit the same tendencies as average developers:

  • Rush to code without full understanding
  • Don't admit defeat or ask for help soon enough
  • Generate defensive, over-engineered solutions when asked to improve reliability
  • Produce cleaner code with structure and frameworks

The difference? They do it faster and at greater volume. They can generate mountains of slop in hours that would take humans weeks.

2. Thinking Helps

Extended reasoning (visible in "thinking" modes) shows alternatives, self-corrections, and occasional "oh but" moments. The thinking is usually fruitful, sometimes chaotic. Don't just walk away or do something else while tasks you believe are complex or critical are being conducted. The LLMs rarely say "I give up" or "Please give me guidance" - I wish they would more often. Watch the thinking flow and abort the response if necessary. Read the thinking and redirect, or just say "continue"; you will learn a lot.

3. Multiple Perspectives Are Powerful

When I told a second LLM, "You are a different LLM reviewing this code. What are your thoughts?", magic happened.

This behavior is actually beyond what most humans provide:

  • How many human developers give rapid, detailed feedback without any defensive behavior?
  • How many companies have experienced architects available for questioning by any developer at any time?
  • How many code review conversations happen without ego getting involved?

4. Structure Prevents Slop

As the key findings note: structured framework use, such as building against Bau.js or Lightview, prevents slop better than unconstrained development. Once the Lightview components and conventions were in place, the website code the LLMs generated came out noticeably cleaner than the unconstrained React-and-Tailwind first pass.

5. Metrics Provide Objective Truth

I love that formal software metrics can guide LLM development. They're often considered too dull, mechanical, difficult or costly to obtain for human development, but in an LLM-enhanced IDE with an LLM that can write code to do formal source analysis (no need for an IDE plugin subscription), they should get far more attention than they do.

Metrics don't lie. They identified the slop my intuition missed.

The Verdict

After 40,000 lines of LLM-generated code, I'm cautiously optimistic.

Yes, LLMs can generate quality code. But like human developers, they need:

  • Clear, detailed specifications
  • Structural constraints (frameworks, patterns)
  • Regular refactoring guidance
  • Objective quality measurements
  • Multiple perspectives on architectural decisions

The criticism that LLMs generate slop isn't wrong—but it's incomplete. They generate slop for the same reasons humans do: unclear requirements, insufficient structure, and lack of quality enforcement.

The difference is iteration speed. What might take a human team months to build and refactor, LLMs can accomplish in hours. The cleanup work remains, but the initial generation accelerates dramatically.

Looking Forward

I'm skeptical that most humans will tolerate the time required to be clear and specific with LLMs - just as they don't today when product managers or developers push for detailed requirements from business staff. The desire to "vibe code" and iterate will persist.

But here's what's changed: We can now iterate and clean up faster when requirements evolve or prove insufficient. The feedback loop has compressed from weeks to hours.

As coding environments evolve to wrap LLMs in better structure - automated metrics, enforced patterns, multi-model reviews - the quality will improve. We're not there yet, but the foundation is promising.

The real question isn't whether LLMs can generate quality code. It's whether we can provide them - and ourselves - with the discipline to do so consistently.

And, I have a final concern … if LLMs are based on history and have a tendency to stick with what they know, then how are we going to evolve the definition and use of things like UI libraries? Are we forever stuck with React unless we ask for something different? Or, are libraries an anachronism? Will LLMs and image or video models soon just generate the required image of a user interface with no underlying code?

Given its late entry into the game and the anchoring LLMs already have, I don’t hold high hopes for the adoption of Lightview, but it was an interesting experiment. You can visit the project at https://lightview.dev.

