
Generative AI

How I trained my own txt2img models

Engineer & artist
2021-now

Training a LoRA

Training large AI models demands massive GPU resources that most individual developers can't afford. LoRA (Low-Rank Adaptation) models provide a cost-effective and efficient starting point, making AI model training accessible to everyone.
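To make the "low-rank" idea concrete, here is a minimal PyTorch sketch of how a LoRA module wraps a frozen linear layer with a small trainable update (the class name and the rank/alpha defaults are my own illustration, not the code of any specific trainer):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen Linear layer plus a trainable low-rank update:
    y = W x + (alpha / rank) * up(down(x))."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # the original weights stay frozen
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)   # d_in -> r
        self.up = nn.Linear(rank, base.out_features, bias=False)    # r -> d_out
        nn.init.zeros_(self.up.weight)        # start as a no-op update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Usage sketch: wrap one projection layer and train only the two small matrices.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
out = layer(torch.randn(1, 768))
```

Because only the two small matrices receive gradients, the trainable parameter count and VRAM footprint shrink dramatically, which is what makes this kind of training feasible on a single consumer GPU.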


The criteria for “good”

Before diving into model training, it’s crucial to establish clear objectives. For example, when training a LoRA model for anime characters—one of the most challenging types—I’ve defined three key criteria as benchmarks for "good".

Accurate

Generates correct features

Versatile

Applies different styles

Adaptable

Supports variations

Hypothesis of the learning process

Understanding how the training algorithm works is essential, especially when preparing the dataset. However, the algorithm's logic isn’t always practical as direct guidance, so I’ve translated it into a more readable and actionable format.

Training steps

Data Collection

Collecting:
  • image archive websites
  • open databases

Creating:
  • 3D models
  • screenshots

Prompt editing

AI interrogation

Manual cleanup (a cleanup sketch follows this list)
  • Remove tags for inherent features
  • Remove synonymous tags
  • Add missing tags
  • Add special prompts (e.g. a trigger word)
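As a rough illustration of these cleanup rules, the sketch below (the trigger word, tag lists, and folder layout are hypothetical, not my actual dataset) strips tags for features the character always has, collapses synonyms, removes duplicates, and prepends a trigger prompt to each caption file:

```python
# Hypothetical cleanup pass over auto-tagged caption files (one .txt per image).
from pathlib import Path

TRIGGER = "my_character"                        # special trigger prompt for the LoRA
INHERENT = {"purple hair", "red eyes"}          # features the character always has
SYNONYMS = {"grin": "smile", "polychromatic": "multicolored"}

def clean_caption(raw: str) -> str:
    tags = [t.strip() for t in raw.split(",") if t.strip()]
    tags = [SYNONYMS.get(t, t) for t in tags]        # collapse synonyms
    tags = [t for t in tags if t not in INHERENT]    # drop inherent features
    deduped = list(dict.fromkeys(tags))              # remove exact duplicates
    return ", ".join([TRIGGER] + deduped)            # add the trigger up front

for path in Path("dataset").glob("*.txt"):
    path.write_text(clean_caption(path.read_text()))
```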

Training (an example setup follows below)
  • Learning rate
  • Learning rate scheduler
  • Weight decay
  • Scaled weight norms
  • Layered weights
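These knobs map fairly directly onto an optimizer and scheduler. The following is a hedged PyTorch sketch with placeholder values rather than tuned recommendations; the parameter groups merely stand in for real LoRA weights, and weight-norm scaling is omitted here:

```python
import torch
import torch.nn as nn

# Toy stand-ins for two groups of LoRA weights. In a real run these come from the
# injected LoRA layers; everything else in the network stays frozen.
group_down = [nn.Parameter(torch.zeros(8, 320))]
group_up = [nn.Parameter(torch.zeros(320, 8))]

# Separate learning rates per parameter group loosely mirrors the "layered weights" idea.
optimizer = torch.optim.AdamW(
    [
        {"params": group_down, "lr": 1e-4},
        {"params": group_up, "lr": 5e-5},
    ],
    weight_decay=1e-2,        # weight decay, as listed above
)
# Cosine learning-rate scheduler over the planned number of optimizer steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=2000)
```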

Training with synthetic data

Training with synthetic data is a common approach I use. It is error-free, fully controllable, and ensures clean, noise-free data while addressing volume limitations. However, its stylistic uniformity can hinder model performance, so I counter it with techniques such as constraining specific layer norm weights or adding penalty functions during training.
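One way to picture the penalty idea is an extra loss term that discourages the weight norms of selected layers from growing too large, so the adapter cannot over-commit to the uniform synthetic style. Here is a hedged sketch of that idea (the function, threshold, and weighting are illustrative, not my exact implementation):

```python
import torch
import torch.nn as nn

def weight_norm_penalty(modules, max_norm: float = 1.0) -> torch.Tensor:
    """Hinge-style penalty: punish layers whose weight L2 norm exceeds max_norm."""
    terms = [torch.clamp(m.weight.norm(p=2) - max_norm, min=0.0) ** 2 for m in modules]
    return torch.stack(terms).sum()

# Usage sketch: add the penalty to the ordinary training loss for selected layers,
# e.g. total_loss = diffusion_loss + 0.01 * weight_norm_penalty(selected_layers)
layers = [nn.Linear(16, 16), nn.Linear(16, 16)]
print(weight_norm_penalty(layers, max_norm=0.5))
```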

Result comparison

To illustrate my model's performance, I use a complex character as an example. The first image, serving as the baseline, is from the in-game 3D model. The second is generated by my model, and the third comes from the top-ranked community contributor. The results show that my model achieves superior accuracy in detail reproduction and overall image quality. For more models, check out my CivitAI page.


Objective path, subjective impact

How I fine-tuned my own base model

A model that can produce on demand

When none of the existing models in the community could perfectly produce the results I envisioned, I decided to take on the challenge of training my own large-scale text-to-image model in an anime style. My ambition extended beyond simply matching or surpassing the quality of mainstream models of the time. At the heart of this endeavor was a deeper goal: to explore how objective descriptions could be used to generate subjective artistic styles through this model.

Training in 2 phases

The training process was divided into two key phases. The first phase involved fine-tuning a pre-trained model using a vast dataset of anime images. This step aimed to instill the concept of "anime" into the model, effectively transforming a general-purpose base model into one specialized for anime. I referred to this phase as the "Base Training."

The second phase was far more meticulous. I carefully curated a refined selection of high-quality images and invested significant effort in classifying and annotating them based on style. This dataset was then used for a second round of training on the model produced in the first phase. The goals here were twofold: to achieve superior image generation quality and, more importantly, to embed my stylistic preferences into the model. I called this phase "Quality Training."

Steps of base training

Data collecting
  • Images (10M+)
  • Metadata
  • No synthetic data

Metadata processing
  • Metadata cleaning
  • Tag extraction
  • Quality defining (sketch below)

Base training
  • Based on a pre-trained model
  • Concept forming
  • Basic style forming
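To give a feel for the "quality defining" step, here is a hedged sketch of how scraped metadata could be mapped to a coarse quality tag appended to each caption (the field names, thresholds, and tag wording are invented for illustration):

```python
# Illustrative only: the metadata fields, thresholds, and tag names are invented.
def quality_tag(meta: dict) -> str:
    """Map scraped metadata (e.g. a community score) to a coarse quality bucket."""
    score = meta.get("score", 0)
    if score >= 100:
        return "best quality"
    if score >= 20:
        return "high quality"
    if score >= 0:
        return "normal quality"
    return "low quality"

def build_caption(meta: dict) -> str:
    """Append the quality bucket to the image's cleaned tag list."""
    return ", ".join(meta.get("tags", []) + [quality_tag(meta)])

print(build_caption({"tags": ["1girl", "night sky"], "score": 57}))
# -> 1girl, night sky, high quality
```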

Steps of quality training

Data collecting
  • Images selected (40K)
  • Metadata

Metadata processing
  • Style defining
  • Quality defining

Quality training
  • Based on the phase 1 model
  • Style control forming

Style control

Style is an inherently subjective concept, while prompts are objective descriptions. As someone who is both an engineer and an artist, I couldn't help but notice this rarely acknowledged contradiction: how can objective inputs be used to create something profoundly subjective?


Common approach

One of the most common and widely adopted methods in the industry for style control involves using the names of painters. During training, images are directly labeled with the painter's name, and during generation, their styles are reproduced simply by referencing their names, often even blending the styles of multiple painters seamlessly. This approach is straightforward and achieves a high degree of accuracy in recreating the desired style. However, it has significant drawbacks: it lacks flexibility, as it cannot isolate and control specific stylistic elements unique to a particular artist. More critically, it raises concerns about copyright and respect for the intellectual property and legacy of the artists themselves.

My approach

My approach involves breaking down the stylistic features of the high-quality images collected for the second round of training according to predefined dimensions and labeling them accordingly. For instance, under coloring styles, categories like high contrast, vibrant, and candy colors are included, while shading styles may include cel-shading, soft shading, and more. This process heavily relies on my personal experience and expertise in art, yet it remains an exceptionally labor-intensive task. At this stage, delegating such work to machines is not feasible, as aesthetic judgment still requires human input and discernment.
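To show roughly how those predefined dimensions turn into training captions, here is a small sketch (the dimension names and vocabularies are simplified examples, and the labels themselves are still assigned by hand):

```python
# Simplified example of predefined style dimensions; the real taxonomy is richer.
STYLE_DIMENSIONS = {
    "coloring": {"high contrast", "vibrant", "candy colors"},
    "shading": {"cel-shading", "soft shading"},
    "linework": {"clean lineart", "sketchy lineart"},
}

def style_tags(labels: dict) -> list:
    """Validate manually assigned labels against the taxonomy and flatten them
    into caption tags such as 'coloring: vibrant'."""
    tags = []
    for dimension, label in labels.items():
        if label not in STYLE_DIMENSIONS.get(dimension, set()):
            raise ValueError(f"'{label}' is not a defined label for '{dimension}'")
        tags.append(f"{dimension}: {label}")
    return tags

print(style_tags({"coloring": "vibrant", "shading": "cel-shading"}))
# -> ['coloring: vibrant', 'shading: cel-shading']
```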

