
Generative AI
How I trained my own txt2img models
Engineer & artist
2021-now
Training a LoRA
Training large AI models demands massive GPU resources that most individual developers can't afford. LoRA (Low-Rank Adaptation) models provide a cost-effective and efficient starting point, making AI model training accessible to everyone.
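The idea, in a minimal PyTorch sketch (illustrative, not the exact implementation I train with): wrap a frozen layer and learn only a small low-rank update on top of it.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the original weights stay frozen
        # Only these two small matrices are trained.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

# A 768x768 attention projection has ~590K weights; a rank-8 LoRA on it
# trains only ~12K parameters, which is why it fits on consumer GPUs.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
```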

The criteria for “good”
Before diving into model training, it’s crucial to establish clear objectives. For example, when training a LoRA model for anime characters, one of the most challenging subjects, I defined three key criteria as benchmarks for “good”.

- Accurate: generates correct features
- Versatile: applies different styles
- Adaptable: supports variations
Hypothesis of the learning process
Understanding how the training algorithm works is essential, especially when preparing the dataset. However, the algorithm's logic isn’t always practical as direct guidance, so I’ve translated it into a more readable and actionable format.
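For reference, the underlying objective in Stable Diffusion-style training is noise prediction. A simplified sketch, assuming a diffusers-style UNet and noise scheduler:

```python
import torch
import torch.nn.functional as F

def training_step(unet, latents, text_embeddings, noise_scheduler):
    """One step of the standard denoising objective: the model sees a noised
    latent and must predict the noise, conditioned on the caption embedding."""
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeddings).sample
    return F.mse_loss(pred, noise)
```

Everything in the steps below exists to make this loss meaningful: the captions must describe the images precisely, or the model learns the wrong associations.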

Training steps
1. Data collection
   - Collecting: image archive websites, open databases
   - Creating: renders of 3D models, screenshots
2. Prompt editing
   - AI interrogation
   - Manual cleanup (sketched below):
     - Remove tags for inherent traits
     - Remove synonym tags
     - Add missing tags
     - Add special prompts
3. Training
   - Learning rate
   - Learning rate scheduler
   - Weight decay
   - Weight-norm scaling
   - Layer-wise weights
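A hypothetical sketch of the manual cleanup rules, for booru-style tag files (one comma-separated caption per image); the tag sets, trigger word, and dataset path here are illustrative, not my actual lists:

```python
from pathlib import Path

INHERENT = {"blue eyes", "silver hair"}      # traits the trigger word should absorb
SYNONYMS = {"long hair": "very long hair"}   # map synonym tags to one canonical form
TRIGGER = "my_character"                     # special prompt added to every caption

def clean_caption(raw: str) -> str:
    tags = [t.strip() for t in raw.split(",") if t.strip()]
    tags = [SYNONYMS.get(t, t) for t in tags]      # collapse synonyms
    tags = [t for t in tags if t not in INHERENT]  # remove inherent traits
    seen, out = set(), []
    for t in tags:
        if t not in seen:                          # drop duplicates from the mapping
            seen.add(t)
            out.append(t)
    return ", ".join([TRIGGER] + out)              # add the special trigger prompt

for txt in Path("dataset").glob("*.txt"):
    txt.write_text(clean_caption(txt.read_text()))
```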
Training with synthetic data
Training with synthetic data is an approach I use often. Because I generate the data myself, it is fully controllable: clean, noise-free, correctly labeled, and available in whatever volume I need. However, its stylistic uniformity can hinder model performance. To counter this, I constrain the weight norms of specific layers or add penalty terms to the training loss.
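One possible form of such a penalty (a sketch of the idea, not my exact loss term): penalize trainable weights whose norms grow too large, so the synthetic data's uniform style cannot imprint too strongly.

```python
import torch

def weight_norm_penalty(named_params, max_norm: float = 1.0) -> torch.Tensor:
    """Sum of how far each trainable weight's L2 norm exceeds max_norm.

    Keeping the update small limits how strongly the synthetic data's
    uniform style can dominate. The threshold is an illustrative value.
    """
    excess = [
        (p.norm() - max_norm).clamp(min=0.0)
        for name, p in named_params
        if p.requires_grad and "lora" in name  # constrain only adapter layers
    ]
    return torch.stack(excess).sum() if excess else torch.zeros(())

# Used as: loss = denoising_loss + 0.001 * weight_norm_penalty(model.named_parameters())
```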

Result comparison
To illustrate my model's performance, I use a complex character as an example. The first image, serving as the baseline, is from the in-game 3D model. The second is generated by my model, and the third comes from the top-ranked community contributor. Results show that my model achieves superior accuracy in detail reproduction and overall image quality. For more models, check out my CivitAI page.




Objective path
Subjective impact
How I fine-tuned my own base model
A model that can produce on demand
When none of the existing models in the community could perfectly produce the results I envisioned, I decided to take on the challenge of training my own large-scale text-to-image model in an anime style. My ambition extended beyond simply matching or surpassing the quality of mainstream models of the time. At the heart of this endeavor was a deeper goal: to explore how objective descriptions could be used to generate subjective artistic styles through this model.


Training in 2 phases
The training process was divided into two key phases. The first phase involved fine-tuning a pre-trained model using a vast dataset of anime images. This step aimed to instill the concept of "anime" into the model, effectively transforming a general-purpose base model into one specialized for anime. I referred to this phase as the "Base Training."
The second phase was far more meticulous. I carefully curated a refined selection of high-quality images and invested significant effort in classifying and annotating them based on style. This dataset was then used for a second round of training on the model produced in the first phase. The goals here were twofold: to achieve superior image generation quality and, more importantly, to embed my stylistic preferences into the model. I called this phase "Quality Training."
Steps of base training
1. Data collection
   - Images (10M+)
   - Metadata
   - No synthetic data
2. Metadata processing
   - Metadata cleaning
   - Tag extraction
   - Quality definition
3. Base training
   - Based on a pre-trained model (sketched below)
   - Concept forming
   - Basic style forming
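As a sketch of the phase 1 setup, assuming a diffusers-style pipeline (the checkpoint id is a placeholder, not necessarily the model I started from):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Only the UNet is trained here; the VAE and text encoder stay frozen so the
# "anime" concept is absorbed without destabilizing the rest of the model.
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)
unet = pipe.unet.train()

# Illustrative hyperparameters: a low learning rate with cosine decay and
# mild weight decay, mirroring the training knobs listed earlier.
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000)
```

The loop itself is the noise-prediction step sketched earlier, run over the 10M+ image dataset.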
Steps of quality training
1. Data collection
   - Selected images (40K)
   - Metadata
2. Metadata processing
   - Style definition
   - Quality definition
3. Quality training
   - Based on the phase 1 model
   - Style control forming
Style control
Style is an inherently subjective concept, while prompts are objective descriptions. As someone who is both an engineer and an artist, I couldn't help but notice this rarely acknowledged contradiction: how can objective inputs be used to create something profoundly subjective?

Common approach
One of the most common and widely adopted methods for style control in the industry involves using painters' names. During training, images are labeled directly with the painter's name, and during generation their style is reproduced simply by referencing that name, often even blending the styles of multiple painters seamlessly. This approach is straightforward and reproduces the desired style with high accuracy. However, it has significant drawbacks: it lacks flexibility, since it cannot isolate and control the specific stylistic elements unique to a particular artist. More critically, it raises copyright concerns and questions of respect for the intellectual property and legacy of the artists themselves.




My approach
My approach involves breaking down the stylistic features of the high-quality images collected for the second round of training according to predefined dimensions and labeling them accordingly. For instance, under coloring styles, categories like high contrast, vibrant, and candy colors are included, while shading styles may include cel-shading, soft shading, and more. This process heavily relies on my personal experience and expertise in art, yet it remains an exceptionally labor-intensive task. At this stage, delegating such work to machines is not feasible, as aesthetic judgment still requires human input and discernment.
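To make the scheme concrete, here is a hypothetical sketch of how such dimension labels fold into captions. The dimensions and values beyond those named above (e.g. the linework dimension) are illustrative, and the label assignment itself is done by hand:

```python
STYLE_DIMENSIONS = {
    "coloring": ["high contrast", "vibrant", "candy colors"],
    "shading": ["cel-shading", "soft shading"],
    "linework": ["thin lines", "bold lines"],  # hypothetical dimension
}

def style_caption(base_caption: str, labels: dict[str, str]) -> str:
    """Append the human-assigned style labels to an image's caption so each
    stylistic element becomes an independently promptable tag."""
    for dim, value in labels.items():
        assert dim in STYLE_DIMENSIONS and value in STYLE_DIMENSIONS[dim]
    return ", ".join([base_caption, *labels.values()])

print(style_caption(
    "1girl, school uniform, cherry blossoms",
    {"coloring": "vibrant", "shading": "soft shading"},
))
# -> "1girl, school uniform, cherry blossoms, vibrant, soft shading"
```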
