Lille Open-Source LLM: Local Training with Sophia Optimizer

Train and run the Lille 130M open-source LLM locally with the Sophia-Triton optimizer, plus setup steps and usage via Hugging Face or the simpleai-sdk.

What Lille is and why it matters

Lille is a truly open-source small language model you can study, change, and even train on your own machine. Everything is public: dataset, tokenizer, training code, and optimizer. That makes it different from most "open" models that only share weights.

The model is listed at 130M parameters (closer to 140M in practice) and ships as both a base and an instruction-tuned release. See the Lille model page for source files.

Key components of the Lille stack

  • Tokenizer: Hastings, a 32k-vocabulary tokenizer used throughout the stack so tokenization stays consistent.
  • Dataset: FineWeb-Edu for pretraining and Kyoto-Corpus for instruction tuning.
  • Optimizer: Sophia-Triton, a memory-efficient Triton-based implementation of SophiaG designed to fit training on a single GPU.
  • Evaluations: simple-eval, a small framework to judge outputs.

All pieces are available from the project and linked from the model page on Hugging Face: Lille on Hugging Face.

Who should use Lille

This is aimed at independent ML researchers, developers on a budget, educators, and small teams who want a fully transparent model they can run locally. If you want to train or fine-tune an LLM without cloud costs, Lille is a practical choice.

What you need to train Lille locally

Hardware

  • One NVIDIA GPU with at least 12GB of VRAM. The creator used an RTX 4070 Ti.
  • A modern CPU and 32GB+ RAM are recommended for preprocessing and the data pipeline.

Software and packages

  • Linux or WSL recommended.
  • Python 3.10+.
  • PyTorch with CUDA matching your drivers.
  • Triton for the Sophia-Triton optimizer.
  • The Lille training repo and tokenizer from the model page.
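
Before cloning anything, a quick Python check confirms that the GPU, CUDA build, and Triton are visible (a minimal sketch using only the standard torch and triton packages):

import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("GPU:", props.name)
    print("VRAM (GB):", round(props.total_memory / 1e9, 1))

try:
    import triton
    print("Triton:", triton.__version__)
except ImportError:
    print("Triton not installed - required for the Sophia-Triton optimizer")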

Quick start: run Lille with the SDK

If you just want to run the model and try prompts, use the simpleai-sdk. It handles backends and caching.

from simple_ai import lille

# This will download and cache the model on first run
lille.generate("Your prompt here")

This is the fastest way to experiment. Use this to check outputs before you train or fine-tune.

Direct Hugging Face usage

To load the model with transformers, register the custom architecture and call from_pretrained. The snippet below is the same pattern used by the project:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
from transformers import LilleConfig, LilleForCausalLM

# Register the custom model architecture
AutoConfig.register("lille-130m", LilleConfig)
AutoModelForCausalLM.register(LilleConfig, LilleForCausalLM)

MODEL = "Nikity/lille-130m-instruct"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype="auto",
    device_map=DEVICE,
)

chat = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    chat,
    add_generation_prompt=True,
    return_tensors="pt"
).to(DEVICE)

with torch.inference_mode():
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        do_sample=True,
        temperature=0.5,
        top_p=0.95,
    )
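
To see the reply, decode only the newly generated tokens (a small continuation of the snippet above; the slice assumes a single-sequence batch):

# Strip the prompt tokens and decode the model's reply
reply = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(reply)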

How to set up a local training run

Training locally means you control every step. I'll keep this short and practical; use the project repo, linked from the model page, for the full scripts: source.

  1. Clone and install deps: Clone the training repo and install its requirements with pip. Make sure Triton and a CUDA-enabled PyTorch build that matches your drivers are installed.
  2. Prepare the tokenizer: Use the provided Hastings tokenizer and tokenize your dataset with the same settings the project uses to keep everything consistent (see the sketch after this list).
  3. Prepare data: Use the curated FineWeb-Edu subset for pretraining and Kyoto-Corpus for instruction tuning if you want an instruct model.
  4. Choose optimizer: Use Sophia-Triton for memory efficiency. It reduces memory overhead compared to common Adam variants.
  5. Start training: Use the repo's training script. Tune batch size to fit your GPU. The author trained on a single RTX 4070 Ti by lowering batch size and using the Sophia-Triton optimizer.
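
For step 2, the important part is to load and reuse the exact tokenizer the model ships with rather than a lookalike. A minimal sketch, assuming the Hastings tokenizer is the one published with the model repo used earlier:

from transformers import AutoTokenizer

# Same repo id as the inference snippet above; adjust if the project
# publishes the tokenizer separately.
tokenizer = AutoTokenizer.from_pretrained("Nikity/lille-130m-instruct")

sample = "Photosynthesis converts light energy into chemical energy."
ids = tokenizer(sample).input_ids
print(len(ids), tokenizer.decode(ids))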

Practical tips when training on one GPU

  • Reduce batch size and increase gradient accumulation steps (see the sketch after this list).
  • Use mixed precision (fp16) to cut VRAM use.
  • Enable optimizer offloading if available.
  • Monitor GPU memory with nvidia-smi and adjust steps if you see OOMs.
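
The first two tips combine into a standard PyTorch pattern. This is a generic single-GPU sketch, not the repo's actual training script: it reuses model and DEVICE from the Hugging Face snippet above, assumes train_loader yields dicts with an "input_ids" tensor, and uses AdamW as a stand-in until you swap in Sophia-Triton:

import torch

# Illustrative hyperparameters; use the repo's recommended settings.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()
accum_steps = 8  # effective batch = per-step batch * accum_steps

model.train()
optimizer.zero_grad(set_to_none=True)
for step, batch in enumerate(train_loader):
    input_ids = batch["input_ids"].to(DEVICE)
    with torch.cuda.amp.autocast(dtype=torch.float16):
        # HF-style causal LMs return the loss when labels are provided
        loss = model(input_ids=input_ids, labels=input_ids).loss / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)   # unscales grads, then takes the step
        scaler.update()
        optimizer.zero_grad(set_to_none=True)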

Why Sophia-Triton helps

Sophia-Triton is a Triton-based implementation of the SophiaG optimizer. It trades a little runtime for much lower memory use.
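
How you construct it depends on the Sophia-Triton package, but SophiaG-style optimizers are typically created like this (the import path and hyperparameters below are illustrative; check the Lille repo for the real module name and recommended values, and model is the network being trained):

# Illustrative only: module name and hyperparameters depend on the
# Sophia-Triton package shipped with the Lille project.
from sophia_triton import SophiaG  # hypothetical import path

optimizer = SophiaG(
    model.parameters(),
    lr=2e-4,
    betas=(0.965, 0.99),
    rho=0.05,           # clipping threshold for the Hessian estimate
    weight_decay=0.1,
)

Note that Sophia-style optimizers also rely on a periodic Hessian estimate update during training, which the project's training script handles for you, so prefer its loop over hand-rolling your own.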

If you want to dig deeper, the Lille model page hosts the optimizer and training code: Lille on Hugging Face.

Evaluation and next steps

Use the provided simple-eval framework to run quick checks. It uses an LLM judge to score outputs. After you train:

  • Run the validation suite in the repo.
  • Compare base vs instruct variants on prompts you care about.
  • Fine-tune on your own data with Kyoto-Corpus-style formatting.
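
If you build your own instruction data, reuse the model's chat template so fine-tuning examples match what the instruct model sees at inference time (a sketch using the tokenizer loaded earlier; the exact Kyoto-Corpus schema may differ):

# Render an instruction pair with the model's own chat template.
example = [
    {"role": "user", "content": "Summarize photosynthesis in one sentence."},
    {"role": "assistant", "content": "Photosynthesis turns light, water, and CO2 into sugar and oxygen."},
]
text = tokenizer.apply_chat_template(example, tokenize=False)
print(text)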

Common pitfalls and quick fixes

  • OOM errors: Lower batch size, increase accumulation, enable fp16.
  • Slow runs: Profile data loading and use faster storage (NVMe).
  • Tokenizer mismatch: Always use the project tokenizer to avoid tokenization drift.

Where to go for help

Open an issue on the Lille repo or check the Hugging Face model page for links to code and files. If something breaks during setup, roll back changes, check CUDA/PyTorch versions, and confirm Triton is compatible.

Final notes

I kept this practical and focused so you can try Lille on a 4070 Ti. Start with the SDK to confirm behavior, then move to local training with Sophia-Triton when you want full control. Next step: clone the repo, install Triton, and run a tiny training job to confirm your environment. You’ll learn fast and keep full visibility into the stack.
