Powered by The Lucy

LLuMi

Redefining Multilingual Intelligence. LLuMi is a state-of-the-art 70B model that fuses robust instruction tuning with DeepSeek-R1-inspired reasoning patterns for unmatched real-world performance.

Release Date: February 24, 2025
research@thelucy.tech
LLuMi v2 is currently training and will be released very soon.

1. Introduction

We introduce LLuMi, a state-of-the-art multilingual large language model (LLM) built on the robust Llama 3.3 70B architecture. LLuMi is instruction tuned to excel in real-world applications, particularly in multilingual dialogue and complex reasoning tasks.

Leveraging advanced refinements and distillation techniques inspired by the DeepSeek-R1 framework, LLuMi not only retains the core strengths of its Llama 3.3 foundation but also delivers enhanced performance and efficiency. By integrating large-scale reinforcement learning directly on the base model, LLuMi exhibits sophisticated chain-of-thought behaviors, improved self-verification, and reduced issues such as repetition and language mixing.

Distillation

We demonstrate that the advanced reasoning patterns of larger models can be distilled into smaller, more efficient models, yielding better performance than applying reinforcement learning directly to those smaller models. The open-source DeepSeek-R1 framework and its API play a crucial role in enabling this.

Post-Training

We directly apply reinforcement learning (RL) to the base LLuMi model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach enables LLuMi to explore advanced chain-of-thought (CoT) capabilities for tackling complex problems.

2. Model Distillation & Architecture

The LLuMi 70B model was developed using the distillation techniques behind DeepSeek-R1 Distill Llama 3.3 70B. Furthermore, we have infused our smaller LLuMi 8B and 3B models with a unique thinking property through the use of GRPO (Group Relative Policy Optimization).

Two RL Stages

Designed to discover improved reasoning patterns and align the model with human preferences.

Two SFT Stages

Serving as the foundational seed for both the model’s reasoning and non-reasoning capabilities.
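The core of GRPO is a group-relative advantage: several completions are sampled per prompt, and each is scored against its group's mean and standard deviation, avoiding a separate value network. The sketch below is illustrative only; the reward values are hypothetical and the actual LLuMi training recipe is not published here.

```python
# Minimal sketch of the group-relative advantage step in GRPO
# (Group Relative Policy Optimization). Rewards are hypothetical.

def group_relative_advantages(rewards):
    """Normalize each reward against its group's mean and std.

    GRPO samples a group of completions per prompt and scores each
    one relative to the group, so no learned critic is needed.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers to one prompt, scored 0/1 for correctness
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # [1.0, -1.0, -1.0, 1.0]
```

Completions that beat their group's average receive positive advantages and are reinforced; below-average ones are penalized.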

3. Model Downloads

LLuMi Think Models

Model            Base Model              Download
LLuMi Think 3B   Qwen2.5-3B-Instruct     🤗 HuggingFace
LLuMi Think 8B   Llama-3.1-8B-Instruct   🤗 HuggingFace
LLuMi Think 70B  Llama-3.3-70B-Instruct  🤗 HuggingFace

4. How to Use

This repository contains three versions of the LLuMi Think models, for use with the transformers library and with the bitsandbytes codebase.

Use with transformers

With transformers >= 4.48.3, you can run conversational inference using the Transformers pipeline abstraction or by leveraging the Auto classes with the generate() function.

import transformers
import torch

model_id = "thellumi/LLuMi_Think_70B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Why are tomatoes red?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

Use bitsandbytes

The model checkpoints can be loaded in 8-bit and 4-bit precision for further memory optimizations using bitsandbytes and transformers.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "thellumi/LLuMi_Think_70B"
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
input_text = "Why are tomatoes red?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

output = quantized_model.generate(**input_ids, max_new_tokens=10)

print(tokenizer.decode(output[0], skip_special_tokens=True))

To load in 4-bit, simply pass load_in_4bit=True to BitsAndBytesConfig instead of load_in_8bit=True.
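For example, a 4-bit configuration might look like the following; the nf4 quantization type and bfloat16 compute dtype are common choices that match the 8-bit example above, not settings confirmed by the LLuMi authors.

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit quantization config; pair it with the same from_pretrained()
# call shown above in place of the 8-bit config.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```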

5. Usage Recommendations

We recommend adhering to the following configurations when using the LLuMi Think models, including for benchmarking, to achieve the expected performance:

  1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
  2. Avoid adding a system prompt; all instructions should be contained within the user prompt.
  3. For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}".
  4. When evaluating model performance, it is recommended to conduct multiple tests and average the results.
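Recommendations 1-3 can be wrapped in a small helper like the one below. The function name and dictionary layout are our own, but the sampling parameters match the arguments accepted by transformers' generate() and pipeline APIs.

```python
# Helper applying the recommendations above: temperature 0.6 (within
# the 0.5-0.7 range), no system turn, and the \boxed{} directive for
# math questions. Illustrative only, not part of any LLuMi API.

MATH_DIRECTIVE = (
    "Please reason step by step, and put your final answer within \\boxed{}."
)

def build_request(question: str, math: bool = False) -> dict:
    content = f"{question}\n{MATH_DIRECTIVE}" if math else question
    return {
        # All instructions live in the user turn; no system prompt.
        "messages": [{"role": "user", "content": content}],
        "do_sample": True,
        "temperature": 0.6,
    }
```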

Additionally, DeepSeek has observed that the DeepSeek-R1 series models tend to bypass the thinking pattern (i.e., output "<think>\n\n</think>") when responding to certain queries, which can adversely affect the model's performance. To ensure that the model engages in thorough reasoning, we recommend forcing the model to begin every output with "<think>\n".
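One string-level way to enforce this is to append the tag to the fully rendered generation prompt so that decoding starts inside the thinking block. In practice the prompt string would come from tokenizer.apply_chat_template(..., add_generation_prompt=True); the helper below is a sketch of ours, not an official API.

```python
# Force the "<think>\n" opening by seeding the assistant turn:
# the model then continues generation from inside the thinking tag.

THINK_PREFIX = "<think>\n"

def seed_thinking(generation_prompt: str) -> str:
    """Return the prompt with the reasoning tag already opened."""
    if generation_prompt.endswith(THINK_PREFIX):
        return generation_prompt
    return generation_prompt + THINK_PREFIX
```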

6. Training Data

Overview

LLuMi is built upon the robust Llama 3.3 architecture, which was pretrained on approximately 15 trillion tokens sourced from publicly available datasets. For fine-tuning, LLuMi leverages a combination of publicly available instruction datasets and over 10 million examples sourced from Hugging Face. This comprehensive training corpus has been curated to ensure high performance across various languages, with dedicated support for Turkish and other languages.

Data Freshness

The pretraining data has a cutoff of August 2024, keeping LLuMi aligned with recent language trends and developments.

7. Benchmarks

Model                   AIME 2024 (pass@1)  MATH-500 (pass@1)  GPQA Diamond  LiveCodeBench  CodeForces
Claude-3.5-Sonnet-1022  16.0                78.3               65.0          38.9           717
OpenAI o1-1217          79.2                96.4               75.7          63.4           2061
OpenAI o1-mini          63.6                90.0               60.0          53.8           1820
OpenAI GPT-4o-0513      9.3                 74.6               49.9          32.9           759
QwQ-32B-Preview         44.0                90.6               54.5          41.9           1316
DeepSeek R1             79.8                97.3               71.5          65.9           2209
LLuMi Think 70B         69.3                94.1               64.8          56.9           1603

Note on Benchmark Results: Due to hardware limitations, full-scale benchmark tests could not be performed, and the results may vary. We remain fully transparent about these constraints and are actively working towards securing the necessary resources to conduct comprehensive evaluations in the near future.
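Following recommendation 4 in Section 5, pass@1 can be estimated as the mean correctness across several independent sampled runs of each benchmark; the flags below are illustrative.

```python
# Estimate pass@1 by averaging over repeated sampled runs, as
# recommended in Section 5 (flags: 1 = correct run, 0 = incorrect).

def pass_at_1(correct_flags):
    """Estimate pass@1 as the fraction of sampled runs that succeed."""
    if not correct_flags:
        raise ValueError("need at least one run")
    return sum(correct_flags) / len(correct_flags)

print(pass_at_1([1, 0, 1, 1]))  # 0.75
```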

8. Responsibility & Safety

At LLuMi, we are committed to promoting responsible and ethical use of our technology. We recognize that large language models carry inherent risks and potential for misuse, and we have taken several measures to mitigate these challenges:

  • Bias Mitigation: We have implemented various strategies during training to minimize biases in model outputs. However, users should be aware that, despite these efforts, occasional biases or unintended outputs may still occur.
  • Usage Guidelines: LLuMi is designed for research and responsible deployment. We strongly encourage users to adhere to ethical guidelines, applicable laws, and best practices when using the model. Generating harmful, misleading, or offensive content is strictly prohibited.
  • Safety Measures: Users deploying LLuMi in real-world applications should implement additional safety filters and monitoring mechanisms. We recommend regular audits and evaluations to ensure that the model's outputs remain within acceptable ethical boundaries.
  • Community Engagement: We invite the community to provide feedback on any safety or ethical issues encountered during usage. This collaborative approach is vital for continuously refining the model and addressing potential risks.
  • Transparency and Accountability: By open-sourcing LLuMi, we aim to foster transparency and accountability. We commit to ongoing research and updates focused on improving the model's safety and ethical performance.

By using LLuMi, you agree to follow these guidelines and contribute to a safer, more responsible AI ecosystem.