Research

Research that ships

We build practical AI innovations — large-scale synthetic datasets, efficient models that run on consumer hardware, and production-ready systems for real-world deployment.

9M+
Synthetic training examples
$0.14
Cost per 1K samples
26M
Smallest model params
Flagship Project

TinyFabulist

Large-scale synthetic narrative generation

A multi-phase research initiative producing open datasets, translation frameworks, and compact language models — all optimized for cost-effective deployment on consumer hardware.

TF1

3M Synthetic English Fables

First open dataset of three million moral fables generated by instruction-tuned models. Each story follows a structured prompt scaffold for consistent quality; a sketch of the scaffold idea follows this card.

3M stories
arXiv 2025
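
As a rough illustration of what a structured scaffold can look like, here is a minimal Python sketch. The slot names, example values, and render_prompt helper are hypothetical stand-ins, not the actual TF1 template.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical scaffold: every fable is generated from the same fixed slots,
# so stories share a consistent structure while the content varies.
@dataclass
class FableSpec:
    character: str  # e.g. "a stubborn fox"
    setting: str    # e.g. "a frozen river"
    conflict: str   # e.g. "refuses to ask for help"
    moral: str      # e.g. "pride comes before a fall"

def render_prompt(spec: FableSpec) -> str:
    """Turn one slot combination into an instruction for the generator model."""
    return (
        "Write a short moral fable.\n"
        f"Main character: {spec.character}\n"
        f"Setting: {spec.setting}\n"
        f"Conflict: {spec.conflict}\n"
        f"End with the moral: {spec.moral}"
    )

# Enumerating slot combinations is how a small template fans out into many
# distinct prompts: 2x2x2x2 = 16 here, far larger slot lists in practice.
characters = ["a stubborn fox", "a patient tortoise"]
settings = ["a frozen river", "a busy market"]
conflicts = ["refuses to ask for help", "takes a shortcut"]
morals = ["pride comes before a fall", "haste makes waste"]

prompts = [render_prompt(FableSpec(c, s, x, m))
           for c, s, x, m in product(characters, settings, conflicts, morals)]
print(f"{len(prompts)} prompts; first:\n{prompts[0]}")
```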
TF2

English-Romanian Literary Translation

A unified framework for dataset creation, fine-tuning, and evaluation of literary translations. Includes a fine-tuned 12B-parameter model competitive with proprietary alternatives; an evaluation sketch follows this card.

3M parallel pairs
12B params
arXiv 2025
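
The evaluation leg of such a framework can be sketched in a few lines with the open-source sacrebleu package. The sentences below are placeholders, and the project's actual metric suite may differ.

```python
# pip install sacrebleu
import sacrebleu

# Placeholder data: one model output and one human reference per segment.
hypotheses = ["The old house stood silent at the edge of the village."]
references = [["The old house stood quietly at the village's edge."]]

# BLEU and chrF are standard corpus-level MT metrics; chrF tends to be
# more robust for morphologically rich languages such as Romanian.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```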
TF3

Compact Romanian Language Models

End-to-end pipeline for training Romanian LMs from scratch: custom tokenizers, pretraining, compression via distillation, and large-scale dataset generation. A tokenizer-training sketch follows this card.

3M Romanian fables
26M params
arXiv 2025
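
The first stage of that pipeline, a custom tokenizer, is straightforward to sketch with the open-source tokenizers library. The corpus path, vocabulary size, and special tokens below are illustrative assumptions, not the project's actual settings.

```python
# pip install tokenizers
import os
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on a (hypothetical) Romanian corpus.
# A language-specific vocabulary avoids the token-count penalty that
# English-centric tokenizers impose on Romanian text.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["romanian_corpus.txt"],  # placeholder path
    vocab_size=32_000,              # illustrative size
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>"],
)

os.makedirs("ro-tokenizer", exist_ok=True)
tokenizer.save_model("ro-tokenizer")  # writes vocab.json and merges.txt
print(tokenizer.encode("Vulpea șireată a traversat râul înghețat.").tokens)
```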
Featured Publication

Synthetic Data Generation

Our comprehensive survey on generating training data using LLMs — published in IEEE Access.

Synthetic Data Generation Using Large Language Models: Advances in Text and Code

Mihai Nadăș, Laura Dioșan, Andreea Tomescu — IEEE Access, 2025

How enterprises can generate training data at scale — reducing annotation costs, addressing data scarcity, and enabling fine-tuning without exposing sensitive data.

19
Pages
64
References
3-26%
Performance gains
Innovation Focus

Where we push boundaries

Our research translates directly into practical capabilities for clients and portfolio companies.

Synthetic Data Generation

Generate training data at scale without exposing sensitive data. Our comprehensive IEEE Access survey covers techniques from prompt engineering to reinforcement learning, reporting performance gains of 3-26% in low-data scenarios; a minimal generation sketch follows below.

Training data · Data augmentation · Low-resource domains
Applies to: Healthcare · Finance · Legal
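
Here is a minimal sketch of the prompt-engineering end of that spectrum, assuming a generic call_llm callable as a stand-in for whatever chat-completion client you use; the sentiment task, labels, and JSON schema are illustrative, not taken from the survey.

```python
import json
import random

LABELS = ["positive", "negative", "neutral"]  # illustrative task

def build_prompt(label: str, seed_example: str) -> str:
    """Ask for a new labeled example, conditioned on a seed example
    to push the model toward diverse outputs."""
    return (
        f"Write one customer-review sentence with {label} sentiment, "
        f'different in topic and wording from: "{seed_example}".\n'
        'Answer with JSON only, e.g. {"text": "...", "label": "..."}'
    )

def generate_dataset(call_llm, seeds, n=1000):
    """call_llm: callable(str) -> str; swap in any LLM API."""
    rows = []
    while len(rows) < n:
        prompt = build_prompt(random.choice(LABELS), random.choice(seeds))
        try:
            row = json.loads(call_llm(prompt))
            if row.get("label") in LABELS:  # keep only well-formed rows
                rows.append(row)
        except json.JSONDecodeError:
            continue  # discard malformed generations rather than repairing them
    return rows
```

Filtering rather than repairing malformed outputs keeps the pipeline simple, and discarded generations are usually cheap relative to overall generation cost.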

Efficient AI Systems

Models that run on consumer hardware at a fraction of the cost. Techniques include quantization, pruning, and knowledge distillation; a distillation-loss sketch follows below.

Edge deployment · Cost reduction · On-device AI
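
Knowledge distillation in particular reduces to a compact training objective. Below is the standard Hinton-style loss in PyTorch; the temperature and mixing weight are illustrative defaults, not values from our training runs.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend soft teacher targets with the usual hard-label loss."""
    # KL divergence between temperature-softened distributions; the T*T
    # factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy shapes: batch of 4, vocabulary of 10 classes.
loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10),
                         torch.randint(0, 10, (4,)))
```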

Multilingual NLP

Specialized expertise in multilingual NLP, including low-resource languages: tokenization penalties, underrepresented training data, and tasks such as diacritic restoration. A sketch for measuring the tokenization penalty follows below.

Low-resource NLP · Multilingual models · Diacritic restoration
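
The tokenization penalty is easy to measure: count how many tokens the same content costs per language under a given tokenizer. The model name and sentences below are illustrative.

```python
# pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any pretrained tokenizer

sentences = {
    "English":  "The sly fox crossed the frozen river.",
    "Romanian": "Vulpea șireată a traversat râul înghețat.",
}

# Fertility = tokens per word; higher values mean a language pays more
# context length and compute for the same content.
for lang, text in sentences.items():
    n_tokens = len(tokenizer.encode(text))
    n_words = len(text.split())
    print(f"{lang}: {n_tokens} tokens / {n_words} words "
          f"= {n_tokens / n_words:.2f} tokens/word")
```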

Educational AI

Value-aligned content generation for educational applications. Child-safe AI systems with explicit moral reasoning and age-appropriate outputs.

EdTech · Content generation · Value alignment
Applies to: Public Sector

Partner on R&D

From synthetic data generation to domain-specific model development — let's explore what's possible together.

