February 2025
5 min read

Multilingual Sentiment Analysis using LLMs

Fine-tuning a Llama 3.1 8B model to classify sentiment across 11 Indian languages
  • PyTorch
  • Python
  • Unsloth
  • Llama
  • LLM
  • NumPy

Executive Summary

Fine-tuned Meta's Llama 3.1 8B model for a Kaggle competition focused on high-accuracy sentiment analysis across 11 Indian languages. By combining QLoRA with the Unsloth optimization library, training was kept within the memory budget of a single Tesla T4 GPU. The final model achieved an F1-score of 0.974, securing 11th place out of 223 teams (Top 5%), a 76.9% uplift over the base model's zero-shot performance that demonstrates the value of model specialization and prompt engineering for low-resource languages.

Problem Statement

Sentiment analysis is a cornerstone of modern NLP, but most commercial solutions are heavily biased towards high-resource languages like English. For businesses operating in India, understanding customer feedback across languages like Tamil, Bengali, and Kannada is critical but technologically challenging. This project aimed to build a single, robust model capable of:

  1. Performing accurate binary sentiment classification (Positive/Negative) on text from 11 different Indian languages.
  2. Handling a balanced dataset of 1,000 training samples and 100 test samples.
  3. Producing a reliable, machine-readable JSON output for seamless integration into production systems.

Tech Stack

  • Programming & Core Libraries: Python, PyTorch, Pandas
  • LLM & NLP: Meta Llama 3.1 8B Instruct, Hugging Face transformers
  • Fine-Tuning & Optimization:
    • Unsloth: Accelerated training by ~2x and reduced memory usage by 50%, overcoming critical GPU memory bottlenecks.
    • PEFT (Parameter-Efficient Fine-Tuning): Utilized the QLoRA technique for efficient model adaptation.
    • bitsandbytes: Enabled 4-bit quantization (NF4) of the base model.
    • TRL (Transformer Reinforcement Learning): Employed the SFTTrainer for supervised fine-tuning.
  • MLOps & Experimentation: Weights & Biases (WandB) for tracking and comparing 29 distinct training runs.
  • Platform: Kaggle Notebooks with a Tesla T4 GPU.

Solution Architecture & Methodology

1. Data Preparation & Advanced Prompt Engineering: The dataset comprised 1,000 training and 100 test examples, balanced across 11 languages. To ensure the model understood its task precisely, a few-shot prompt template was engineered using Llama 3.1-specific chat tokens (<|start_header_id|> etc.). This was a critical step in solving the initial challenge of the model producing conversational filler instead of pure JSON. The final prompt structure guided the model's reasoning and output format with near-perfect consistency.
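
The exact template is not reproduced in this write-up; the minimal Python sketch below shows the general shape of such a prompt. The system instruction, the single Hindi example, and the "sentiment" field name are illustrative assumptions, not the competition template.

    # Illustrative few-shot prompt builder using Llama 3.1 header tokens.
    # The system text, the example, and the JSON field name are assumptions
    # for demonstration, not the exact template used in the competition.
    SYSTEM = (
        "You are a sentiment classifier for Indian-language text. "
        'Reply with JSON only, e.g. {"sentiment": "Positive"}.'
    )

    FEW_SHOT = [
        ("यह फिल्म शानदार थी!", '{"sentiment": "Positive"}'),  # placeholder example
    ]

    def build_prompt(review: str) -> str:
        parts = ["<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
                 + SYSTEM + "<|eot_id|>"]
        for text, answer in FEW_SHOT:
            parts.append("<|start_header_id|>user<|end_header_id|>\n\n" + text + "<|eot_id|>")
            parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n" + answer + "<|eot_id|>")
        parts.append("<|start_header_id|>user<|end_header_id|>\n\n" + review + "<|eot_id|>")
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
        return "".join(parts)

Ending the prompt at the assistant header nudges the model to reply with the JSON object alone, mirroring the few-shot answers.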

2. Efficient Fine-Tuning and Systematic Optimization: Training an 8B parameter model on a 15GB T4 GPU is non-trivial. This was achieved via a two-pronged optimization strategy:

  • QLoRA (Quantized Low-Rank Adaptation): Instead of training all 8 billion parameters, the base model was “frozen” in a 4-bit format. Small, trainable LoRA adapter layers (~21M parameters) were inserted into all key modules (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj). Training only these adapters made the process computationally feasible.
  • Unsloth Optimization: The Unsloth library was essential for overcoming memory errors encountered in early experiments. Its custom Triton kernels re-engineered the model's core layers, resulting in a 2x training speedup and 50% VRAM reduction, which enabled stable training within Kaggle's resource limits. A minimal configuration sketch for this setup follows the list.
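
The sketch below assumes Unsloth's FastLanguageModel API and its prequantized 4-bit Llama 3.1 checkpoint; the rank, alpha, and sequence length are placeholders, not the tuned values.

    from unsloth import FastLanguageModel

    # Load Llama 3.1 8B Instruct as a prequantized 4-bit (NF4) checkpoint.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
        max_seq_length=2048,   # placeholder; set to cover the longest prompt
        load_in_4bit=True,
    )

    # Attach trainable LoRA adapters to the attention and MLP projections;
    # r=16 and alpha=16 are placeholders, not the final tuned settings.
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        lora_alpha=16,
        lora_dropout=0.0,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        use_gradient_checkpointing="unsloth",  # further reduces VRAM on the T4
    )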

3. Rigorous Experiment Tracking with WandB: The final high-performing model was not an accident. It was the result of a meticulous hyperparameter search involving 29 distinct experiments, all logged and visualized using Weights & Biases. This systematic approach allowed for the data-driven optimization of:

  • LoRA Rank (r): Tested values from 8 to 32.
  • Learning Rate and Scheduler (cosine vs. linear).
  • Batch Size and Gradient Accumulation.

This systematic process was key to identifying the configuration that yielded the final top-5% result; a minimal training sketch with WandB logging follows below.
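
The sketch uses TRL's SFTTrainer in the form typically seen in Unsloth notebooks, and assumes model, tokenizer, and a train_ds dataset whose "text" column holds prompts built as above; the hyperparameter values are placeholders rather than the configuration the sweep converged on.

    from transformers import TrainingArguments
    from trl import SFTTrainer

    trainer = SFTTrainer(
        model=model,                     # QLoRA model from the previous sketch
        tokenizer=tokenizer,
        train_dataset=train_ds,          # "text" column holds the formatted prompts
        dataset_text_field="text",
        max_seq_length=2048,
        args=TrainingArguments(
            output_dir="outputs",
            per_device_train_batch_size=2,   # placeholder; swept during experiments
            gradient_accumulation_steps=4,   # placeholder; swept during experiments
            learning_rate=2e-4,              # placeholder; swept across runs
            lr_scheduler_type="cosine",      # compared against "linear"
            num_train_epochs=3,
            fp16=True,                       # the T4 has no bf16 support
            logging_steps=10,
            report_to="wandb",               # each run logged to Weights & Biases
            run_name="llama31-qlora-r16-cosine",  # hypothetical run name
        ),
    )
    trainer.train()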

WandB Dashboard for this project

Results & Impact

  • Competition Rank: 11th place out of 223 teams (Top 5%).
  • Quantitative Performance: Achieved a final F1-score of 0.9743 on the private test set.
  • Performance Uplift: The fine-tuned model showed a 76.9% improvement over the base Llama 3.1 model’s zero-shot F1-score of 0.5507, proving the immense value of the fine-tuning process.
  • Qualitative Analysis: The model correctly identified sentiment even in cases with subtle linguistic cues, transliterated words, and complex sentence structures, showcasing a deep contextual understanding.
  • Production Readiness: The model's consistent JSON output makes it suitable for deployment via an API endpoint in a real-world application; a minimal inference sketch follows this list.
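
As an illustration of that integration path, the sketch below assumes Unsloth's inference helper and the hypothetical build_prompt function from the earlier snippet; it is not the project's actual serving code.

    import json

    from unsloth import FastLanguageModel

    FastLanguageModel.for_inference(model)   # switch to Unsloth's faster inference mode

    def classify(review: str) -> dict:
        prompt = build_prompt(review)        # few-shot prompt from the earlier sketch
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
        # Decode only the newly generated tokens and parse the JSON reply.
        reply = tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        return json.loads(reply)             # e.g. {"sentiment": "Positive"}

Parsing the reply with json.loads turns any malformed output into an explicit error rather than letting free-form text leak downstream.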

Multilingual Sentiment Analysis Engine | Python, PyTorch, LLMs, Hugging Face, WandB

  • Engineered a sentiment analysis model by fine-tuning Llama 3.1 (8B), achieving a Top 5% rank (11/223) in a Kaggle competition for 11 Indian languages.
  • Boosted the F1-score from a baseline of 0.55 to 0.974 (a 76.9% improvement) by designing a robust few-shot prompt template and systematically tuning hyperparameters over 29 experiments tracked with WandB.
  • Implemented Parameter-Efficient Fine-Tuning (PEFT) with QLoRA and the Unsloth library to successfully train the 8B model on a single 15GB GPU, reducing memory usage by 50% and increasing training speed by 2x.
  • Enforced strict JSON output through specialized prompt engineering, ensuring the model is production-ready for API integration.