Lifeweb language models

Welcome to the Lifeweb Language Models repository. We train Persian language models and release them publicly as our contribution to Persian-language AI. The first versions of our models are all trained on our Divan dataset, which contains more than 164 million documents and more than 10B tokens and has been carefully normalized and deduplicated to keep it clean and comprehensive. A better dataset leads to a better model.
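This README does not describe how Divan was normalized or deduplicated. As a rough illustration of the general idea only, the sketch below applies light text normalization and then drops exact duplicates by hashing. The character map and the function names `normalize` and `deduplicate` are assumptions made for this example, not the pipeline used to build Divan.

```python
import hashlib
import re
import unicodedata

# Hypothetical character map (an assumption, not taken from the Divan pipeline):
# Arabic yeh (U+064A) -> Persian yeh (U+06CC), Arabic kaf (U+0643) -> Persian kaf (U+06A9).
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})


def normalize(text: str) -> str:
    """Apply Unicode NFKC, unify Arabic/Persian letter forms, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(ARABIC_TO_PERSIAN)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(documents):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    unique = []
    for doc in documents:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(norm)
    return unique
```

Exact-match hashing only removes documents that are identical after normalization. Near-duplicate detection (for example MinHash) is a separate technique that this sketch does not cover.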

Use Models

You can access the models through the Hugging Face model hub links in the table below.

Model Name	Base Model	Vocabulary Size	Evaluation
Tehran	RoBERTa	50000	Results
Shiraz	MobileBERT	50000	Results

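from transformers import AutoTokenizer, AutoModelForMaskedLM, FillMaskPipeline

# Load the tokenizer and the masked-language model from the Hugging Face Hub.
model_name = "lifeweb-ai/shiraz"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Persian input with one [MASK] token. Roughly: "At this very moment, while you are
# busy [MASK] this text, millions of data points are being generated online.
# At Lifeweb we collect, process, and analyze this big data."
text = "در همین لحظه که شما مشغول [MASK] این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم."

# Build the fill-mask pipeline and print the highest-scoring prediction.
fill_mask = FillMaskPipeline(model=model, tokenizer=tokenizer)
result = fill_mask(text)
print(result[0])
# The predicted token 'خواندن' means "reading".
#{'score': 0.3584367036819458, 'token': 5764, 'token_str': 'خواندن', 'sequence': 'در همین لحظه که شما مشغول خواندن این متن هستید، میلیون ها دیتا در فضای انلاین در حال تولید است. ما در لایف وب به جمع اوری، پردازش و تحلیل این کلان داده ( big data ) می پردازیم.'}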
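The example above prints only the best candidate. To inspect several candidates at once, a minimal sketch might look like the following. The helper names `top_predictions` and `predict_mask` are hypothetical and introduced only for illustration. `pipeline("fill-mask", ...)` and its `top_k` argument come from the transformers library.

```python
def top_predictions(results, k=3):
    """Return the k highest-scoring (token_str, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 4)) for r in ranked[:k]]


def predict_mask(text, model_name="lifeweb-ai/shiraz", top_k=5):
    """Load the model from the Hugging Face Hub and return its top_k candidates."""
    # Imported lazily so top_predictions can be used without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_name)
    # For an input with a single mask token this returns a list of dicts.
    return fill_mask(text, top_k=top_k)
```

Calling `top_predictions(predict_mask(text))` downloads the model on first use. The mask token may differ between models, so it is safer to check `tokenizer.mask_token` than to assume `[MASK]` works for every checkpoint.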

Results

The Lifeweb models are evaluated on three downstream NLP tasks: NER, sentiment analysis, and emotion detection. Tehran outperforms every other Persian language model we compared against in terms of accuracy and macro F1. Shiraz is considerably faster while its accuracy remains highly competitive: according to the MobileBERT paper, the MobileBERT architecture is 4.3× smaller and 5.5× faster than BERT-base. We therefore claim new state-of-the-art performance: Tehran scores higher than ParsBERT, AriaBERT, and FaBERT on every benchmark below, even though each of those models has itself reported better performance than earlier comparable models.

The table below reports the macro F1 score for each task, along with the Colab notebook for each task, which can be used as a tutorial. All notebooks were run under the same conditions on 4x2080 Ti graphics cards.

Model	NER (Arman)	NER (Peyma)	Sentiment (Sentipers, multi)	Sentiment (Snappfood)	Emotion (Arman)
lifeweb-ai/tehran	71.87%	90.79%	63.75%	88.74%	77.73%
lifeweb-ai/shiraz	67.62%	86.24%	59.17%	88.01%	66.97%
sbunlp/fabert	71.23%	88.53%	58.51%	88.60%	72.65%
ViraIntelligentDataMining/AriaBERT	69.12%	87.15%	59.26%	87.96%	69.11%
HooshvareLab/bert-fa-zwnj-base	67.49%	85.73%	59.61%	87.58%	59.27%
HooshvareLab/roberta-fa-zwnj-base	69.73%	86.21%	56.23%	87.19%	57.96%
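The scores above are macro F1, which averages the per-class F1 scores so that every class counts equally regardless of how often it occurs. The following is a minimal sketch of that computation for a plain classification task. The function name `macro_f1` is chosen for this example, and the evaluation notebooks may compute the metric differently (NER in particular is often scored at entity level).

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels, made up for illustration only.
y_true = ["pos", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neu"]
score = macro_f1(y_true, y_pred)  # per-class F1: pos 2/3, neg 2/3, neu 1 -> mean 7/9
```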

If you have tested our models on a public dataset and would like to add your results to the table above, open a pull request or contact us. Please make your code available online so that we can reference it.

Contributors

Mehrdad Azizi: LinkedIn, GitHub
Reza Salehi Chegeni: LinkedIn, GitHub
Parisa Mousavi: LinkedIn, GitHub
Iman Hashemi: LinkedIn, GitHub

Releases

v1.0 (2024-03-09)

First version of the Tehran and Shiraz models, trained on Divan.

License

By contributing to this project, you agree that your contributions will be licensed under the Apache License 2.0.
