Hebrew Lemmatization / Stemming
A standalone REST API for high-performance Hebrew lemmatization, powered by a resurrected version of the HebMorph library.
Overview
This project provides a simple, reliable, and always-on RESTful service for lemmatizing Hebrew text. It was created to fill a gap in the open-source ecosystem, where existing solutions for Hebrew NLP were found to be either unstable, inaccurate (especially with construct-state cases, or “smichut”), or no longer maintained.
The service exposes two endpoints:
- POST /lemmatize: canonical one-lemma-per-piece output, intended for production.
- POST /lemmatize-raw: raw HebMorph candidates per token, intended for debugging.
API Usage
Endpoint: POST /lemmatize
Accepts a JSON object with a single key, sentences, which is an array of strings.
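For callers who prefer Python to curl, the request can be sketched with the standard library alone. The URL, the Bearer header, and the `sentences` and `results` keys come from the examples in this section; the function names and the timeout value are illustrative assumptions, not part of the service.

```python
import json
import urllib.request

# URL taken from the curl example in this section.
API_URL = "https://teivah.solutions/api/lemmatize"


def build_lemmatize_request(sentences, api_key):
    """Build (but do not send) a POST request for /lemmatize."""
    body = json.dumps({"sentences": list(sentences)}, ensure_ascii=False)
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def lemmatize(sentences, api_key, timeout=10.0):
    """Send the request and return the "results" array (one list per sentence)."""
    request = build_lemmatize_request(sentences, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]
```

Building the request separately from sending it keeps the payload easy to inspect without network access.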
Mixed examples
Request (smichut, plural/singular, adjective gender, punctuation/parentheses, numerals/Latin, units with quotes/gershayim):
curl -X POST "https://teivah.solutions/api/lemmatize" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "sentences": [
      "אבוקדו האס, אורגני",
      "חוות הבופאלו (לא שקיל)",
      "מיץ תפוחים",
      "מיץ תפוח",
      "חמאת בוטנים טבעית",
      "חמאת בוטן טבעית",
      "גרעיני דלעת קלויים אורגניים",
      "גרעין דלעת קלוי אורגני",
      "עוגיות שוקולד אורגניות",
      "עוגיית שוקולד אורגנית",
      "1 ק\"ג תמרים",
      "1 ק״ג תמרים",
      "200 ג\"רם",
      "500 גרם Nutrazen אורגני"
    ]
  }'
Response (real output):
{
  "results": [
    ["אבוקדו","האס","אורגני"],
    ["חווה","הבופאלו","לא","שקיל"],
    ["מיץ","תפוח"],
    ["מיץ","תפוח"],
    ["חמאה","בוטן","טבעי"],
    ["חמאה","בוטן","טבעי"],
    ["גרעיני","דלעת","קלוי","אורגני"],
    ["גרעין","דלעת","קלוי","אורגני"],
    ["עוגייה","שוקולד","אורגני"],
    ["עוגייה","שוקולד","אורגני"],
    ["1","קג","תמר"],
    ["1","קג","תמר"],
    ["200","גרם"],
    ["500","גרם","Nutrazen","אורגני"]
  ]
}
Endpoint: POST /lemmatize-raw
Accepts a JSON object with a single key, sentence, which is one string. Returns the raw HebMorph candidates per token (no filtering/sorting by the service), useful for debugging and analysis.
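Because the service does not filter the raw candidates, the same analysis can appear more than once for a token, as in the response excerpt for this endpoint. When inspecting output by hand, a small helper along these lines may make it easier to read. It is only a sketch: `unique_candidates` is a hypothetical name, and treating (lemma, mask) as the identity of a candidate is an assumption.

```python
def unique_candidates(token_candidates):
    """Collapse repeated (lemma, mask) pairs, keeping first-seen order."""
    seen = set()
    unique = []
    for candidate in token_candidates:
        key = (candidate["lemma"], candidate["mask"])
        if key not in seen:
            seen.add(key)
            unique.append(candidate)
    return unique


# Candidates for the last token, copied from the response excerpt.
sample = [
    {"lemma": "אורגן", "score": 1.0, "mask": "D_NOUN", "prefixLength": 0},
    {"lemma": "אורגן", "score": 1.0, "mask": "D_NOUN", "prefixLength": 0},
    {"lemma": "אורגני", "score": 1.0, "mask": "D_ADJ", "prefixLength": 0},
    {"lemma": "אורגני", "score": 1.0, "mask": "D_ADJ", "prefixLength": 0},
]
for candidate in unique_candidates(sample):
    print(candidate["lemma"], candidate["mask"])
```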
Request:
curl -X POST "https://teivah.solutions/api/lemmatize-raw" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "sentence": "אבוקדו האס, אורגני" }'
Response (real output excerpt):
{
  "results": [
    [
      { "lemma": "אבוקדו", "score": 1.0, "mask": "D_NOUN", "prefixLength": 0 },
      { "lemma": "אבוקדו", "score": 1.0, "mask": "D_NOUN", "prefixLength": 0 }
    ],
    [],
    [
      { "lemma": "אורגן", "score": 1.0, "mask": "D_NOUN", "prefixLength": 0 },
      { "lemma": "אורגן", "score": 1.0, "mask": "D_NOUN", "prefixLength": 0 },
      { "lemma": "אורגני", "score": 1.0, "mask": "D_ADJ", "prefixLength": 0 },
      { "lemma": "אורגני", "score": 1.0, "mask": "D_ADJ", "prefixLength": 0 }
    ]
  ]
}
Canonical Post-processing Logic (used by /lemmatize)
- Split input sentences by whitespace.
- Strip leading/trailing non-letter/digit characters from each piece (removes punctuation and parentheses at edges).
- For each cleaned piece, select exactly one canonical lemma from the HebMorph candidates using the following policy:
  - Sort by score, descending.
  - Break ties by part of speech:
    - If the piece ends with “י” or “ית”: prefer ADJ > NOUN > VERB.
    - If the piece ends with “ה”: prefer NOUN > ADJ > VERB.
    - Otherwise: ADJ > NOUN > VERB.
  - Next, prefer a lemma identical to the cleaned surface piece.
  - Finally, prefer the shortest lemma.
- Drop lemmas with length == 1.
- Numerals and Latin tokens pass through unchanged.
- Do not emit punctuation tokens.
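The policy above can be sketched in Python as follows. This is an illustration of the documented rules, not the service's actual code: the `D_VERB` mask name, filtering one-character lemmas before ranking, and falling back to the surface form when no candidate remains are all assumptions (the last one is inferred from the examples, where a token with no raw candidates still appears in the /lemmatize output). The `analyze` argument is a hypothetical stand-in for the HebMorph lookup.

```python
# D_NOUN and D_ADJ appear in the /lemmatize-raw excerpt; D_VERB is an assumed name.
ADJ_FIRST = ("D_ADJ", "D_NOUN", "D_VERB")
NOUN_FIRST = ("D_NOUN", "D_ADJ", "D_VERB")


def clean_piece(piece):
    """Strip leading/trailing characters that are neither letters nor digits."""
    start, end = 0, len(piece)
    while start < end and not piece[start].isalnum():
        start += 1
    while end > start and not piece[end - 1].isalnum():
        end -= 1
    return piece[start:end]


def select_lemma(piece, candidates):
    """Pick one lemma for a cleaned piece from HebMorph-style candidates."""
    if piece.isascii():
        return piece  # numerals and Latin tokens pass through unchanged
    # Assumption: one-character lemmas are dropped before ranking.
    usable = [c for c in candidates if len(c["lemma"]) != 1]
    if not usable:
        return piece  # assumption: fall back to the surface form
    # The policy lists ADJ-first both for the "י"/"ית" endings and as the default.
    order = NOUN_FIRST if piece.endswith("ה") else ADJ_FIRST

    def rank(c):
        pos = order.index(c["mask"]) if c["mask"] in order else len(order)
        # Score descending, then POS preference, then exact match, then length.
        return (-c["score"], pos, c["lemma"] != piece, len(c["lemma"]))

    return min(usable, key=rank)["lemma"]


def lemmatize_sentence(sentence, analyze):
    """Split on whitespace, clean each piece, and emit one lemma per piece."""
    lemmas = []
    for piece in sentence.split():
        cleaned = clean_piece(piece)
        if cleaned:  # pure punctuation cleans to "" and is not emitted
            lemmas.append(select_lemma(cleaned, analyze(cleaned)))
    return lemmas
```

Encoding the tie-breaks as a single tuple key keeps the order of the rules visible in one place: Python compares tuples element by element, which matches "sort by score, then break ties".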
Features
- Deterministic & Fast: Provides production-stable lemmatization optimized for consistency and speed.
- High Performance: Built on Javalin and deployed on Google Cloud Run for fast, auto-scaling responses.
- Stable & Secure: Routed through Zuplo to ensure high availability and prevent abuse.
- Simple REST API: Easy to integrate into any application stack.
- Batch Processing: Lemmatize multiple sentences in a single API call for efficiency.
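To illustrate batch processing from the client side, the sketch below splits a large list of sentences into batches and stitches the results back together in input order. The source does not document a maximum batch size, so the default of 100 is an arbitrary assumption; `send_batch` is a hypothetical caller-supplied function that performs one POST /lemmatize call and returns its `results` array.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def lemmatize_all(sentences, send_batch, batch_size=100):
    """Lemmatize many sentences using one send_batch call per batch."""
    results = []
    for batch in chunked(list(sentences), batch_size):
        batch_results = send_batch(batch)
        # The documented examples return one result list per input sentence.
        if len(batch_results) != len(batch):
            raise ValueError("result count does not match sentence count")
        results.extend(batch_results)
    return results
```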
Acknowledgements
This project would not be possible without the original work done by Shay Synhershko and the other contributors to the HebMorph project. We have gratefully resurrected their powerful library to make it accessible as a public API.