Ancient Textual Restoration Using Deep Neural Networks

Master's thesis

Researcher: علي عباس علي ابو العوب

Supervisors: بهيجة خضير شكر ; اسيا مهدي ناصر الزبيدي

College: College of Computer Science and Information Technology

Specialization: Computer Science

Year of publication: 2024

Ancient Textual Restoration Using Deep Neural Networks

Abstract

Ancient texts are important because they connect us with ancient civilizations, through which we gain cultural, religious and scientific knowledge. Ancient texts, whether on papyrus, parchment, or other substrates, are often fragmented, degraded, or partially erased due to the passage of time. Restoration of these texts presents a significant challenge to historians and scholars, requiring meticulous manual effort and expertise.
Ancient text restoration is a specialized branch of text restoration that focuses on recovering and preserving textual content from historical or ancient documents.
Traditional restoration methods rely heavily on manual intervention by experts, which is time-consuming and often subjective. In recent years, the application of machine learning (ML) and artificial intelligence (AI) techniques has shown promise in automating and enhancing the restoration process.
Deep learning techniques have shown remarkable success in various domains, including image processing and natural language processing. This thesis proposes several models for restoring ancient texts using deep neural networks.
Two datasets were used to train and test the models. The first is the “Codex Sinaiticus”, a manuscript dating back to the fourth century; it is a significant artifact because it provides the earliest extant complete copy of the New Testament in the Christian Bible. The handwritten text is in Greek.
The second is “Argonautica 3”, which refers to an epic poem written by the ancient Greek poet Apollonius of Rhodes in the 3rd century BCE; it is also in Greek.
The data were preprocessed in several steps. The text was first encoded, and new lines, numbers, symbols, and special characters were removed. The resulting text was then tokenized, missing characters were generated and their class labels obtained, augmentation was applied to strengthen the dataset, and the sequences were normalized to equal length.
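The preprocessing steps described above might be sketched in Python as follows. This is a minimal illustration and not the thesis's actual pipeline: the function names, the `_` mask symbol, the `#` padding symbol, the window size of 8, and the choice to hide each letter in turn as a stand-in for augmentation are all assumptions. It also assumes unaccented or precomposed Greek letters, since separate combining diacritics would be dropped by the cleaning step.

```python
MASK = "_"  # hypothetical placeholder marking the missing character
PAD = "#"   # hypothetical padding symbol

def clean(text):
    """Remove new lines, digits, symbols and special characters."""
    # Keep letters (including Greek) and whitespace; drop everything else.
    kept = "".join(ch for ch in text if ch.isalpha() or ch.isspace())
    # Collapse line breaks and repeated spaces into single spaces.
    return " ".join(kept.split())

def make_samples(text, window=8):
    """Cut the text into windows and hide each letter in turn.

    Returns (masked_window, missing_character) pairs; the missing
    character is the class label. Hiding every letter position is an
    assumed, simple form of augmentation.
    """
    samples = []
    for start in range(0, len(text), window):
        chunk = text[start:start + window]
        for pos, ch in enumerate(chunk):
            if ch == " ":
                continue  # only letters are hidden
            samples.append((chunk[:pos] + MASK + chunk[pos + 1:], ch))
    return samples

def encode(samples, window=8):
    """Map characters to integer ids and pad every input to equal length."""
    vocab = {PAD: 0}
    for ch in sorted({c for masked, label in samples for c in masked + label}):
        vocab.setdefault(ch, len(vocab))
    inputs = [[vocab[c] for c in masked.ljust(window, PAD)] for masked, _ in samples]
    labels = [vocab[label] for _, label in samples]
    return inputs, labels, vocab

raw = "ΕΝ ΑΡΧΗ 1:1\nΗΝ Ο ΛΟΓΟΣ."
samples = make_samples(clean(raw))
inputs, labels, vocab = encode(samples)
print(samples[0])
print(len(inputs), "samples of length", len(inputs[0]))
```

Each resulting pair of integer sequence and label could then be fed to a sequence model as a classification example, with the label identifying the character to restore.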
Three prediction models were proposed for restoring missing ancient text: Long Short-Term Memory (LSTM), Recurrent Neural Networks (RNN), and Generative Adversarial Networks (GAN). Their testing accuracies were 86%, 92%, and 98.3%, respectively, on the first dataset and 94%, 88%, and 98.7%, respectively, on the second.
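The abstract does not define how testing accuracy was measured. A common reading, assumed here, is the fraction of hidden characters that a model restores correctly on the held-out test set. The function name `character_accuracy` is hypothetical, and the sketch below only illustrates that assumed definition.

```python
def character_accuracy(predicted, actual):
    """Fraction of missing characters restored correctly (assumed metric)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one test sample is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Four of the five predicted characters match the true ones.
score = character_accuracy(list("ΛΟΓΟC"), list("ΛΟΓΟΣ"))
print(f"{score:.1%}")
```

Under this definition, a reported accuracy of 98.3% would mean that roughly 983 of every 1,000 hidden test characters were predicted correctly.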
Of the three models, the GAN achieved the highest accuracy on both datasets, demonstrating its effectiveness for restoring missing text. The proposed system was also compared with other restoration techniques and achieved higher accuracy than they did.
Overall, this work contributes to the interdisciplinary intersection of deep learning and digital humanities, offering a promising solution for the restoration and preservation of ancient textual artifacts.