Trang thông tin luận án của Nghiên cứu sinh Trần Thị Xuân

THÔNG TIN VỀ LUẬN ÁN TIẾN SĨ

Họ và tên nghiên cứu sinh: Trần Thị Xuân

Tên đề tài luận án: “Nâng cao hiệu quả phân tích protein sửa đổi sau dịch mã trên cơ sở kết hợp mô hình học máy và xử lý ngôn ngữ tự nhiên”

Chuyên ngành: Khoa học máy tính Mã số: 9 48 01 01

Cán bộ hướng dẫn khoa học:

1. PGS.TS. Lê Nguyễn Quốc Khánh

2. TS. Nguyễn Văn Núi

Tóm tắt các kết quả mới của luận án:

Mục tiêu nghiên cứu

Mục tiêu chính của luận án là nghiên cứu, đề xuất và phát triển các mô hình tính toán mới nhằm nâng cao hiệu quả dự đoán các vị trí sửa đổi sau dịch mã (Post-translational Modifications – PTMs) trên protein. Đây là một trong những vấn đề trọng tâm của tin sinh học hiện đại, bởi các PTM đóng vai trò quan trọng trong điều hòa chức năng sinh học, ảnh hưởng trực tiếp đến quá trình phát triển tế bào, tín hiệu nội bào và các bệnh lý ở người.

Đối tượng nghiên cứu:

Các protein sửa đổi sau dịch mã (PTM), và tập trung vào ba loại đặc biệt quan trọng là SUMOylation, Succinylation và Ubiquitination: Đây là những PTM phổ biến, có dữ liệu tương đối đầy đủ; Đồng thời vẫn còn nhiều khoảng trống cần được khai thác để nâng cao hiểu biết khoa học. Do dữ liệu cấu trúc bậc cao của protein (bậc 2, 3, 4) trong các ngân hàng hiện nay còn thiếu, không đồng bộ và đòi hỏi dung lượng lưu trữ rất lớn, luận án lựa chọn cấu trúc protein bậc 1 dưới dạng chuỗi FASTA làm đầu vào. Đây là dạng dữ liệu vừa phổ biến, tiết kiệm tài nguyên, vừa phù hợp với các kỹ thuật học máy, học sâu và xử lý ngôn ngữ tự nhiên (NLP) hiện đại.
Các mô hình tính toán dự đoán vị trí PTM dựa trên học máy kết hợp NLP: Kỹ thuật phổ biến để xác định vị trí PTM, có độ chính xác cao hiện nay chính là kỹ thuật khối phổ và giải trình tự. Tuy nhiên, kỹ thuật này có chi phí rất lớn, thời gian thực hiện lâu, và đặc biệt là khó áp dụng với nhiều protein cùng lúc. Chính vì vậy, việc nghiên cứu các mô hình dự đoán PTM dựa trên học máy, kết hợp với NLP là một cách tiếp cận phù hợp bởi nó khai thác được những tiến bộ của học máy và kỹ thuật NLP nhằm giúp rút ngắn thời gian cần thiết và hỗ trợ hiệu quả cho các nhà sinh/y học đưa ra những kết luận nhanh và chính xác. Bên cạnh đó, cách tiếp cận này cho thấy sự phù hợp với nhu cầu và xu hướng phát triển hiện nay của trí tuệ nhân tạo.

Phương pháp nghiên cứu:

Để đạt được mục tiêu nghiên cứu, luận án triển khai song song hai hướng chính, đó là: (1) Nghiên cứu cơ sở lý thuyết để đề xuất mô hình mới và (2) Nghiên cứu thực nghiệm nhằm kiểm chứng hiệu quả các mô hình này.

Về lý thuyết, luận án kế thừa và phát triển các phương pháp hiện đại của khoa học dữ liệu và trí tuệ nhân tạo, bao gồm: Xử lý ngôn ngữ tự nhiên (NLP) để biểu diễn “ngôn ngữ protein” và trích xuất ngữ nghĩa, ngữ cảnh từ chuỗi axit amin; Các kiến trúc học sâu lai (Hybrid Deep Learning) kết hợp CNN và LSTM/Bi-LSTM nhằm tận dụng đồng thời khả năng phát hiện đặc trưng cục bộ và quan hệ tuần tự dài hạn; Học chắt lọc tri thức (Knowledge Distillation) để xây dựng các mô hình gọn nhẹ, thích ứng với dữ liệu hạn chế; và Học máy tổ hợp (Ensemble Learning) nhằm khai thác ưu thế của nhiều bộ phân loại để nâng cao độ tin cậy dự đoán.
Về thực nghiệm, các mô hình được huấn luyện và kiểm định trên dữ liệu PTM thực tế, sau đó so sánh với các phương pháp tiên tiến hiện có. Kết quả đánh giá giúp khẳng định tính khả thi, hiệu quả và ý nghĩa ứng dụng của các mô hình đề xuất trong việc dự đoán vị trí sửa đổi sau dịch mã trên protein.

Ý nghĩa khoa học:

Góp phần khẳng định tính khả thi và hiệu quả của NLP khi áp dụng vào dữ liệu protein.
Cung cấp bằng chứng thực nghiệm về ưu thế của các kiến trúc học sâu lai và học chắt lọc tri thức trong bối cảnh dữ liệu sinh học hạn chế.
Tạo tiền đề lý luận và kỹ thuật cho các nghiên cứu mở rộng sang các loại PTM khác.

Về mặt thực tiễn:

Luận án đã cung cấp các mô hình và công cụ dự đoán PTM có độ chính xác cao, thời gian huấn luyện và suy luận hợp lý, có thể áp dụng trong các nghiên cứu sinh học phân tử, y sinh học và dược học. Việc phát hiện nhanh và chính xác các vị trí PTM trên protein không chỉ hỗ trợ hiểu biết cơ chế phân tử của bệnh mà còn góp phần thúc đẩy phát triển thuốc và liệu pháp điều trị.

Luận án đã đạt được các kết quả mới sau:

Đề xuất cơ chế biểu diễn protein bằng NLP thay thế đặc trưng thủ công.
Đề xuất mô hình RSX_SUMO dựa trên học máy tổ hợp để dự đoán SUMOylation.
Đề xuất hai mô hình học sâu lai có độ chính xác và hiệu quả cao: Mô hình CLW_SUMO cho dự đoán SUMOylation và mô hình CBiLSuccSite cho dự đoán Succinylation.
Đề xuất mô hình học sâu dựa trên chắt lọc tri thức và kỹ thuật xử lý ngôn ngữ tự nhiên (có tên gọi là KD_ArapUbi) để dự đoán hiệu quả Ubiquitination.
Công khai dữ liệu và mã nguồn, tạo điều kiện thuận lợi để hỗ trợ cho cộng đồng nghiên cứu.

Những kết quả này không chỉ mang ý nghĩa học thuật mà còn có giá trị thực tiễn, góp phần nâng cao năng lực phân tích PTM trong bối cảnh khoa học dữ liệu và sinh học hiện đại, đồng thời mở ra nhiều hướng nghiên cứu tiềm năng tiếp theo (ví dụ như mở rộng sang các PTM khác, tích hợp thông tin cấu trúc bậc cao của protein, và phát triển công cụ triển khai trực tuyến phục vụ cộng đồng khoa học).

Các hướng nghiên cứu tiếp theo:

Từ những kết quả đạt được, luận án định hướng một số hướng nghiên cứu và phát triển tiếp theo như sau:

1. Tiếp tục nghiên cứu các phương pháp khác nhau để cải tiến, nâng cao độ chính xác của mô hình: Kết hợp thêm các kiến trúc học sâu tiên tiến, tối ưu hóa siêu tham số và bổ sung thông tin sinh học để cải thiện hiệu quả dự đoán.

2. Xử lý vấn đề dữ liệu mất cân bằng: Áp dụng các kỹ thuật hiện đại như oversampling, undersampling, focal loss hoặc loss weighting,… nhằm giải quyết sự chênh lệch giữa mẫu dương và âm, nâng cao độ tin cậy của mô hình khi triển khai thực tế.

3. Mở rộng phạm vi bài toán sang các PTM khác: Phát triển mô hình cho các loại PTM quan trọng khác như Methylation, Acetylation, Phosphorylation… để hướng tới một hệ thống dự đoán PTM toàn diện.

4. Tích hợp thông tin của các cấu trúc protein bậc cao: Khai thác dữ liệu cấu trúc bậc hai, ba, bốn của protein để tăng độ chính xác, phản ánh đúng đặc tính không gian của PTM.

5. Phát triển phần mềm và công cụ ứng dụng: Xây dựng các công cụ thân thiện, có thể triển khai trực tuyến, nhằm đưa các mô hình dự đoán PTM phục vụ trực tiếp cộng đồng nghiên cứu trong sinh học phân tử và y sinh học.

INFORMATION ON DOCTORAL THESIS

Full name: Tran Thi Xuan

Official thesis title: “Enhancing the Effectiveness of Protein Post-Translational Modification Analysis Based on the Integration of Machine Learning Models and Natural Language Processing”

Major: Computer Science; Code: 9 48 01 01

Supervisors:

1. Assoc. Prof. Dr. Le Nguyen Quoc Khanh

2. Dr. Nguyen Van Nui

Summary of the new findings of the thesis

Research objectives:

The main objective of this dissertation is to research, propose, and develop novel computational models to enhance the prediction efficiency of post-translational modifications (PTMs) in proteins. This is one of the central issues in modern bioinformatics, as PTMs play a crucial role in regulating biological functions, directly affecting cellular development, intracellular signaling, and human diseases.

Research subjects:

1. Protein PTMs, with a focus on three particularly important types: SUMOylation, Succinylation, and Ubiquitination. These PTMs are common and have relatively sufficient data, yet gaps remain that must be addressed for deeper scientific understanding. Because higher-order protein structural data (secondary, tertiary, quaternary) in current databases is incomplete, inconsistent, and very demanding in storage, the dissertation uses primary protein structure in the form of FASTA sequences as input. This format is widely available, resource-efficient, and suitable for modern machine learning, deep learning, and natural language processing (NLP) techniques.

2. Computational models for predicting PTM sites using machine learning combined with NLP: The most common high-accuracy techniques for identifying PTM sites today are mass spectrometry and sequencing. However, these techniques are costly, time-consuming, and difficult to apply to many proteins at once. PTM prediction models based on machine learning combined with NLP are therefore an appropriate approach: they leverage advances in both fields to shorten the required time and to help biological and medical researchers reach rapid and accurate conclusions. This approach is also in line with the current needs and development trends of AI.

Research methodology:

To achieve the research objectives, this dissertation pursues two main directions in parallel: (1) Theoretical study to propose new models, and (2) Experimental study to validate the effectiveness of these models.

1. Theoretically, this dissertation builds on and extends modern methods of data science and AI, including: NLP to represent the “protein language” and extract semantic and contextual information from amino acid sequences; Hybrid deep learning architectures combining CNN and LSTM/Bi-LSTM to leverage both local feature detection and long-term sequential dependencies; Knowledge Distillation to construct lightweight models adapted to limited data; and Ensemble Learning to exploit the advantages of multiple classifiers in order to improve prediction reliability.

2. Experimentally, the proposed models are trained and validated on real PTM data, and then compared with state-of-the-art methods. The evaluation results confirm the feasibility, effectiveness, and practical significance of the proposed models in predicting post-translational modification sites on proteins.
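
To make the “protein language” idea above concrete, the following is a minimal sketch of how a primary sequence might be turned into NLP-style tokens. It is not the dissertation's actual preprocessing: the function names `lysine_windows` and `kmer_tokens`, the flank size, the padding symbol `X`, and the k-mer length are all illustrative assumptions. The only biological fact relied on is that SUMOylation, Succinylation, and Ubiquitination typically occur on lysine (K) residues.

```python
from typing import List, Tuple


def lysine_windows(sequence: str, flank: int = 3, pad: str = "X") -> List[Tuple[int, str]]:
    """Return (1-based position, window) for every lysine (K) in the sequence.

    Each window holds `flank` residues on both sides of the lysine. Positions
    beyond the ends of the protein are filled with the `pad` symbol
    (an assumed convention, not taken from the dissertation).
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Index i in `sequence` corresponds to index i + flank in `padded`,
            # so the slice below is centred on the lysine.
            windows.append((i + 1, padded[i : i + 2 * flank + 1]))
    return windows


def kmer_tokens(window: str, k: int = 3) -> List[str]:
    """Split a window into overlapping k-mers, the 'words' of the protein 'sentence'."""
    return [window[i : i + k] for i in range(len(window) - k + 1)]


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to show the data flow.
    for position, window in lysine_windows("MKTAYK", flank=2):
        print(position, window, kmer_tokens(window))
```

Each candidate site becomes a short “sentence” of k-mer “words”, which could then be fed to an embedding layer or a classifier. Whether the proposed models use k-mers, single residues, or another tokenization is not stated in this summary.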

Scientific significance:

1. Helps confirm the feasibility and effectiveness of NLP when applied to protein data.

2. Provides experimental evidence of the advantages of hybrid deep learning architectures and knowledge distillation in the context of limited biological data.

3. Establishes theoretical and technical foundations for extending the research to other types of PTMs.

Practical significance:

The dissertation provides PTM prediction models and tools with high accuracy and reasonable training and inference times, which can be applied in molecular biology, biomedical, and pharmaceutical research. The rapid and accurate identification of PTM sites on proteins not only supports the understanding of molecular mechanisms of diseases but also contributes to drug development and therapeutic advancements.

The dissertation has achieved the following new results:

1. Proposed a mechanism for protein representation using NLP instead of handcrafted features.

2. Proposed the RSX_SUMO model based on ensemble learning for the prediction of SUMOylation.

3. Proposed two highly accurate and efficient hybrid deep learning models: the CLW_SUMO model for the prediction of SUMOylation and the CBiLSuccSite model for the prediction of Succinylation.

4. Proposed a deep learning model based on Knowledge Distillation and NLP techniques (named KD_ArapUbi) for effectively predicting Ubiquitination.

5. Released datasets and source code to facilitate and support the research community.

These results are not only of academic significance but also of practical value, contributing to the enhancement of PTM analysis capabilities in the context of modern data science and biology, while also opening up many potential future research directions (e.g., extending to other PTMs, integrating higher-order structural information of proteins, and developing online deployment tools to serve the scientific community).
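
As background for the Knowledge Distillation result listed above, the sketch below shows the widely used soft-target distillation loss for a single example: a weighted sum of the usual cross-entropy on the true label and a temperature-scaled KL divergence between teacher and student outputs. This is a generic formulation, not the loss actually used in KD_ArapUbi, which this summary does not specify. The function names and the values of `temperature` and `alpha` are assumptions chosen for illustration.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the maximum for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    true_label: int,
    temperature: float = 2.0,  # assumed value
    alpha: float = 0.5,        # assumed weight of the hard-label term
) -> float:
    """Soft-target distillation loss for one example."""
    # Hard term: ordinary cross-entropy of the student against the true label.
    hard = -math.log(softmax(student_logits)[true_label])

    # Soft term: KL(teacher || student) on temperature-softened distributions.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft = sum(
        p * math.log(p / q) for p, q in zip(teacher_soft, student_soft) if p > 0
    )

    # The T^2 factor keeps the soft term's gradient scale comparable across temperatures.
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the weighted hard-label term remains. Any disagreement with the teacher adds a positive penalty.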

Further research directions:

Building upon the results achieved, the dissertation outlines several potential directions for further research and development as follows:

1. Continue investigating different methods to improve and enhance model accuracy: Integrating advanced deep learning architectures, optimizing hyperparameters, and incorporating biological information to improve prediction performance.

2. Address the issue of imbalanced data: Applying modern techniques such as oversampling, undersampling, focal loss, or loss weighting to resolve the disparity between positive and negative samples, thereby increasing the model’s reliability in real-world deployment.

3. Expand the scope of the problem to other PTMs: Developing models for other important PTM types such as Methylation, Acetylation, and Phosphorylation, aiming toward a comprehensive PTM prediction system.

4. Integrate information of higher-order protein structures: Leveraging secondary, tertiary, and quaternary protein structure data to improve accuracy and better reflect the spatial characteristics of PTMs.

5. Develop software and application tools: Building user-friendly tools that can be deployed online, enabling PTM prediction models to directly serve the molecular biology and biomedical research community.
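
Two of the imbalance-handling techniques named in direction 2 above, focal loss and loss weighting, can be sketched in a few lines. This is an illustrative example for binary site prediction (modified versus unmodified), not code from the dissertation. The function names and the default values of `gamma` and `alpha` are placeholders that would need tuning on real PTM data.

```python
import math
from typing import Dict, List


def focal_loss(p_positive: float, label: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for one example.

    p_positive: predicted probability of the positive class (modified site).
    label:      1 for a positive example, 0 for a negative one.
    gamma:      focusing parameter; larger values down-weight easy examples more.
    alpha:      weight of the positive class; negatives receive 1 - alpha.
    """
    p_t = p_positive if label == 1 else 1.0 - p_positive
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def inverse_frequency_weights(labels: List[int]) -> Dict[int, float]:
    """Class weights proportional to inverse class frequency (simple loss weighting)."""
    counts = {c: labels.count(c) for c in set(labels)}
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}
```

With `gamma=0` and `alpha=0.5` the focal loss reduces to half the ordinary binary cross-entropy, which is a convenient sanity check. `inverse_frequency_weights` gives the rarer class a proportionally larger weight, which can be passed to a weighted loss during training.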

Source: Trường Đại học Công nghệ thông tin và Truyền thông (University of Information and Communication Technology), Đại học Thái Nguyên (Thai Nguyen University).
