What We Do

Machine Learning

ArcTEX is our sophisticated NLP model designed to cater to a diverse array of clinical reports. Our innovative solution empowers real-world evidence studies by seamlessly automating the extraction of biomarkers and other disease-specific data at scale.

About ArcTEX

At Arcturis, our team of researchers have pioneered the development of the ArcTEX (Arcturis Text Enrichment and Extraction) model. ArcTEX is a flexible Natural Language Processing (NLP) framework, engineered to systematically extract biomarker and disease-specific data from unstructured clinical reports. ArcTEX stands out for its versatility and can easily be fine-tuned to meet diverse project needs.

ArcTEX Overview

Numerous studies have demonstrated the vast reservoir of clinical insights hidden within unstructured textual data, such as clinical letters or pathology reports (1). Unfortunately, this also means that a large volume of data is unavailable for direct analysis. ArcTEX (Arcturis Text Enrichment and Extraction) is designed to address this gap: the model has been optimised to underpin high-quality real-world evidence (RWE) initiatives, ensuring robust and reliable outcomes across multiple disease areas.

Many leading pharmaceutical companies enhance their clinical development and post-market launch strategies through the integration of real-world data. These strategies can encompass a spectrum of methodologies, ranging from retrospective cohort studies to the optimisation of patient selection criteria or the incorporation of external control arms. However, a significant hurdle is that a substantial portion of the vital healthcare data required for these initiatives resides in unstructured text, which prevents direct analysis.

Consider, for instance, critical biomarker statuses such as human epidermal growth factor receptor-2 (HER2) or oestrogen and progesterone receptors (ER or PR). These biomarkers have a profound influence on the treatment trajectory for breast cancer patients. This crucial information is often embedded within unstructured pathology reports, which vary in style and content across hospital sites and pathologists. The same applies to details such as ‘response to treatment’ or ‘disease progression’, which further illustrate how broadly unstructured data can inform treatment pathways.

Unlocking these insights presents a formidable yet essential challenge in advancing pharmaceutical research and patient care. To meet the evolving needs of these communities, Arcturis are proud to introduce ArcTEX, a sophisticated NLP model designed to cater to a diverse array of clinical reports. Our innovative solution empowers real-world evidence studies by automating the extraction of biomarkers and other disease-specific data at scale. ArcTEX stands as a testament to our commitment to advancing healthcare research through cutting-edge technology, ensuring that high-quality data is at the forefront of evidence-based decision making.

 

Approach and Results

Our approach to building the model is based on recent developments in machine learning and natural language processing, utilising transformer-based language models. In contrast to other large language models (LLMs), ArcTEX does not suffer from hallucinations and provides a confidence score for each extracted value. The model is also optimised on UK-specific data for a range of biomarkers. Further optimisation is possible through our iterative optimisation and validation framework, which can also provide insights into the robustness of ArcTEX.
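As a rough illustration of how a model can return a confidence score alongside each extracted value, the sketch below assumes a classifier that scores a fixed set of candidate labels and reports the softmax probability of the winning label. This is an assumption made for illustration only: the label set, the scores and the function names are hypothetical, and the internal design of ArcTEX is not described here. A model that chooses from a fixed label set cannot emit a value outside that set, which is one way free-text hallucination can be avoided.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits, labels):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical label set and raw scores for a single report (illustrative only).
LABELS = ["positive", "negative", "not reported"]
label, confidence = label_with_confidence([2.0, 0.5, -1.0], LABELS)
print(label, round(confidence, 3))
```

The probability of the chosen label is only one possible definition of confidence; calibrated scores or ensemble agreement would be equally plausible alternatives.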

As an example, the figure below shows the results of ArcTEX on the detection and extraction of information for p53, a protein associated with tumour suppression. Even without any optimisation, ArcTEX outperforms the baseline models (BERT and BioBERT) with an accuracy of >94% (iteration 0). Performance can be further increased to 98.4% by adding a few training examples at each iteration of the optimisation process, raising overall accuracy above that of human annotation (2). The robustness of the models (shaded areas in the graph) can be determined through random permutation of the data, which increases confidence in the approach and can support regulatory submissions.
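The idea of a robustness band obtained by random permutation can be sketched as follows. This is a minimal sketch under stated assumptions, not the Arcturis framework: the "model" is a toy majority-label predictor, the data are synthetic, and the function names, the step size, the number of runs and the use of one standard deviation as the band are all hypothetical choices. The experiment is repeated on shuffled copies of the data, and the mean accuracy and its spread are summarised for each iteration.

```python
import random
import statistics


def majority_label_curve(examples, step=5, n_iterations=4, n_test=20):
    """Toy stand-in for a model: predict the most common training label.

    The last `n_test` examples are held out for testing; iteration i trains
    on the first (i + 1) * step examples. Returns one accuracy per iteration.
    """
    test = examples[-n_test:]
    curve = []
    for i in range(n_iterations):
        train_labels = [label for _, label in examples[: (i + 1) * step]]
        guess = statistics.mode(train_labels)
        correct = sum(1 for _, label in test if label == guess)
        curve.append(correct / len(test))
    return curve


def permutation_band(examples, n_runs=30, seed=0, curve_fn=majority_label_curve):
    """Repeat the experiment on shuffled data and summarise each iteration."""
    rng = random.Random(seed)
    curves = []
    for _ in range(n_runs):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        curves.append(curve_fn(shuffled))
    band = []
    for accuracies in zip(*curves):  # one tuple of accuracies per iteration
        mean = statistics.mean(accuracies)
        spread = statistics.stdev(accuracies)
        band.append((mean, mean - spread, mean + spread))
    return band


# Synthetic data: 70 "negative" and 30 "positive" reports (illustrative only).
data = [(f"report {i}", "negative" if i % 10 < 7 else "positive") for i in range(100)]
for iteration, (mean, low, high) in enumerate(permutation_band(data)):
    print(f"iteration {iteration}: mean={mean:.3f} band=[{low:.3f}, {high:.3f}]")
```

A narrow band across permutations suggests that the result does not depend on which particular reports happened to be used for training or testing.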

 

Model Performance

Additionally, ArcTEX provides insights into model performance for each evaluated free-text report. Using the confidence scores provided by the model, reports at risk of being misclassified can be identified automatically. These reports can be flagged for either manual review or exclusion from the analysis, which increases overall accuracy. Compared with other baseline methods, ArcTEX can better pinpoint which reports require manual review, as illustrated in the next figure. This significantly reduces the time required for additional manual review.
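The flagging step described above can be sketched as a simple threshold on the confidence score. The example is illustrative only: the `Prediction` record, the threshold of 0.80 and the validation values are hypothetical and do not come from ArcTEX.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    report_id: str
    predicted: str
    confidence: float
    actual: str  # reference label, known only for a validation set


def triage(predictions, threshold):
    """Split predictions into auto-accepted and flagged-for-review lists."""
    accepted = [p for p in predictions if p.confidence >= threshold]
    flagged = [p for p in predictions if p.confidence < threshold]
    return accepted, flagged


def accuracy(predictions):
    """Fraction of predictions that match the reference label."""
    if not predictions:
        return float("nan")
    return sum(p.predicted == p.actual for p in predictions) / len(predictions)


# Made-up validation predictions (illustrative only).
validation = [
    Prediction("r1", "positive", 0.99, "positive"),
    Prediction("r2", "negative", 0.97, "negative"),
    Prediction("r3", "positive", 0.58, "negative"),
    Prediction("r4", "negative", 0.91, "negative"),
    Prediction("r5", "negative", 0.62, "negative"),
]
accepted, flagged = triage(validation, threshold=0.80)
print("overall accuracy:", accuracy(validation))
print("accuracy after flagging:", accuracy(accepted))
print("flagged for review:", [p.report_id for p in flagged])
```

In this toy data, one flagged report (r5) was in fact classified correctly. A stricter threshold therefore improves accuracy on the accepted reports at the cost of more manual review, so the threshold would normally be chosen on a validation set.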

Compared to other LLMs, ArcTEX provides multiple advantages that make it well placed for a range of RWE studies from both a scientific and a practical standpoint:

  • High accuracy: the model is optimised to extract biomarker and disease-specific data from clinical free-text reports, which makes it superior to many generic LLMs. The granular, otherwise unavailable data that can be extracted and analysed with our model is key to generating useful, scientifically robust, country-specific epidemiological data to speed up reimbursement and patient access.
  • High flexibility: depending on project requirements, the optimisation framework can be used to further optimise ArcTEX on other biomarkers within hours, which can be highly beneficial from both a cost and an efficiency perspective.
  • Transparency: ArcTEX provides confidence scores for each evaluated report, allowing quick identification of reports which might require further manual review. Furthermore, the optimisation framework provides insight into the accuracy and variability of the data extraction. This can be used as evidence of the robustness of the methodology to support studies.

In summary, ArcTEX emerges as a pinnacle of innovation in the realm of NLP, offering flexibility and precision in the extraction of biomarkers and supplementary disease information from an expansive spectrum of unstructured clinical text. Thanks to its capacity to quickly evaluate a vast volume of reports with high accuracy, together with its integrated robustness metrics and confidence scores, it is an indispensable asset for high-quality RWE generation.

Opportunities for RWE generation exist throughout the entire development cycle, from informing internal strategy (e.g. early-phase target product profile (TPP) development) all the way to late-stage external control arms, which play a crucial role in regulatory decision making. ArcTEX can be utilised across this entire cycle to unlock the true potential of unstructured data and overcome barriers to analysis, ensuring robust and reliable outcomes for a variety of stakeholders across multiple disease areas.

References

(1) Hyoun-Joong Kong, Managing Unstructured Big Data in Healthcare System, Healthcare Informatics Research, 2019

(2) Marie-Pier Gauthier et al., Automating Access to Real-World Evidence, JTO Clin Res Rep, 2022