The legal industry faces the challenge of efficiently finding relevant information in millions of caselaw documents. In particular, lawyers often struggle to determine the expertise of expert witnesses and to identify the specific topics and issues they can comment on. We have developed a solution that extracts key phrases and then identifies common facts and subjects from a large corpus of expert-associated caselaw documents, such as depositions, opinions, CVs, reports, and jury verdicts.
Our solution pipeline has two major parts: general data pre-processing on AWS EMR using Spark, and last-mile processing using Python on AWS SageMaker. In our proof-of-concept (POC) study, we experimented with Ray as an alternative to AWS SageMaker for the last-mile processing, which involves feature generation, unsupervised learning, and natural language processing (NLP) techniques.
As an example, consider legal opinions. These tell the story of the case: what the case is about, how the court is resolving it, and why. For this use case, we want to discard most of the text and extract only the key phrases that capture what an expert has said on a particular subject matter or issue.
In the first component of our solution pipeline, key phrase extraction, we deployed an algorithm that ranks phrases and removes unwanted ones. We used the spaCy library for NLP-based pre-processing and the Ray Dataset API to speed up processing. We saw a 5x reduction in the processing time needed to rank and filter unwanted phrases.
Using a pre-trained language model, we further removed phrases that did not help end users understand 'what an expert witness is an expert in' or 'what exactly they can comment on'. We achieved a 24x reduction in the processing time for this filtering step using Ray Dataset and ActorPoolStrategy.
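The actor-pool pattern mentioned above can be sketched roughly as follows. This is a minimal illustration, not our production code: `PhraseRelevanceFilter`, `domain_terms`, and `filter_with_ray` are hypothetical names, and a toy keyword score stands in for the pre-trained language model. The key idea is that expensive state is loaded once per actor in `__init__` and then reused for every batch. The `map_batches` arguments shown are assumed to match a Ray 2.x release; they vary between versions (newer releases also accept a `concurrency` argument).

```python
from typing import Dict, List, Sequence


class PhraseRelevanceFilter:
    """Stateful batch scorer; Ray creates one instance per actor in the pool."""

    def __init__(self, threshold: float = 0.5):
        # A real pipeline would load the pre-trained language model here,
        # once per actor. This toy vocabulary is a hypothetical stand-in.
        self.domain_terms = {"accident", "reconstruction", "toxicology"}
        self.threshold = threshold

    def score(self, phrase: str) -> float:
        # Fraction of tokens that are domain terms (assumed toy metric).
        tokens = phrase.lower().split()
        if not tokens:
            return 0.0
        return sum(t in self.domain_terms for t in tokens) / len(tokens)

    def __call__(self, batch: Dict[str, Sequence[str]]) -> Dict[str, List]:
        # Score every phrase in the batch; rows are dropped in a later step.
        phrases = [str(p) for p in batch["phrase"]]
        return {"phrase": phrases, "relevance": [self.score(p) for p in phrases]}


def filter_with_ray(phrases: List[str], num_actors: int = 2,
                    threshold: float = 0.5) -> List[str]:
    import ray  # imported lazily so the scorer can be exercised without Ray

    ds = ray.data.from_items([{"phrase": p} for p in phrases])
    scored = ds.map_batches(
        PhraseRelevanceFilter,  # class, so each actor holds its own instance
        compute=ray.data.ActorPoolStrategy(size=num_actors),
        batch_size=64,
    )
    kept = scored.filter(lambda row: row["relevance"] >= threshold)
    return [row["phrase"] for row in kept.take_all()]
```

Scoring inside the actors and dropping rows in a separate `filter` step keeps the stateful part focused on model inference, which is where reusing a loaded model across batches pays off.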
Finally, using an unsupervised learning algorithm, we grouped the remaining phrases into sets of similar key phrases. To improve quality, we removed low-quality key phrases that added no value and optimized the number of facts and subjects for each expert, achieving an 11x reduction in compute time using Ray.
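To make the idea of grouping similar key phrases concrete, here is a small standard-library sketch. The abstract does not name the unsupervised algorithm we used, so this greedy, token-overlap grouping is only an assumed stand-in; `jaccard`, `group_phrases`, and the `threshold` value are hypothetical.

```python
from typing import List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_phrases(phrases: List[str], threshold: float = 0.5) -> List[List[str]]:
    """Greedy single pass: join the first group whose seed phrase is similar."""
    groups: List[Tuple[Set[str], List[str]]] = []
    for phrase in phrases:
        tokens = set(phrase.lower().split())
        for seed_tokens, members in groups:
            if jaccard(tokens, seed_tokens) >= threshold:
                members.append(phrase)
                break
        else:
            # No existing group is similar enough, so this phrase seeds a new one.
            groups.append((tokens, [phrase]))
    return [members for _, members in groups]


examples = [
    "accident reconstruction",
    "vehicle accident reconstruction",
    "medical malpractice",
    "malpractice medical claims",
]
grouped = group_phrases(examples)
```

With these inputs, the two reconstruction phrases land in one group and the two malpractice phrases in another. A production system would more likely compare phrase embeddings than raw tokens.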
Our POC study with Ray was promising and resulted in a faster and more efficient way of identifying relevant facts and subjects. In the next few months, we plan to scale out the pipeline on a multi-node Ray cluster. We would be happy to share our insights on using Ray in production at the conference in September.
Harshit is a Senior Data Scientist in the Data Science & Engineering team at LexisNexis, where he applies advanced NLP techniques to uncover insights and patterns from millions of caselaw documents to help lawyers win cases and grow their practices. He has extensive technical experience across the ML lifecycle, including text data processing, model development, and evaluation of end-to-end ML systems. Harshit earned his MS in Data Science from the University of San Francisco and a Bachelor of Technology from the Indian Institute of Technology Guwahati.