We have thought about this process over the last few weeks. The roadblock with any AI project is getting clean data in and out, and neither the SEC nor the users filing these legacy applications follow any standardized format. Here is our approach to helping AI learn to make better application submissions. We would create an overall workflow that looks like the following:
The supporting architecture looks like this (not the greatest drawing, sorry):
My concern is that all of this is needed just to build the data pipeline, i.e., to get the documents into a system that can extract information and pass it to an LLM for processing. There is no custom training or tagging at this stage, so the quality of the outcomes is unknown until we actually run it.
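To make the shape of that pipeline concrete, here is a minimal sketch in Python. It assumes the filings are PDFs on local disk, uses pypdf for text extraction, and the OpenAI chat API for processing; the extraction prompt, model choice, and output fields are placeholders we made up for illustration, not decisions we have settled on.

```python
# Minimal pipeline sketch: ingest a filing, extract raw text, and
# pass it to an LLM with an extraction prompt. The prompt, model,
# and output schema below are placeholders, not final choices.
from pypdf import PdfReader
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EXTRACTION_PROMPT = (
    "You are reviewing a legacy SEC application filing. "
    "Extract the applicant name, filing type, and key dates as JSON."
)

def extract_text(pdf_path: str) -> str:
    """Pull raw text out of a filing; real filings will need cleanup."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def process_filing(pdf_path: str) -> str:
    """Send the extracted text to the LLM and return its output."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": extract_text(pdf_path)},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(process_filing("sample_filing.pdf"))
```

Even in this simple form, most of the real work hides inside extract_text: inconsistent legacy formats mean the cleanup step is where the pipeline effort goes.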
After this phase is done, we'd analyze the performance and suggest a path forward, which would be either:
(1) tweak the prompting and continue to use retrieval-augmented generation (RAG; a rough sketch follows this list)
(2) move to fine-tuning an LLM (a sketch of the training data that requires appears at the end of this section)
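To ground option (1): RAG means we don't change the model at all; we embed chunks of the filings, retrieve the ones most relevant to a given question, and include them in the prompt. A minimal sketch, assuming OpenAI embedding and chat models; the chunking strategy, prompts, and top-k value are placeholder assumptions.

```python
# Minimal RAG sketch: embed document chunks, retrieve the chunks
# closest to a query, and answer from them. Models, prompts, and
# chunking are placeholders, not final choices.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(
        model="text-embedding-3-small", input=texts
    )
    return np.array([d.embedding for d in resp.data])

def answer(query: str, chunks: list[str], top_k: int = 3) -> str:
    chunk_vecs = embed(chunks)      # in practice, cache these
    query_vec = embed([query])[0]
    # Cosine similarity between the query and each chunk.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    context = "\n---\n".join(chunks[i] for i in np.argsort(sims)[-top_k:])
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided filing excerpts."},
            {"role": "user",
             "content": f"Excerpts:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```

The appeal of this path is that "tweaking" mostly means editing prompts and retrieval settings, which is cheap to iterate on compared to retraining anything.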
This could be easy or incredibly complex; we won't know until we test it.
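For a sense of what option (2) would involve: fine-tuning is less about code and more about assembling labeled examples, each pairing a filing excerpt with the output we want. As one illustration, OpenAI's fine-tuning API accepts training data as JSON lines in a chat format; the example content below is invented.

```python
# Sketch of assembling a fine-tuning dataset in OpenAI's chat JSONL
# format. The filing text and extracted fields here are invented
# for illustration only.
import json

examples = [
    {
        "messages": [
            {"role": "system",
             "content": "Extract key fields from SEC application filings."},
            {"role": "user",
             "content": "<raw text of a legacy filing>"},
            {"role": "assistant",
             "content": '{"applicant": "Acme Corp", "filing_type": "40-APP"}'},
        ]
    },
    # ...plus hundreds more human-reviewed examples
]

with open("training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

The hard part of this path is the dataset itself: someone has to review enough filings to produce reliable target outputs, which is exactly the labeling effort the first phase skips.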