Comparing AI Summarization Models: Bruce Lee Edition
What happens when competing models try to summarize the esoteric philosophy of Bruce Lee?
Summarizing documents — or sets of documents — is an important task for language models, but different models are better at different kinds of summarization and in different contexts. I've created a few AI-driven summarization models for clients recently, so I thought I'd compare some of the solutions out there. If you find this useful, or need some help navigating this space for your business, send me an email!
First off, the text to be summarized. I thought I'd give these models a bit of a challenge to see how they handle things. I recently started reading Bruce Lee's The Tao of Jeet Kune Do. (Jeet Kune Do is the martial art Bruce Lee invented after studying various martial arts earlier in his career.) The book begins with several series of sentences and short paragraphs on Zen Buddhism and the philosophy behind Jeet Kune Do. They are often koan-like and fairly esoteric, especially in translation, and they do not directly relate to each other. Coming up with a coherent summary of the passage would be a challenge for a human reader! So I thought I'd throw it at a couple language models and see how well they perform under pressure!
The text I used as the source comes from the section "Empty Your Mind: On Zen" at the beginning of The Tao of Jeet Kune Do. (See the complete text at the end of this article.
OpenAI vs. Google
I chose to attempt the summarization using OpenAI's GPT model and Google's T5 model. Without going into too much technical detail, these are simply two of the best summarizers out there, with a couple key differences. First, T5 is open-source, meaning you can run it on your own computer for free and you don't even need a GPU! OpenAI's GPT, however, is a pay-per-use model that you use via an API. (See my A.I. Glossary for definitions of these terms, if you're unfamiliar.) The other key difference is that T5 is trained specifically for summarization, where GPT is a more universal model. That means that using GPT to summarize requires both software engineering (writing code) and prompt engineering: crafting and testing a set of instructions to send to GPT along with the text to summarize.
Because each of these models come in multiple versions and have parameters that can be tweaked, I tried a few different versions for each one. Following are the two best outputs from each model, with some comments on what did/didn't work, and when you might or might not want to incorporate each model.
GPT 3.5 (gpt-3.5-turbo)
Cost: $0.0022 (USD)
GPT 3.5 Turbo is almost the latest and greatest from OpenAI. They recently opened up beta API access to GPT version 4, but I was still on the wait list when I ran these tests. (Perhaps a followup post will report on any improvements from version 4.) But they also recently pushed some improvements to version 3.5, along with a price reduction. So for most people building products on GPT, this is the model you want to use.
To generate a summary, I created a prompt that included the following instructions, plus the source text from Bruce Lee (at the end of this article):
Summarize the following list of sentences as a single coherent paragraph, approximately 250 words. It is ok to leave some information out, but all information included must be accurate according to the text provided. The response cannot have more than a single paragraph.
I settled on these specific instructions after some testing that produced things like inaccurate information (especially via ChatGPT's human interface), too many paragraphs, etc. I also included a temperature setting of 0.0. Temperature is a numerical value that tells GPT how creative to be. ChatGPT purposefully has a higher setting, causing it to be less predictable, and at times less truthful. OpenAI did this on purpose, at least in part to get feedback from users on good and bad responses, which they can feed back into development to make the model better. When we set the temperature at or near zero, there are fewer "hallucinations" and it's easier to get GPT to truthfully say "I don't know" when it, in fact, doesn't know.
Here's what GPT 3.5 Turbo produced based on the above prompt and temperature=0.0:
Enlightenment in martial arts requires the elimination of anything that obscures true knowledge and real life, while emphasizing the totality that unites all aspects. The proper use of the mind and will is necessary to transcend karma and realize the oneness of all life. Voidness is all-inclusive and living, and those who realize it are filled with life, power, and love. To achieve this, one must turn into a doll made of wood, without ego or grasping. Moving like water and responding like an echo, one must let go of the consciousness of self to properly execute physical action. The localization of the mind is freezing, and the immovable is the concentration of energy at a given focus. Art reaches its greatest peak when devoid of self-consciousness, and wisdom is learning to ride the good and evil like waves. In Buddhism, there is no place for using effort, and one must establish nothing in regard to oneself. There is no fixed teaching, only appropriate medicine for a particular ailment.
This is pretty good! In all honesty, it's more readable than the original, at least in English translation, though it does take some of the Zen mystery out of it. And while some of its wording might be more confusing than the original, especially with terms and phrases that have specific meanings in Zen or martial arts, it didn't introduce any "hallucinations" or false statements. Further prompt engineering may improve the output still more. And at 1/5 of a cent, it's pretty affordable, even if you had to run this operation at scale ($2,200 for 1 million summarizations of similar length and complexity).
GPT DaVinci
Cost: $0.0226 (USD)
Before GPT 3.5 Turbo, DaVinci was the latest and greatest from OpenAI, and it's still pretty good! The biggest difference for tasks like summarization is the cost. Though the most recent version of DaVinci was both better and cheaper than previous versions of GPT 3, it's a full 10x more expensive than GPT 3.5 Turbo for summarization tasks! So unless the results for a specific task are a lot better with DaVinci, we're going to want to use GPT 3.5 Turbo for most things.
So how did it perform? With the same prompt and temperature setting, it produced the following summary:
Enlightenment in martial arts is the attainment of true knowledge and real life, which is achieved through the proper use of the mind and will, and the annihilation of false notions of a separate self. It is a living void, where nothing is excluded or opposed, and one must become like a doll made of wood, with no ego or thoughts, and let the body and limbs work themselves out. To be successful, one must be like water, a mirror, and an echo, and be free of self-consciousness and the struggle between “for” and “against.” Wisdom is found in learning to “ride” the good and evil, and in letting go and not seeking. In Buddhism, there is no place for effort, and one must be ordinary and nothing special. There is no fixed teaching, only an appropriate medicine for a particular ailment. Ultimately, enlightenment is found in the oneness of all life, and the realization that nothing within you should stay rigid.
This is actually pretty close to what GPT 3.5 Turbo produced. If I were manually editing the results, I'd probably take some key phrases from each of them, but if I had to choose one over the other, I'd take GPT 3.5 Turbo. And if I were building a product, I'd want to run more tests on a variety of source texts before deciding between the two, given how close these results are.
However... compared to GPT 3.5 Turbo, DaVinci is incredibly expensive. Factoring in cost, GPT 3.5 Turbo is still the clear frontrunner.
T5 (Base Model)
Cost: FREE (sort of)
My go-to open-source model recently for text summarization is T5 (which stands for Text-To-Text Transfer Transformer), developed at Google. It's a great model, and I've had a lot of success with it, particularly when summarizing a group of sentences, social media posts, etc. that are similar in some way. (For example, different people saying the same thing in different ways on social media.)
What I really like about T5 is that I don't need to connect to an API or give anyone any money to use it. I can run it on my local machine, and if I need an extra boost for a larger project, I can fire up a Google Colab notebook for some GPU time. But most of the time, I don't even need to do that.
I also like that it's not prompt-based. You just write code and pipe your source text in.
So how did it fare with Bruce Lee? Here's what it produced:
The way to transcend karma lies in the proper use of the mind and the will. the struggle between "for" and "against" is the mind's worst disease. if you want the truth to stand clear before you, never be for or against. if you want to see a thing uncolored by one's own personal preferences and desires, do not like, do not dislike; all will then be clear, says dr. sanjay gupta.
Things were going so well!
Needless to say, Bruce Lee did not reference Dr. Sanjay Gupta in The Tao of Jeet Kune Do. That would be a fabrication from the model, and a pretty serious one if we were considering rolling it out in production for these kinds of summarization tasks and doing so without fine-tuning the model on our data. There are most definitely tasks where T5 comes up aces, particularly when you have the time and resources to fine-tune the model on a dataset more specific to your use case. But I think the diffuse nature of Lee's text, which I described above, was too much of a challenge for the base model out of the box.
And that means that we can't really say the open-source model is "free." To make it usable, at least in this case, we have to devote some human and computational resources to fine-tuning it — moreso than we need to with GPT, at least for this example. That means we're offsetting software and API costs for other costs, which may end up being more significant in the long run.
When the open-source model doesn't fit your needs out-of-the-box, you need to spend some time evaluating what the development costs associated with fine-tuning it will be, and whether or not that cost exceeds something like GPT 3.5 Turbo.
So far, then, GPT 3.5 Turbo is still the frontrunner.
T5 (Large Model)
Cost: FREE (sort of)
Google has released multiple versions of T5. (And that's to say nothing of the many versions available from users who have fine-tuned the model on custom data.) Could it be that another version of T5 will work for us?
Rather than try every T5 I could get my hands on, I used t5-large — a heavyweight version of the t5-base model tested above. Hypothetically, it should function in similar ways, but at a higher level of nuance or accuracy.
Simply trading out "t5-base" for "t5-large" in my code, here is the new result:
To obtain enlightenment in martial art means the extinction of everything which obscures the "true knowledge," the "real life" the way to transcend karma lies in the proper use of the mind and the will. the localization of the mind means it's freezing; when it ceases to flow freely as it is needed, it is no more the mind in its suchness. an assertion is Zen only when it is itself an act and does not refer to anything that is asserted in it.
This is certainly better than t5-base. No hallucinations, at least. But it's still not as good as GPT 3.5 Turbo, which at 0.22 cents is hardly going to break the bank, even if we're summarizing a few thousand documents.
And the winner is...
In my mind, GPT 3.5 Turbo is the clear winner. It's simultaneously the most detailed and the most readable. And given the development costs associated with fine-tuning T5, for many use cases it is actually the most affordable.
However, there are situations where:
- GPT 3.5 Turbo isn't good enough, in which case T5 may be easier to fine-tune.
- You have so many documents that fine-tuning T5 ends up being cheaper.
- The question isn't How much should we spend on this project? but What projects should we prioritize for the salaried team we are already paying for? And in that case, the financial equation isn't so simple, and a fine-tuned T5 may be a better overall business decision.
Bruce Lee was a fun thought experiment for me, but it grew out of real conversations I've been having with clients and real evaluations I've been putting tools through for specific projects. And in those specific projects, there isn't always a clear winner. And the winner isn't always the same.
For example, I used t5-base out of the box for a recent client project. The input data was better suited to T5 than the Bruce Lee passage here, the output was sufficient for the use case, and it ran fast and lightweight without any additional API or infrastructure costs. But for another project I'm currently scoping out, GPT 3.5 Turbo looks like it may be the better solution.
At the end of the day, there's a reason all of these different models exist. Some are improvements on previous models, some are improvements for specific purposes, and some are easier to customize. Taking the time to evaluate the options as they pertain to your specific situation can save a lot of headache, and even money, down the road.
Need help evaluating or implementing language models in your work? Get in touch!
And here's that Bruce Lee source text...
To obtain enlightenment in martial art means the extinction of everything which obscures the “true knowledge,” the “real life.” At the same time, it implies boundless expansion and, indeed, emphasis should fall not on the cultivation of the particular department which merges into the totality but rather on the totality that enters and unites that particular department.
The way to transcend karma lies in the proper use of the mind and the will. The oneness of all life is a truth that can be fully realized only when false notions of a separate self, whose destiny can be considered apart from the whole, are forever annihilated.
Voidness is that which stands right in the middle between this and that. The void is all-inclusive, having no opposite—there is nothing which it excludes or opposes. It is living void because all forms come out of it and whoever realizes the void is filled with life and power and the love of all beings.
Turn into a doll made of wood: it has no ego, it thinks nothing, it is not grasping or sticky. Let the body and limbs work themselves out in accordance with the discipline they have undergone.
If nothing within you stays rigid, outward things will disclose themselves. Moving, be like water. Still, be like a mirror. Respond like an echo.
Nothingness cannot be defined; the softest thing cannot be snapped.
I’m moving and not moving at all. I’m like the moon underneath the waves that ever go on rolling and rocking. It is not, “I am doing this,” but rather, an inner realization that “this is happening through me,” or “it is doing this for me.” The consciousness of self is the greatest hindrance to the proper execution of all physical action.
The localization of the mind means it’s freezing. When it ceases to flow freely as it is needed, it is no more the mind in its suchness.
The “Immovable” is the concentration of energy at a given focus, as at the axis of a wheel, instead of dispersal in scattered activities.
The point is the doing of them rather than the accomplishments. There is no actor but the action; there is no experiencer but the experience.
To see a thing uncolored by one’s own personal preferences and desires is to see it in its own pristine simplicity.
Art reaches its greatest peak when devoid of self-consciousness. Freedom discovers man the moment he loses concern over what impression he is making or about to make.
The perfect way is only difficult for those who pick and choose. Do not like, do not dislike; all will then be clear. Make a hairbreadth difference and heaven and earth are set apart; if you want the truth to stand clear before you, never be for or against. The struggle between “for” and “against” is the mind’s worst disease.
Wisdom does not consist of trying to wrest the good from the evil but in learning to “ride” them as a cork adapts itself to the crests and troughs of the waves.
Let yourself go with the disease, be with it, keep company with it—this is the way to be rid of it.
An assertion is Zen only when it is itself an act and does not refer to anything that is asserted in it.
In Buddhism, there is no place for using effort. Just be ordinary and nothing special. Eat your food, move your bowels, pass water and, when you’re tired, go and lie down. The ignorant will laugh at me, but the wise will understand.
Establish nothing in regard to oneself. Pass quickly like the non-existent and be quiet as purity. Those who gain, lose. Do not precede others, always follow them.
Do not run away; let go. Do not seek, for it will come when least expected.
Give up thinking as though not giving it up. Observe techniques as though not observing.
There is no fixed teaching. All I can provide is an appropriate medicine for a particular ailment.