Whether you're a developer troubleshooting complex issues or a team lead tracking recurring technical challenges, sifting through lengthy forum discussions can take up valuable time. This tutorial shows how to use KNIME and Generative AI to automatically summarize forum posts and their corresponding answers from StackOverflow. After retrieving high-voted questions, this workflow leverages Large Language Models (LLMs) to summarize questions’ key content, and stores the results–including URLs and other relevant information–in a MongoDB database.
This solution saves time and effort by automating data retrieval, summarization and storage using Generative AI and KNIME Analytics Platform. KNIME is an open source and free data science platform where you use visual workflows to build your applications, meaning you don’t need any coding knowledge to make sense of the data.
This blog series on Summarize with GenAI showcases a collection of KNIME workflows designed to access data from various sources (e.g., Box, Zendesk, Jira, Google Drive, etc.) and deliver concise, actionable summaries.
Here’s a quick 1-minute video that gives you a quick overview of the workflow. You can download the example workflow here, to follow along as we go through the tutorial
Let’s get started!
Automate forum post summarization and database storage
The goal of this solution is to automate the summarization of forum posts and their answers retrieved from StackOverflow, thereby saving time and improving accessibility by generating concise summaries.
We can do this in three steps:
- Automatically retrieve top questions and answers from StackOverflow
- Prompt an LLM to create clear, concise summaries
- Store results in a NoSQL database, such as MongoDB
This approach provides a scalable way to enhance forum data processing and improve information retrieval for end-users.

Step 1. Access data: Retrieve top questions and answers from StackOverflow

We begin by using the Get Request node to automatically get forum post data from the StackOverflow API. The API call fetches the top 10 questions, sorted by the number of votes, from StackOverflow, since January 1st, 2000. To do that, we use the following URL:
This API call ensures that we fetch the highest-voted questions (with a minimum of 1000 votes) along with their metadata, including the question title, URL, user name, and ID of accepted answer. We retain only the 10 most-voted questions and parse the JSON response of the GET request with the JSON Path node, converting the retrieved data into a structured dataset suitable for further processing.
Next, using the question URL, we scrape iteratively the full text of the top 10 questions and their corresponding answers using the Webpage Retriever node. This ensures that we gather all the necessary data and metainfo that serve as the foundation for the summarization process.

Step 2. Prompt LLM: Summarize forum posts with OpenAI’s GPT-4o-mini

Once the full text of the forum posts is extracted, we use the KNIME AI extension to select the most suitable LLM for the task, balancing costs and performance. For example, OpenAI’s GPT-4o-mini is one option. Other alternatives are also possible, including open-source, local models.
To establish the connection:
- Input the API key in the Credentials Configuration node
- Authenticate to the service using the OpenAI Authenticator node
- Connect to the GPT-4o-mini model using the OpenAI Chat Model Connector node
After establishing the connection to the model, we proceed to engineer a parameterized prompt using the Expression node. The prompt instructs the LLM to summarize both the question and the corresponding answer clearly and concisely, following prompt engineering best practices. Here’s the prompt we used:
join("\n\n", "Act as a forum support expert and summarize the following question and corresponding answer. If a code snippet is part of the answer, just advise to check the original answer to learn about the snippet:",
join("", "-Question title: ", $["Question"]), join("", "-Question text: ", $["Question full text"]), join("", "-Answer text: ", $["Answer full text"])).
The LLM Prompter node then sends the query to the LLM, generating concise forum post summaries and allowing for quicker insights.
Step 3. Deploy results: Store summaries in MongoDB

The final step of the workflow involves storing the summarized question and answer data, along with other relevant metadata, in a MongoDB database.
In the “Post-processing”metanode, we split the generated question and answer summaries into separate columns, remove trailing spaces, and create hyperlinks to the full texts. We then use the Table to JSON node to convert the post-processed table into a JSON object, a format required for storing data in MongoDB.
Next, we establish a connection to a local instance of MongoDB. To do that, we drag-and-drop the MongoDB Connector node. The node requires the configuration of a hostname and port. For local instances of MongoDB, the hostname is 127.0.0.1, whereas the port takes the default value of 27017, which is the standard port for MongoDB services.
Once the connection is established, we use the MongoDB Writer node to store the summarized question and answer texts and metadata. In the node, we specify the relevant database and collection, and select the JSON object with the summaries to populate the collection.
This stored data can then be easily retrieved or queried for further analysis or reporting.
The result: Streamlined forum data management

This workflow automates the retrieval and summarization of StackOverflow questions and answers, as well as the storage of summaries and forum post metadata in MongoDB. With it, we can save time and get a structured view of valuable forum interactions, making it easier to navigate and analyze them.
GenAI for summarization in KNIME
In this article from the Summarize with GenAI series, we explored how KNIME and GenAI can be used to automatically generate forum post summaries and store them in MongoDB for further analysis. This process helps streamline forum data management and improve accessibility, enabling us to stay always informed on the latest, valuable interactions.
You learned how to:
- Retrieve top forum questions and answers from StackOverflow
- Use an LLM to summarize them
- Store the results in MongoDB for efficient data retrieval
Download KNIME Analytics Platform and try out the workflow yourself!