Date posted: 13/08/2025 6 min read

The new rules of generative AI prompt engineering

As generative AI shifts to multimodal voice and vision, users must adapt their prompt engineering techniques.

In brief

While early generative AI models relied on text-only prompts, many now incorporate voice and vision prompts.
Users can interact with large language models (LLMs) through speech, images and dialogue, which can yield valuable, context-rich insights if your prompts are clear and structured.
Two frameworks – conversational prompting and vision-voice prompting – can help professionals collaborate with LLMs and work more efficiently.

Words by Tristan Tan CA.

Prompt engineering has long been viewed as a much-needed technical skill to engage with large language models (LLMs) and generative AI agents. The essence of prompt engineering is the ability to write clear, structured prompts to guide generative AI for more accurate outputs.

Early generative AI models depended on precise, text-only prompts, but as these systems evolve to include voice, vision and tools, you can use new prompting techniques to match this multimodal paradigm.

Multimodal generative AI models like GPT-4o, Gemini 2.0 Flash and Claude 3.5 Sonnet are a departure from text-only prompting. Instead, users now have the option to work with these models through speech, image input and sequential dialogue. When used by professionals in law, tax or finance, conversational prompting can reap significant benefits. It allows the LLM to obtain more context through conversation during the back-and-forth dialogue process, leading to more insightful outputs. However, conversational inputs can backfire if they are vague or unstructured, leading LLMs to generate irrelevant outputs due to excessive contextual interpretation.

Two new modalities of prompt engineering – conversational prompting and vision-voice prompting – build on traditional prompt engineering techniques but are better suited to harness the latest generative AI models.

Conversational prompting: a framework for dynamic dialogue

Conversational prompting represents a natural evolution from text-based commands to adaptive, real-time dialogue with an LLM. This approach treats each prompt as a step in a multi-turn, context-aware exchange. Like human conversation, generative AI builds on previous context to redefine successive responses, leading to more tailored outputs.

The goal of conversational prompting is to simulate collaboration, based on clear objectives. A tax professional might begin by asking AI to summarise a particular section of the Income Tax Assessment Act 1997 to establish the foundation of the verbal prompting exchange, then refine the question iteratively by first requesting client-friendly language, then adding sector-specific examples and finally adjusting for recent legislative changes. Each turn in this exchange allows the LLM to reference the context window of prior instructions, ultimately leading to more relevant and accurate output overall.

Vision-voice prompting: expanding sensory input

Vision-voice prompting is a response to the integration of image and speech capabilities into LLMs. Users can now upload a photo, use built-in camera functions or even share their screen with AI and receive a multimodal analysis almost instantaneously. This interaction style mirrors human-to-human communication more closely than typing or speaking alone.

The power of this approach lies in generative AI’s ability to extract meaning from visual and auditory data almost simultaneously. For example, a lawyer could open a newly released ruling in the browser and ask, via voice, ‘What other cases have dealt with Part IVA in relation to treaties?’. With Copilot Vision, the LLM can instantly scan and provide other relevant case law, all without any typed input.

With this advancement, practitioners can think aloud, speak naturally, and allow the AI to handle transcription and synthesis in real time, significantly increasing productivity and removing the need to search multiple sources to find meaningful information.

Implications for professionals

The practical implications of these prompt engineering frameworks for tax, legal and financial professionals are significant. Conversational prompting enables rapid iteration of advisory memos, contract analysis and compliance workflows. Vision-voice prompting enhances real-world document analysis, such as reviewing financial reports, handwritten notes or regulatory submissions.

By embedding these frameworks into our practice, we can treat generative AI models as a responsive collaborator, rather than a static search tool. We can work together to find answers to complex problems. This increases efficiency and helps us consider generative AI’s analysis in more detail to ensure that all advisory risks are covered.

Notably, the user’s expertise, expressed through prompts, still provides the crucial context for the LLM. The introduction of conversational and vision-voice prompting frameworks represents an evolution in the capabilities of frontier LLMs to assist practitioners by introducing new ways to interact with generative AI.

As models become more multimodal and agentic, the prompt becomes a conversation, an instruction and feedback all at once. We need to adopt this mindset to fully harness AI’s potential, with future directions for prompt engineering including prompt orchestration across applications, real-time collaboration via wearable devices and agentic AI systems that initiate their own prompts based on user context. If this becomes a reality, prompting will no longer be about users talking to AI but instead about how practitioners and AI work together to achieve a common goal.

Prompt engineering is not a static skill set. Instead, it serves as a way for us to accustom ourselves to generative AI, as the stepping stone to using agentic AI systems in the future.

Learn more

Need to learn more about generative AI? CA ANZ offers a range of courses covering AI in finance, ethical considerations, implementing AI and AI governance, including the new Certificate in AI Fluency. Visit the eStore to see all the options.

Audio articles

Explore Acuity on Air, the playlist where the pages of Acuity magazine come to life.

Listen now

Search

The new rules of generative AI prompt engineering

In brief

Conversational prompting: a framework for dynamic dialogue

Vision-voice prompting: expanding sensory input

Implications for professionals

Learn more

Audio articles

Search related topics

AI in forensic accounting and valuations

AI and the future: Three TED Talks to watch

Data for a smarter future: Three TED Talks to watch

Search

In brief

Conversational prompting: a framework for dynamic dialogue

Vision-voice prompting: expanding sensory input

Implications for professionals

Learn more

Audio articles

Search related topics

Technology

AI in forensic accounting and valuations

AI and the future: Three TED Talks to watch

Data for a smarter future: Three TED Talks to watch