TL;DR:
The previous post in this ongoing series introduced developers to the Gemini API by providing a more user-friendly and useful "Hello World!" sample than in the official Google documentation. The next steps: enhance that example to learn a few more features of the Gemini API, for example, support for streaming output and multi-turn conversations (chat), upgrade to the latest 1.0 or even 1.5 API versions, and switch to multimodality... stick around to find out how!
Introduction
Are you a developer interested in using Google APIs? You're in the right place, as this blog is dedicated to that craft from Python and sometimes Node.js. Previous posts showed you how to use Google credentials like API keys or OAuth client IDs with Google Workspace (GWS) APIs. Other posts introduced serverless computing or showed you how to export Google Docs as PDF.
The previous post kicked off the conversation about generative AI, presenting "Hello World!" examples that help you get started with the Gemini API in a more user-friendly way than in the docs. It presented samples showing you how to use the API from both Google AI as well as GCP Vertex AI.
This post follows up with several upgrades to that sample: moving to the latest 1.0 and 1.5 models (the latter is in public preview at the time of this writing), streaming output, multi-turn conversations ("chat"), and a multimodal example.
Whereas your initial journey began with code in both Python & Node.js plus API access from both Google AI & Vertex AI, this post focuses specifically on the "upgrades," so we're just gonna stick with one of each: Python-only & only on Google AI. Use the previous post's variety of samples to "extrapolate" porting to Node.js or running on Vertex AI.
Prerequisites
The examples assume you've performed the prerequisites from the previous post:
- Installed the Google GenAI Python package with: pip install -U pip google-generativeai
- Created an API key
- Saved the API key as a string to settings.py as API_KEY = 'YOUR_API_KEY_HERE' (and followed the suggestions for only hard-coding it in prototypes and keeping it safe when deploying to production)
For today's code samples, there are a couple more packages to install:
- The popular Python HTTP requests library
- The Python Imaging Library (PIL)'s flexible fork, Pillow
You can do so along with updating the GenAI package with: pip install -U Pillow requests google-generativeai (or pip3)
The "OG"
Let's start with the original script from the first post that we're going to upgrade here, gemtxt-simple-gai.py:
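The script itself lives in the repo and is described in the previous post. As a stand-in, here is a minimal sketch of what such a text-only request might look like, assuming the google-generativeai package and the settings.py file from the prerequisites. The ask() helper and the main() wrapper are illustrative names of my own, not taken from the original script.

```python
MODEL = 'gemini-pro'                      # the original 1.0 model name
PROMPT = 'Describe a cat in a few sentences'

def ask(model, prompt):
    """Send one text prompt to the model and return the reply text."""
    response = model.generate_content(prompt)
    return response.text

def main():
    # Third-party and local imports live here so ask() can be tried
    # against any object that exposes generate_content().
    import google.generativeai as genai
    from settings import API_KEY          # see Prerequisites

    genai.configure(api_key=API_KEY)
    model = genai.GenerativeModel(MODEL)
    print('** GenAI text: %r model & prompt %r\n' % (MODEL, PROMPT))
    print(ask(model, PROMPT))
```

Calling main() sends the request to the live API, so it needs a valid key in settings.py.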
Review the original post if you need a description of the code. This is the starting point for the remaining examples here.
Upgrade API version
The simplest update is to upgrade the API version. The original Gemini API 1.0 version was named gemini-pro. It was replaced soon thereafter by gemini-1.0-pro, and after that, the latest version, gemini-1.0-pro-latest.
The one-line upgrade was effected by updating the MODEL variable. This "delta" version is available in the repo as gemtxt-simple10-gai.py. Executing it results in output similar to the original version:
$ python3 gemtxt-simple10-gai.py
** GenAI text: 'gemini-1.0-pro-latest' model & prompt 'Describe a
cat in a few sentences'
A cat is a small, furry mammal with sharp claws and teeth. It
is a carnivore, meaning that it eats other animals. Cats are
often kept as pets because they are affectionate and playful.
They are also very good at catching mice and other small
rodents.
If you have access to the 1.5 API, update MODEL to gemini-1.5-pro-latest. The remaining samples below stay with the latest 1.0 model as 1.5 is still in preview.
Streaming
The next easiest update is to change to streaming output. When sending a request to an LLM (large language model), sometimes you don't want to wait for all of the output from the model to return before displaying to users. To give them a better experience, "stream" the output as it comes instead:
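The repo version is the reference; the following is only a minimal sketch of the streaming pattern, assuming the same google-generativeai setup as before. The stream_reply() helper, its out parameter, and main() are hypothetical names introduced for illustration.

```python
import sys

MODEL = 'gemini-1.0-pro-latest'
PROMPT = 'Describe a cat in a few sentences'

def stream_reply(model, prompt, out=sys.stdout):
    """Print each chunk as it arrives and return the full reply text."""
    pieces = []
    for chunk in model.generate_content(prompt, stream=True):
        # end='' suppresses the NEWLINE so the chunks chain together
        print(chunk.text, end='', file=out, flush=True)
        pieces.append(chunk.text)
    print(file=out)                       # single NEWLINE after the last chunk
    return ''.join(pieces)

def main():
    import google.generativeai as genai
    from settings import API_KEY          # see Prerequisites

    genai.configure(api_key=API_KEY)
    model = genai.GenerativeModel(MODEL)
    print('** GenAI text: %r model & prompt %r\n' % (MODEL, PROMPT))
    stream_reply(model, PROMPT)
```

Returning the joined text is a design choice for this sketch: it lets the caller keep the complete reply after it has been displayed piece by piece.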
Switching to streaming requires only the stream=True flag passed to the model's generate_content() method. The loop displays the chunks of data returned by the LLM as they come in. To keep the spacing consistent, set Python's print() function to not output a NEWLINE (\n) after each chunk with the end parameter.
Instead, keep chaining the chunks together and issue the NEWLINE after all have been retrieved and displayed. This version is also available in the repo as gemtxt-stream10-gai.py. A static listing can't show the output arriving as it is streamed, so you'll have to take my word for it. :-)
$ python3 gemtxt-stream10-gai.py
** GenAI text: 'gemini-1.0-pro-latest' model & prompt 'Describe a
cat in a few sentences'
A cat is a small, carnivorous mammal with soft fur, retractable
claws, and sharp teeth. They are known for their independence,
cleanliness, and playful nature. With its keen senses and
graceful movements, a cat exudes both mystery and intrigue. Its
sleek body is covered in sleek fur that ranges in color from
black to white to tabby.
Multi-turn conversations (chat)
Now, you may be building a chat application or executing a workflow where your user or system must interact with the model more than once, keeping context between messages. To facilitate this exchange, Google provides a convenience chat object, obtained with start_chat(), which features a send_message() method for communicating with the model instead of generate_content(), as shown below:
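What follows is a minimal sketch of that flow and not the repo script itself, assuming the google-generativeai setup from the prerequisites. The converse() helper and main() are hypothetical names; the two prompts are the ones used in the sample exchange in this section.

```python
MODEL = 'gemini-1.0-pro-latest'
PROMPTS = (
    'Describe a cat in a few sentences',
    "Since you're now a feline expert, what are the top three most "
    'friendly cat breeds for a family with small children?',
)

def converse(chat, prompts):
    """Send each prompt over one chat session and return the replies."""
    replies = []
    for prompt in prompts:
        print('USER:', prompt)
        response = chat.send_message(prompt)   # session keeps the context
        print('MODEL:', response.text, '\n')
        replies.append(response.text)
    return replies

def main():
    import google.generativeai as genai
    from settings import API_KEY          # see Prerequisites

    genai.configure(api_key=API_KEY)
    chat = genai.GenerativeModel(MODEL).start_chat()
    print('** GenAI text: %r model\n' % MODEL)
    converse(chat, PROMPTS)
```

Because every message goes through the same chat object, the second prompt can refer back to the first ("Since you're now a feline expert...") without resending it.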
While the flow is slightly different from what you've already seen, the basic operations are the same: send a prompt to the model and await the response. The core difference is that you're sending multiple messages in a row, with each subsequent message maintaining the full context of the ongoing "conversation." This version is found in the repo as gemtxt-simple10-chat-gai.py, and shown here is one sample exchange with the model:
$ python3 gemtxt-simple10-chat-gai.py
** GenAI text: 'gemini-1.0-pro-latest' model
USER: Describe a cat in a few sentences
MODEL: With its sleek fur, piercing eyes, and playful
spirit, the cat exudes both elegance and mischief. Its
nimble body and graceful movements make it an agile
hunter, while its affectionate nature brings joy to any
household. Its curious and independent spirit ensures
that each day brings new adventures for this feline
companion.
USER: Since you're now a feline expert, what are the top
three most friendly cat breeds for a family with small
children?
MODEL: **Top 3 Most Friendly Cat Breeds for Families
with Small Children:**
1. **Ragdoll:** Known for their docile and affectionate
nature, Ragdolls are incredibly gentle and patient with
children. They love to be cuddled and enjoy spending
time with their human companions.
2. **Maine Coon:** Despite their large size, Maine Coons
are known for their sweet and playful personalities. They
are great with kids and are often described as gentle
giants. Their playful and curious nature makes them a joy
to have around.
3. **Siamese:** While Siamese cats are known for being
vocal, they are also highly intelligent and affectionate.
They form strong bonds with their family members,
including children, and enjoy being involved in all
aspects of family life.
I'm not a cat owner, so I can't vouch for Gemini's accuracy there. Add a comment below if you have a take on it. Now let's switch gears a bit.
So far, all of the enhancements and corresponding samples are text-based, single-modality requests. A whole new class of functionality is available if a model can accept data in addition to text, in other form factors such as images, audio, or video content. The Google AI documentation states that this wider variety of input, "creates many additional possibilities for generating content, analyzing data, and solving problems."
Multimodal
Some Gemini models, and by extension, their corresponding APIs, support multimodality, "prompting with text, image, and audio data". Video is also supported, but you need to use the File API to convert it to a series of image frames. You can also use the File API to upload the assets to use in your prompts.
The sample script below takes an image and asks the LLM for some information about it, specifically this image:
The prompt is a fairly straightforward query: Where is this located, and what's the waterfall's name? Here is the multimodal version of the script posing this query; it's available in the repo as gemmmd-simple10loc-gai.py:
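As with the earlier samples, the repo holds the actual script. This is a minimal sketch of the text-plus-image request, assuming Pillow and google-generativeai are installed. The filename waterfall.jpg is a hypothetical placeholder, and ask_about_image() and main() are illustrative names.

```python
MODEL = 'gemini-1.0-pro-vision-latest'
IMG = 'waterfall.jpg'                     # hypothetical local filename
PROMPT = "Where is this located, and what's the waterfall's name?"

def ask_about_image(model, prompt, image):
    """Send the text prompt and the image together in one request."""
    response = model.generate_content((prompt, image))   # 2-tuple payload
    return response.text

def main():
    import google.generativeai as genai
    from PIL import Image                 # Pillow
    from settings import API_KEY          # see Prerequisites

    genai.configure(api_key=API_KEY)
    model = genai.GenerativeModel(MODEL)
    DATA = Image.open(IMG)                # read the image from local disk
    print('** GenAI multimodal: %r model & prompt %r\n' % (MODEL, PROMPT))
    print(ask_about_image(model, PROMPT, DATA))
```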
These are the key updates from the original app:
- Change to multimodal model: Gemini 1.0 Pro to Gemini 1.0 Pro Vision
- Import Pillow and use it to read the image data given its filename
- New prompt: pass in prompt string plus image payload
The MODEL variable now points to gemini-1.0-pro-vision-latest, the image filename is passed to Pillow to read its DATA, and rather than a single PROMPT string, pass in both the PROMPT and image DATA as a 2-tuple to generate_content(). Everything else stays the same. Let's see what Gemini says:
$ python3 gemmmd-simple10loc-gai.py
** GenAI multimodal: 'gemini-1.0-pro-vision-latest' model & prompt
"Where is this located, and what's the waterfall's name?"
The waterfall is located in the Jewel Changi Airport in
Singapore. It is called the HSBC Rain Vortex.
Online data vs. local
The final update is to take the previous example and change it to access images online rather than requiring them to be available on the local filesystem. For this, we'll use one of Google's stock images:
This one is pretty much identical to the one above, but uses the Python requests library to access the image for Pillow. The script below asks Gemini to Describe the scene in this photo and can be accessed in the repo as gemmmd-simple10url-gai.py:
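Again, treat this as a minimal sketch and not the repo script, assuming requests, Pillow, and google-generativeai are installed. The URL is a hypothetical placeholder, and the helpers download(), decode(), and describe_url() are my own names. Decoding the bytes through BytesIO is one way to hand the payload to Pillow; the original script may do it differently.

```python
from io import BytesIO

MODEL = 'gemini-1.0-pro-vision-latest'
IMG_URL = 'https://example.com/photo.jpg'     # hypothetical placeholder URL
PROMPT = 'Describe the scene in this photo'

def download(url):
    """HTTP GET the URL and return the raw response bytes."""
    import requests                       # third-party HTTP library
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()               # fail loudly on 4xx/5xx
    return resp.content

def decode(raw):
    """Turn raw image bytes into a Pillow image object."""
    from PIL import Image                 # Pillow
    return Image.open(BytesIO(raw))

def describe_url(model, prompt, url, fetch=download, to_image=decode):
    """Fetch an online image and send it along with the text prompt."""
    IMG_RAW = to_image(fetch(url))
    return model.generate_content((prompt, IMG_RAW)).text

def main():
    import google.generativeai as genai
    from settings import API_KEY          # see Prerequisites

    genai.configure(api_key=API_KEY)
    model = genai.GenerativeModel(MODEL)
    print('** GenAI multimodal: %r model & prompt %r\n' % (MODEL, PROMPT))
    print(describe_url(model, PROMPT, IMG_URL))
```

The fetch and to_image parameters exist only so the flow can be exercised without network access; normal use leaves them at their defaults.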
New here is the import of requests, followed by its use to perform an HTTP GET on the image URL (IMG_URL), reading the binary payload into IMG_RAW, which is passed along with the text prompt to generate_content(). Running this script results in the following output:
$ python3 gemmmd-simple10url-gai.py
** GenAI multimodal: 'gemini-1.0-pro-vision-latest' model &
prompt 'Describe the scene in this photo'
A young Asian man is sitting at a desk in an office. He is
wearing a white shirt and black pants. He has a big smile on
his face and is gesturing with his hands. There is a laptop,
notebook, and pen on the desk. There is a couch and some
plants in the background. The man is probably giving a
presentation or having a conversation with someone.
I originally designed a sixth derivative sample script to turn the above into a multi-turn conversation chat app, intending to further query the model asking, You are a marketing expert. If a company uses this photo in a press release, what product or products could they be selling?
I discovered, unfortunately, that multi-turn conversation isn't supported by the Gemini 1.0 Pro Vision multimodal model:
$ python3 gemmmd-simple10url-chat-gai.py
Traceback (most recent call last):
. . .
. . .
File "/home/wescpy/.local/lib/python3.9/site-packages/google/
api_core/timeout.py", line 120, in func_with_timeout
return func(*args, **kwargs)
File "/home/wescpy/.local/lib/python3.9/site-packages/google/
api_core/grpc_helpers.py", line 78, in error_remapped_callable
raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.InvalidArgument: 400 Multiturn chat
is not enabled for models/gemini-1.0-pro-vision-latest
While multimodal chat isn't supported by Gemini 1.0, it is available via the 1.5 preview, and that is the exact conversation we'll be having in the next post. Here's an abridged preview:
$ python3 gemmmd-simple15url-chat-gai.py
** GenAI multimodal: 'gemini-1.5-pro-latest' model
USER: Describe the scene in this photo
MODEL: A young man is sitting at a desk in a modern
office, smiling and gesturing with his hands. He is
wearing a light blue shirt and has dark hair. On the
desk in front of him is a laptop, a notebook, a pen,
and a cell phone. There is a comfortable-looking sofa
and chair behind him, and large windows offer a view
of the city skyline.
USER: You are a marketing expert. If a company uses
this photo in a press release, what product or
products could they be selling?
MODEL: ## Products this photo could promote:
Given the positive energy and modern office setting,
this photo could be used to market a variety of
products or services related to:
**Technology and Productivity:**
* **Project management software:** The man's
expression and gestures suggest successful
completion or organization.
* **Communication and collaboration tools:** The open
office layout hints at teamwork and connectivity.
* **Cloud-based services:** The laptop and modern
setting imply a reliance on technology and online
solutions.
* **Productivity apps or time management tools:**
The organized desk and the man's focused demeanor
suggest efficiency and control.
. . .
. . .
Summary
Developers are eager to jump into the world of AI/ML, especially GenAI & LLMs, and accessing Google's Gemini models via API is part of that picture. The previous post in the series got your foot in the door, presenting a more digestible, user-friendly "Hello World!" sample to help developers get started.
This post presents possible next steps, providing "102" samples that enhance the original script, furthering your exploration of Gemini API features but doing so without overburdening you with large swaths of code.
More advanced features that we didn't cover here are available via the Gemini API; they merit separate posts of their own:
The next post in the series focuses on Gemini's responses and explores the differences between the 1.0 and 1.5 models' outputs across a variety of queries, so stay tuned for that. If you found an error in this post or have a topic you want me to cover in the future, drop a note in the comments below! I've been on the road lately talking about Google APIs, AI included of course. Find the travel calendar at the bottom of my consulting site... I'd love to meet you IRL if I'm visiting your region!
Resources
- Google AI Gemini 1.0 models; Python code samples from this post
- Other blog post code samples
- Gemini API (Google AI)
- Gemini API (GCP Vertex AI)
- Gemini API (differences between both platforms)
- Gemini 1.5 (preview)
- Other Generative AI and Gemini resources
WESLEY CHUN, MSCS, is a Google Developer Expert (GDE) in Google Cloud (GCP) & Google Workspace (GWS), author of Prentice Hall's bestselling "Core Python" series, co-author of "Python Web Development with Django", and has written for Linux Journal & CNET. He runs CyberWeb specializing in GCP & GWS APIs and serverless platforms, Python & App Engine migrations, and Python training & engineering. Wesley was one of the original Yahoo!Mail engineers and spent 13+ years on various Google product teams, speaking on behalf of their APIs, producing sample apps, codelabs, and videos for serverless migration and GWS developers. He holds degrees in Computer Science, Mathematics, and Music from the University of California, is a Fellow of the Python Software Foundation, and loves to travel to meet developers worldwide at conferences, user group events, and universities. Follow he/him @wescpy & his technical blog. Find this content useful? Contact CyberWeb or buy him a coffee (or tea)!