Wesley Chun (@wescpy)

Posted on Apr 25 • Updated on May 3

Gemini API 102: Next steps beyond "Hello World!"

#python #google #ai #googleapi

TL;DR:

The previous post in this ongoing series introduced developers to the Gemini API by providing a more user-friendly and useful "Hello World!" sample than in the official Google documentation. The next steps: enhance that example to learn a few more features of the Gemini API, for example, support for streaming output and multi-turn conversations (chat), upgrade to the latest 1.0 or even 1.5 API versions, and switch to multimodality... stick around to find out how!

Introduction

Are you a developer interested in using Google APIs? You're in the right place as this blog is dedicated to that craft from Python and sometimes Node.js. Previous posts showed you how to use Google credentials like API keys or OAuth client IDs for use with Google Workspace (GWS) APIs. Other posts introduced serverless computing or showed you how to export Google Docs as PDF.

The previous post kicked off the conversation about generative AI, presenting "Hello World!" examples that help you get started with the Gemini API in a more user-friendly way than in the docs. It presented samples showing you how to use the API from both Google AI as well as GCP Vertex AI.

This post follows up with a multimodal example, one that supports streaming output, and another one leveraging multi-turn conversations ("chat"), and finally, another one that upgrades to using the latest 1.0 and 1.5 models, the latter of which is in public preview at the time of this writing.

Whereas your initial journey began with code in both Python & Node.js plus API access from both Google AI & Vertex AI, this post focuses specifically on the "upgrades," so we're just gonna stick with one of each: Python-only & only on Google AI. Use the previous post's variety of samples to "extrapolate" porting to Node.js or running on Vertex AI.

Prerequisites

The example assumes you've performed the prerequisites from the previous post:

Installed the Google GenAI Python package with: pip install -U pip google-generativeai
Created an API key
Saved API key as a string to settings.py as API_KEY = 'YOUR_API_KEY_HERE' (and followed the suggestions for only hard-coding it in prototypes and keeping it safe when deploying to production)

For today's code samples, there are a couple more packages to install:

The popular Python HTTP requests library
The Python Imaging Library (PIL)'s flexible fork, Pillow

You can do so along with updating the GenAI package with: pip install -U Pillow requests google-generativeai (or pip3)
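
The prerequisite about not hard-coding the key outside of prototypes can be sketched as follows. This is a minimal illustration, not code from the previous post: it prefers an environment variable and falls back to settings.py. The variable name GEMINI_API_KEY and the helper load_api_key() are assumptions made up for this example; substitute whatever your deployment environment provides.

```python
import os

ENV_VAR = 'GEMINI_API_KEY'  # hypothetical name chosen for this sketch

def load_api_key(environ=None):
    """Return the API key from the environment, else from settings.py."""
    environ = os.environ if environ is None else environ
    key = environ.get(ENV_VAR)
    if key:
        return key
    try:
        # prototype fallback: the hard-coded value described above
        from settings import API_KEY
    except ImportError:
        raise RuntimeError('Set %s or create settings.py' % ENV_VAR) from None
    return API_KEY
```

The scripts in this post could then call genai.configure(api_key=load_api_key()) instead of importing API_KEY directly.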

The "OG"

Let's start with the original script from the first post that we're going to upgrade here, gemtxt-simple-gai.py:

import google.generativeai as genai
from settings import API_KEY

PROMPT = 'Describe a cat in a few sentences'
MODEL = 'gemini-pro'
print('** GenAI text: %r model & prompt %r\n' % (MODEL, PROMPT))

genai.configure(api_key=API_KEY)
model = genai.GenerativeModel(MODEL)
response = model.generate_content(PROMPT)
print(response.text)

[CODE] gemtxt-simple-gai.py: "Hello World!" sample from previous post

Review the original post if you need a description of the code. This is the starting point for the remaining examples here.

Upgrade API version

The simplest update is to upgrade the API version. The original Gemini API 1.0 version was named gemini-pro. It was replaced soon thereafter by gemini-1.0-pro, and after that, the latest version, gemini-1.0-pro-latest.

import google.generativeai as genai
from settings import API_KEY

PROMPT = 'Describe a cat in a few sentences'
MODEL = 'gemini-1.0-pro-latest'
print('** GenAI text: %r model & prompt %r\n' % (MODEL, PROMPT))

genai.configure(api_key=API_KEY)
model = genai.GenerativeModel(MODEL)
response = model.generate_content(PROMPT)
print(response.text)

[CODE] gemtxt-simple10-gai.py: Uses latest Gemini 1.0 Pro model

The one-line upgrade was effected by updating the MODEL variable. This "delta" version is available in the repo as gemtxt-simple10-gai.py. Executing it results in output similar to the original version:

$ python3 gemtxt-simple10-gai.py
** GenAI text: 'gemini-1.0-pro-latest' model & prompt 'Describe a
cat in a few sentences'

 A cat is a small, furry mammal with sharp claws and teeth. It
is a carnivore, meaning that it eats other animals. Cats are
often kept as pets because they are affectionate and playful.
They are also very good at catching mice and other small
rodents.

If you have access to the 1.5 API, update MODEL to gemini-1.5-pro-latest. The remaining samples below stay with the latest 1.0 model as 1.5 is still in preview.
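
If you're unsure which of these model names your API key can use, one option is to ask the API. The sketch below is an illustration rather than part of the repo: pick_model() is a hypothetical helper that takes any iterable of objects with a name attribute and a supported_generation_methods list, which is the shape genai.list_models() is documented to yield (with names prefixed by models/). Verify that against your installed package version. The stand-in objects exist only so the sketch runs offline.

```python
from types import SimpleNamespace

def pick_model(models, preferred):
    """Return the first name in `preferred` usable with generate_content()."""
    available = set()
    for m in models:
        if 'generateContent' in m.supported_generation_methods:
            # the API reports names like 'models/gemini-1.0-pro-latest'
            available.add(m.name.split('/', 1)[-1])
    for name in preferred:
        if name in available:
            return name
    return None

# offline stand-ins (not real API output)
fake_models = [
    SimpleNamespace(name='models/gemini-1.0-pro-latest',
                    supported_generation_methods=['generateContent']),
    SimpleNamespace(name='models/embedding-001',
                    supported_generation_methods=['embedContent']),
]
WANTED = ('gemini-1.5-pro-latest', 'gemini-1.0-pro-latest')
chosen = pick_model(fake_models, WANTED)
print(chosen)
```

With the real library, this might be called as MODEL = pick_model(genai.list_models(), WANTED) after genai.configure().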

Streaming

The next easiest update is to change to streaming output. When sending a request to an LLM (large language model), sometimes you don't want to wait for all of the output from the model to return before displaying to users. To give them a better experience, "stream" the output as it comes instead:

import google.generativeai as genai
from settings import API_KEY

PROMPT = 'Describe a cat in a few sentences'
MODEL = 'gemini-1.0-pro-latest'
print('** GenAI text: %r model & prompt %r\n' % (MODEL, PROMPT))

genai.configure(api_key=API_KEY)
model = genai.GenerativeModel(MODEL)
response = model.generate_content(PROMPT, stream=True)
for chunk in response:
    print(chunk.text, end='')
print()

[CODE] gemtxt-stream10-gai.py: Produces streaming output

Switching to streaming requires only the stream=True flag passed to the model's generate_content() method. The loop displays the chunks of data returned by the LLM as they come in. To keep the spacing consistent, set Python's print() function to not output a NEWLINE (\n) after each chunk with the end parameter.

Instead, keep chaining the chunks together and issue the NEWLINE after all have been retrieved and displayed. This version is also available in the repo as gemtxt-stream10-gai.py. Its output here isn't going to reveal the output as it is streamed, so you have to take my word for it. :-)

$ python3 gemtxt-stream10-gai.py
** GenAI text: 'gemini-1.0-pro-latest' model & prompt 'Describe a
cat in a few sentences'

 A cat is a small, carnivorous mammal with soft fur, retractable
claws, and sharp teeth. They are known for their independence,
cleanliness, and playful nature. With its keen senses and
graceful movements, a cat exudes both mystery and intrigue. Its
sleek body is covered in sleek fur that ranges in color from
black to white to tabby.
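
Two details the streaming loop leaves implicit: output written without a NEWLINE may sit in a buffer before it's displayed, and the complete text is often needed afterwards (for logging, say). Here is a hedged sketch of one way to handle both. stream_text() is a hypothetical helper that accepts any iterable of objects with a text attribute, which is all the loop in gemtxt-stream10-gai.py relies on; the stand-in chunks are not real model output.

```python
import sys
from types import SimpleNamespace

def stream_text(chunks, out=sys.stdout):
    """Write each chunk as it arrives and return the full text."""
    pieces = []
    for chunk in chunks:
        out.write(chunk.text)
        out.flush()              # push partial output to the screen right away
        pieces.append(chunk.text)
    out.write('\n')              # single NEWLINE once everything has arrived
    return ''.join(pieces)

# offline stand-ins for streamed chunks
demo = [SimpleNamespace(text='A cat is '),
        SimpleNamespace(text='a small mammal.')]
full = stream_text(demo)
```

With the real API, pass in the streaming response: stream_text(model.generate_content(PROMPT, stream=True)).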

Multi-turn conversations (chat)

Now, you may be building a chat application or executing a workflow where your user or system must interact with the model more than once, keeping context between messages. To facilitate this exchange, Google provides a convenience chat object, obtained with start_chat() which features a send_message() method for communicating with the model instead of generate_content(), as shown below:

import google.generativeai as genai
from settings import API_KEY

PROMPTS = ('Describe a cat in a few sentences',
    "Since you're now a feline expert, what are the top three "
    'most friendly cat breeds for a family with small children?'
)
MODEL = 'gemini-1.0-pro-latest'
print('** GenAI text: %r model\n' % MODEL)

genai.configure(api_key=API_KEY)
model = genai.GenerativeModel(MODEL)
chat = model.start_chat()
for prompt in PROMPTS:
    print('\n    USER:', prompt)
    response = chat.send_message(prompt)
    print('\n    MODEL:', response.text)
print()

[CODE] gemtxt-simple10-chat-gai.py: Supports multi-turn conversations

While the flow is slightly different from what you've already seen, the basic operations are the same: send a prompt to the model and await the response. The core difference is that you're sending multiple messages in a row, with each subsequent message maintaining the full context of the ongoing "conversation." This version is found in the repo as gemtxt-simple10-chat-gai.py, and shown here is one sample exchange with the model:

$ python3 gemtxt-simple10-chat-gai.py
** GenAI text: 'gemini-1.0-pro-latest' model


    USER: Describe a cat in a few sentences

    MODEL: With its sleek fur, piercing eyes, and playful
    spirit, the cat exudes both elegance and mischief. Its
    nimble body and graceful movements make it an agile
    hunter, while its affectionate nature brings joy to any
    household. Its curious and independent spirit ensures
    that each day brings new adventures for this feline
    companion.

    USER: Since you're now a feline expert, what are the top
    three most friendly cat breeds for a family with small
    children?

    MODEL: **Top 3 Most Friendly Cat Breeds for Families
    with Small Children:**

    1. **Ragdoll:** Known for their docile and affectionate
    nature, Ragdolls are incredibly gentle and patient with
    children. They love to be cuddled and enjoy spending
    time with their human companions.

    2. **Maine Coon:** Despite their large size, Maine Coons
    are known for their sweet and playful personalities. They
    are great with kids and are often described as gentle
    giants. Their playful and curious nature makes them a joy
    to have around.

    3. **Siamese:** While Siamese cats are known for being
    vocal, they are also highly intelligent and affectionate.
    They form strong bonds with their family members,
    including children, and enjoy being involved in all
    aspects of family life.
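
To make "maintaining the full context" concrete, here is a conceptual sketch of how a chat wrapper can work: keep a running history and send all of it along with every new prompt. SketchChat and echo_count are hypothetical names invented for this illustration. This is not the implementation of the object returned by start_chat(), only the general idea.

```python
class SketchChat:
    """Conceptual stand-in for a chat session; not the library's class."""

    def __init__(self, generate):
        self.generate = generate   # callable: list of messages -> reply text
        self.history = []

    def send_message(self, prompt):
        self.history.append({'role': 'user', 'text': prompt})
        reply = self.generate(list(self.history))  # full context every turn
        self.history.append({'role': 'model', 'text': reply})
        return reply

def echo_count(messages):
    """Fake 'model' that only reports how many messages it was shown."""
    return 'saw %d message(s)' % len(messages)

chat = SketchChat(echo_count)
first = chat.send_message('Describe a cat in a few sentences')
second = chat.send_message('What are the friendliest breeds?')
print(first)
print(second)
```

The second call sees three messages (the first prompt, the first reply, and the new prompt), which is why a follow-up like "Since you're now a feline expert" makes sense to the model.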

I'm not a cat owner, so I can't vouch for the accuracy of Gemini's breed recommendations. Add a comment below if you have a take on it. Now let's switch gears a bit.

So far, all of the enhancements and corresponding samples are text-based, single-modality requests. A whole new class of functionality is available if a model can accept data in addition to text, in other form factors such as images, audio, or video content. The Google AI documentation states that this wider variety of input "creates many additional possibilities for generating content, analyzing data, and solving problems."

Multimodal

Some Gemini models, and by extension, their corresponding APIs, support multimodality, "prompting with text, image, and audio data". Video is also supported, but you need to use the File API to convert it to a series of image frames. You can also use the File API to upload the assets to use in your prompts.

The sample script below takes an image and asks the LLM for some information about it, specifically this image:

[IMAGE] Indoors dome waterfall; SOURCE: author (CC-BY-4.0)

The prompt is a fairly straightforward query: "Where is this located, and what's the waterfall's name?" Here is the multimodal version of the script posing this query... it's available in the repo as gemmmd-simple10loc-gai.py:

from PIL import Image
import google.generativeai as genai
from settings import API_KEY

IMG = 'dome-waterfall.jpg'
DATA = Image.open(IMG)
PROMPT = "Where is this located, and what's the waterfall's name?"
MODEL = 'gemini-1.0-pro-vision-latest'
print('** GenAI multimodal: %r model & prompt %r\n' % (MODEL, PROMPT))

genai.configure(api_key=API_KEY)
model = genai.GenerativeModel(MODEL)
response = model.generate_content((PROMPT, DATA))
print(response.text)

[CODE] gemmmd-simple10loc-gai.py: Multimodal sample with text & local image prompt

These are the key updates from the original app:

Change to multimodal model: Gemini 1.0 Pro to Gemini 1.0 Pro Vision
Import Pillow and use it to read the image data given its filename
New prompt: pass in prompt string plus image payload

The MODEL variable now points to gemini-1.0-pro-vision-latest, the image filename is passed to Pillow to read its DATA, and rather than a single PROMPT string, pass in both the PROMPT and image DATA as a 2-tuple to generate_content(). Everything else stays the same. Let's see what Gemini says:

$ python3 gemmmd-simple10loc-gai.py
** GenAI multimodal: 'gemini-1.0-pro-vision-latest' model & prompt
"Where is this located, and what's the waterfall's name?"

 The waterfall is located in the Jewel Changi Airport in
Singapore. It is called the HSBC Rain Vortex.
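
Pillow is one way to hand the image to the library. The package's documentation also describes passing raw image bytes as a dict with mime_type and data keys; treat that as an assumption to verify against your installed version. The sketch below builds such a dict using only the standard library. image_part() is a hypothetical helper name, not something from the repo.

```python
import mimetypes
import pathlib

def image_part(path):
    """Build a {'mime_type', 'data'} dict from an image file on disk."""
    path = pathlib.Path(path)
    mime_type, _ = mimetypes.guess_type(path.name)  # guessed from the extension
    if mime_type is None or not mime_type.startswith('image/'):
        raise ValueError('not a recognized image type: %s' % path.name)
    return {'mime_type': mime_type, 'data': path.read_bytes()}
```

If the dict form works in your environment, the call would look like model.generate_content((PROMPT, image_part(IMG))), with no Pillow import required.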

Online data vs. local

The final update is to take the previous example and change it to access images online rather than requiring them to be available on the local filesystem. For this, we'll use one of Google's stock images:

[IMAGE] Friendly man in office environment; SOURCE: Google

This one is pretty much identical to the one above, but uses the Python requests library to access the image for Pillow. The script below asks Gemini to "Describe the scene in this photo" and can be accessed in the repo as gemmmd-simple10url-gai.py:

from PIL import Image
import requests
import google.generativeai as genai
from settings import API_KEY

IMG_URL = 'https://google.com/services/images/section-work-card-img_2x.jpg'
IMG_RAW = Image.open(requests.get(IMG_URL, stream=True).raw)
PROMPT = 'Describe the scene in this photo'
MODEL = 'gemini-1.0-pro-vision-latest'
print('** GenAI multimodal: %r model & prompt %r\n' % (MODEL, PROMPT))

genai.configure(api_key=API_KEY)
model = genai.GenerativeModel(MODEL)
response = model.generate_content((PROMPT, IMG_RAW))
print(response.text)

[CODE] gemmmd-simple10url-gai.py: Multimodal sample with text & online image prompt

New here is the import of requests followed by its use to perform an HTTP GET on the image URL (IMG_URL), reading the binary payload into IMG_RAW, which is passed along with the text prompt to generate_content(). Running this script results in the following output:

$ python3 gemmmd-simple10url-gai.py
** GenAI multimodal: 'gemini-1.0-pro-vision-latest' model &
prompt 'Describe the scene in this photo'

 A young Asian man is sitting at a desk in an office. He is
wearing a white shirt and black pants. He has a big smile on
his face and is gesturing with his hands. There is a laptop,
notebook, and pen on the desk. There is a couch and some
plants in the background. The man is probably giving a
presentation or having a conversation with someone.
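
The one-line download in the script is fine for a demo, but it doesn't check whether the request succeeded, so a 404 page would be handed to Pillow as if it were an image. Here is a hedged sketch of a more defensive fetch. fetch_image_bytes() and fake_get() are hypothetical names for this illustration; the requests calls used (get() with timeout, raise_for_status(), headers, content) are standard, and the stand-in response exists only so the sketch runs offline.

```python
import io
from types import SimpleNamespace

def fetch_image_bytes(url, get=None, timeout=10):
    """Download `url` and return a file-like object for Image.open()."""
    if get is None:
        import requests          # imported lazily so the sketch loads without it
        get = requests.get
    resp = get(url, timeout=timeout)
    resp.raise_for_status()      # surface 4xx/5xx instead of parsing an error page
    ctype = resp.headers.get('Content-Type', '')
    if not ctype.startswith('image/'):
        raise ValueError('expected an image, got %r' % ctype)
    return io.BytesIO(resp.content)

def fake_get(url, timeout=None):
    """Offline stand-in for requests.get, for demonstration only."""
    return SimpleNamespace(raise_for_status=lambda: None,
                           headers={'Content-Type': 'image/jpeg'},
                           content=b'\xff\xd8fake')

buf = fetch_image_bytes('https://example.com/pic.jpg', get=fake_get)
```

In the script above, this would replace the IMG_RAW line with IMG_RAW = Image.open(fetch_image_bytes(IMG_URL)).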

I originally designed a sixth derivative sample script to turn gemmmd-simple10url-gai.py into a multi-turn chat app, intending to query the model further by asking, "You are a marketing expert. If a company uses this photo in a press release, what product or products could they be selling?" Unfortunately, I discovered that multi-turn conversation isn't supported by the Gemini 1.0 Pro Vision multimodal model:

$ python3 gemmmd-simple10url-chat-gai.py
Traceback (most recent call last):
  . . .
  . . .
  File "/home/wescpy/.local/lib/python3.9/site-packages/google/
  api_core/timeout.py", line 120, in func_with_timeout
    return func(*args, **kwargs)
  File "/home/wescpy/.local/lib/python3.9/site-packages/google/
  api_core/grpc_helpers.py", line 78, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.InvalidArgument: 400 Multiturn chat
    is not enabled for models/gemini-1.0-pro-vision-latest
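
If your own app might hit this error, you can catch it rather than letting the traceback reach users. A minimal sketch, assuming you pass in the exception class to catch: send_or_none() and RejectingChat are hypothetical names, and the fake chat object only imitates the failure above so the sketch runs offline.

```python
def send_or_none(chat, prompt, unsupported=Exception):
    """Send `prompt`; return the reply text, or None if the call is rejected."""
    try:
        return chat.send_message(prompt).text
    except unsupported as exc:
        print('chat request rejected:', exc)
        return None

class RejectingChat:
    """Offline stand-in that fails the way the vision model does above."""
    def send_message(self, prompt):
        raise ValueError('400 Multiturn chat is not enabled')

result = send_or_none(RejectingChat(), 'Describe the scene in this photo',
                      unsupported=ValueError)
```

With the real library, the class to pass as unsupported is the one named in the traceback, google.api_core.exceptions.InvalidArgument.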

While multimodal chat isn't supported by Gemini 1.0, it is available via the 1.5 preview, and that is the exact conversation we'll be having in the next post. Here's an abridged preview:

$ python3 gemmmd-simple15url-chat-gai.py
** GenAI multimodal: 'gemini-1.5-pro-latest' model


    USER: Describe the scene in this photo

    MODEL: A young man is sitting at a desk in a modern
    office, smiling and gesturing with his hands. He is
    wearing a light blue shirt and has dark hair. On the
    desk in front of him is a laptop, a notebook, a pen,
    and a cell phone. There is a comfortable-looking sofa
    and chair behind him, and large windows offer a view
    of the city skyline.

    USER: You are a marketing expert. If a company uses
    this photo in a press release, what product or
    products could they be selling?

    MODEL: ## Products this photo could promote:

    Given the positive energy and modern office setting,
    this photo could be used to market a variety of
    products or services related to:

    **Technology and Productivity:**

    * **Project management software:** The man's
    expression and gestures suggest successful
    completion or organization.
    * **Communication and collaboration tools:** The open
    office layout hints at teamwork and connectivity.
    * **Cloud-based services:** The laptop and modern
    setting imply a reliance on technology and online
    solutions.
    * **Productivity apps or time management tools:**
    The organized desk and the man's focused demeanor
    suggest efficiency and control.
. . .
. . .

Summary

Developers are eager to jump into the world of AI/ML, especially GenAI & LLMs, and accessing Google's Gemini models via API is part of that picture. The previous post in the series got your foot in the door, presenting a more digestible, user-friendly "Hello World!" sample to help developers get started.

This post presents possible next steps, providing "102" samples that enhance the original script, furthering your exploration of Gemini API features but doing so without overburdening you with large swaths of code.

More advanced features are available via the Gemini API that we didn't cover here; they merit separate posts of their own.

The next post in the series focuses on Gemini's responses and explores the differences between the 1.0 and 1.5 models' outputs across a variety of queries, so stay tuned for that. If you found an error in this post or have a topic you want me to cover in the future, drop a note in the comments below! I've been on the road lately talking about Google APIs, AI included of course. Find the travel calendar at the bottom of my consulting site... I'd love to meet you IRL if I'm visiting your region!

Resources

Google AI Gemini 1.0 models; Python code samples from this post
Other blog post code samples
- Gemini API samples
- Other Google APIs samples
Gemini API (Google AI)
Gemini API (GCP Vertex AI)
Gemini API (differences between both platforms)
- Google AI for GCP Vertex AI users
- GCP Vertex AI for Google AI users
Gemini 1.5 (preview)
Other Generative AI and Gemini resources

WESLEY CHUN, MSCS, is a Google Developer Expert (GDE) in Google Cloud (GCP) & Google Workspace (GWS), author of Prentice Hall's bestselling "Core Python" series, co-author of "Python Web Development with Django", and has written for Linux Journal & CNET. He runs CyberWeb specializing in GCP & GWS APIs and serverless platforms, Python & App Engine migrations, and Python training & engineering. Wesley was one of the original Yahoo!Mail engineers and spent 13+ years on various Google product teams, speaking on behalf of their APIs, producing sample apps, codelabs, and videos for serverless migration and GWS developers. He holds degrees in Computer Science, Mathematics, and Music from the University of California, is a Fellow of the Python Software Foundation, and loves to travel to meet developers worldwide at conferences, user group events, and universities. Follow he/him @wescpy & his technical blog. Find this content useful? Contact CyberWeb or buy him a coffee (or tea)!
