
Generate

POST /generate_vertex

Generates a full response. This endpoint is designed for use with Vertex AI; it is effectively a proxy for the /generate endpoint on the inference API, with some tweaks for compatibility.

The /generate endpoint is used to communicate with the LLM. Use this endpoint when you want to receive a full response from the LLM, all at once.

The text field can be either a single string or an array of strings, which allows a batch of requests to be sent at once. The server also supports dynamic batching, where requests arriving within a short time interval are processed as a single batch.
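The single-string and array forms described above can be sketched as follows. This is an illustrative helper, not part of the API; `build_payload` and its parameters are hypothetical names, and only the `instances`, `text`, and `max_new_tokens` fields come from the schema on this page.

```python
import json

def build_payload(text, max_new_tokens=None):
    """Build a /generate_vertex request body.

    `text` may be a single string or a list of strings (a batch).
    """
    instance = {"text": text}
    if max_new_tokens is not None:
        instance["max_new_tokens"] = max_new_tokens
    return {"instances": [instance]}

# Single request vs. batched request in one payload.
single = build_payload("Hello!")
batch = build_payload(["First prompt", "Second prompt"], max_new_tokens=64)
print(json.dumps(batch))
```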

Request

Body (required)

    instances (object[], required): array of objects with the following fields:

        constrained_decoding_backend (string, nullable)
        consumer_group (string, nullable)
        image_paths (string[], nullable)
        json_schema (nullable)
        lora_id (string, nullable)
        max_new_tokens (int64, nullable)
        min_new_tokens (int64, nullable)
        no_repeat_ngram_size (int64, nullable)
        prompt_max_tokens (int64, nullable)
        regex_string (string, nullable)
        repetition_penalty (float, nullable)
        sampling_temperature (float, nullable)
        sampling_topk (int64, nullable)
        sampling_topp (float, nullable)
        text (required, oneOf: string): the input text. Provided for convenience so users do not have to construct the clunky PayloadText directly; the mapping from InputText to PayloadText is given below.
        use_chat_template (boolean, nullable)
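A full request body using the fields above might be assembled as in the sketch below. The prompt, parameter values, and host URL are placeholder assumptions; only the field names come from the schema.

```python
import json

payload = {
    "instances": [
        {
            "text": "Write a haiku about the ocean.",
            "max_new_tokens": 256,
            "sampling_temperature": 0.7,
            "sampling_topk": 50,
            "sampling_topp": 0.95,
            "repetition_penalty": 1.1,
            "use_chat_template": True,
        }
    ]
}

body = json.dumps(payload).encode("utf-8")

# To actually send the request (commented out so the sketch stays
# self-contained; the host and port are placeholders):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8080/generate_vertex",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     predictions = json.load(resp)["predictions"]
print(len(body))
```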

Responses

The response is returned as a single JSON payload.

Schema

    predictions (object[], required): array of objects, each with:

        text (required, oneOf: string): the generated text.
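Extracting the generated text from the predictions array can be sketched as follows. The example response body here is illustrative, not captured from a real server; only the `predictions` and `text` field names come from the schema above.

```python
import json

# Illustrative response body in the documented shape.
raw = '{"predictions": [{"text": "Waves fold on the sand."}]}'
response = json.loads(raw)

# Each prediction carries its generated text; batched requests
# produce one prediction per input.
texts = [p["text"] for p in response["predictions"]]
print(texts[0])
```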