Chat Api #
This guide walks you through making calls to the Oobabooga Api using the instruct method, passing the instruction, username, and prompt from the main loop to the Api function. This guide uses Llama 2 Chat formatting.
1. Define Initial Settings #
- Import the needed libraries and define the connection settings for the Oobabooga Api.
import os
import json
import requests
# For local streaming, the websockets are hosted without ssl - http://
HOST = 'localhost:5000'
URI = f'http://{HOST}/api/v1/chat'
# For reverse-proxied streaming, the remote will likely host with ssl - https://
# URI = 'https://your-uri-here.trycloudflare.com/api/v1/chat'
- Define the history variable; we will keep it empty in order to maintain a stateless, OpenAi-Api-style approach.
def oobabooga(instruction, user_name, prompt):
    history = {'internal': [], 'visible': []}
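- For reference, the Api returns an updated history object with each response. Here is a minimal sketch of its shape after one exchange (the text values are purely illustrative); this is why the function later returns result['visible'][-1][1]:
# Illustrative shape of the history object returned by the Api.
# 'internal' holds the raw text the model saw; 'visible' holds what is shown
# to the user. Each entry is an [input, output] pair.
example_history = {
    'internal': [["Hello!", "Hi there, how can I help you today?"]],
    'visible': [["Hello!", "Hi there, how can I help you today?"]]
}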
2. Configure Request #
Within the function, configure the request settings:
- Define user input, max tokens, and history:
    request = {
        'user_input': prompt,
        'max_new_tokens': 800,
        'history': history,
- Set the mode, model, and instruct format:
        'mode': 'instruct', # Valid options: 'chat', 'chat-instruct', 'instruct'
        'instruction_template': 'Llama-v2', # Will get autodetected if unset
        'context_instruct': f"{instruction}", # Optional
- Assign the user’s name:
        'your_name': f'{user_name}',
3. Tune Output Settings #
It’s essential to adjust the settings that influence the generation output:
Temperature: Temperature is the main factor in controlling the randomness of outputs. It directly affects the probability distribution of the outputs: a lower temperature is more deterministic, while a higher temperature is more creative.
Top_P: Top_p is also known as nucleus sampling. It is an alternative to using temperature alone to control the randomness of the model’s output. This setting chooses from the smallest set of tokens whose cumulative probability exceeds a threshold p; this set of tokens is referred to as the “nucleus”. It is more dynamic than top_k and can lead to more diverse and richer outputs, especially in cases where the model is uncertain. (A short sketch after the settings block below illustrates how top_p and top_k narrow the candidate pool.)
Top_K: Top_k is another sampling strategy. After the model calculates probabilities for every token in its vocabulary, it restricts the selection to only the k most likely tokens. Top_k is a lot more predictable and simpler to use than top_p, but it can make the output too narrow and repetitive.
Repetition_Penalty: This setting helps improve the output by reducing redundant or repetitive content. A value above 1 penalizes tokens that have already appeared, discouraging repetition, while a value below 1 encourages it.
Min_Length: This setting sets a minimum length for the generated output, although I have found it doesn’t work with all models.
Truncation_Length: This setting sets the maximum length of the prompt, in tokens, before it is cut off.
        'regenerate': False,
        '_continue': False,
        'stop_at_newline': False,
        'chat_generation_attempts': 1,
        # Generation params. If 'preset' is set to different than 'None', the values
        # in presets/preset-name.yaml are used instead of the individual numbers.
        'preset': 'None',
        'do_sample': True,
        'temperature': 0.85,
        'top_p': 0.2,
        'typical_p': 1,
        'epsilon_cutoff': 0, # In units of 1e-4
        'eta_cutoff': 0, # In units of 1e-4
        'tfs': 1,
        'top_a': 0,
        'repetition_penalty': 1.18,
        'top_k': 40,
        'min_length': 40,
        'no_repeat_ngram_size': 0,
        'num_beams': 1,
        'penalty_alpha': 0,
        'length_penalty': 1,
        'early_stopping': False,
        'mirostat_mode': 0,
        'mirostat_tau': 5,
        'mirostat_eta': 0.1,
        'seed': -1,
        'add_bos_token': True,
        'truncation_length': 4096,
        'ban_eos_token': False,
        'skip_special_tokens': True,
        'stopping_strings': []
    }
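- To build intuition for the top_p and top_k settings explained above, here is a small standalone sketch (not part of the Api request) that filters a made-up probability distribution the same way those samplers do. The token probabilities below are invented purely for illustration:
import random

# Made-up next-token probabilities for illustration only.
probs = {'the': 0.40, 'a': 0.25, 'cat': 0.15, 'dog': 0.10, 'pizza': 0.06, 'moon': 0.04}

def top_k_filter(probs, k):
    # Keep only the k most likely tokens.
    return dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])

def top_p_filter(probs, p):
    # Keep the smallest set of top tokens whose cumulative probability exceeds p.
    kept, total = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return kept

print(top_k_filter(probs, 3))    # {'the': 0.4, 'a': 0.25, 'cat': 0.15}
print(top_p_filter(probs, 0.2))  # {'the': 0.4} -> a low top_p is very restrictive
pool = top_p_filter(probs, 0.9)
# Sample from the filtered pool; relative weights do not need to sum to 1.
print(random.choices(list(pool), weights=list(pool.values()), k=1)[0])
- With the values used in this guide (temperature 0.85, top_p 0.2, top_k 40), the low top_p keeps the candidate pool small, so outputs stay fairly focused; raising top_p or temperature loosens that.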
4. Send Request and Retrieve Response #
- Post the request and process the response:
    response = requests.post(URI, json=request)
    if response.status_code == 200:
        result = response.json()['results'][0]['history']
        return result['visible'][-1][1]
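- Once the function is defined, you can sanity-check it on its own before building the loop. This is a minimal sketch, assuming the Oobabooga server is running locally with a Llama 2 Chat model loaded; the strings follow the formatting used later in this guide:
# Quick standalone test of the oobabooga() function (illustrative strings).
test_instruction = "[INST] <<SYS>>\nYou are a helpful Ai assistant.\n<</SYS>>"
test_prompt = "[INST] USER: Give me one fun fact about llamas. [/INST]ASSISTANT: "
print(oobabooga(test_instruction, "User", test_prompt))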
5. Build the Main Loop #
- Begin by initiating the main chatbot loop:
if __name__ == '__main__':
    conversation = list()
6. Define Chatbot & User Details #
- Set the system prompt and user details. This guide uses Llama-2-Chat formatting.
bot_name = "Assistant"
user_name = "User"
usernameupper = user_name.upper()
botnameupper = bot_name.upper()
instruction = (f"[INST] <<SYS>>\nYou are a helpful Ai assistant. Respond to the user with long, verbose answers.\n<</SYS>>")
7. Handle User Interaction #
- Capture and process user input:
    while True:
        try:
            user_input = input(f'\n\n{usernameupper}: ')
8. Append Conversation List #
- Append the conversation list with the username, the user input, and the chatbot name.
conversation.append({'content': f"[INST] {usernameupper}: {user_input} [/INST]"})
conversation.append({'content': f"{botnameupper}: "})
9. Response Generation #
- Join the conversation list together and pass the required variables to the Api function:
            prompt = ''.join([message_dict['content'] for message_dict in conversation])
            output = oobabooga(instruction, user_name, prompt)
- Print the response from Oobabooga and clear the conversation list for future generations.
print(f"{botnameupper}: {output}")
conversation.clear()
        except Exception as e:
            print(e)
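- To make the data flow concrete, here is roughly what one turn produces (the user text is illustrative). The system instruction travels separately as 'context_instruct', and because the conversation list is cleared after every reply, each request carries only the current turn:
# Example of the joined prompt for a single turn (illustrative text).
example_conversation = [
    {'content': "[INST] USER: What is nucleus sampling? [/INST]"},
    {'content': "ASSISTANT: "},
]
example_prompt = ''.join(m['content'] for m in example_conversation)
# example_prompt == "[INST] USER: What is nucleus sampling? [/INST]ASSISTANT: "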