Model: GPT-4
Avatar of the person chatting

I want to write Python code that takes a JSON file containing Japanese documentation and outputs an English-translated JSON file.

# JSON file format
```
{
  "pages": [
    {
      "title": "page1title",
      "lines": [
        "page1title",
        "line2",
        "line3"
      ]
    },
    {
      "title": "page2title",
      "lines": [
        "page2title",
        "line2",
        "line3"
      ]
    },
    ...
  ]
}
```

# What to do
1. Receive the JSON file.
2. Extract only the titles of all pages, translate them into English, and build a dictionary that pairs the Japanese and English titles.
3. Translate the body (lines) of each page. Note: the body contains links to other pages' titles, enclosed in brackets as [title]. Those titles must be translated word for word according to the English translations held in the dictionary.

# How to translate
Use the ChatGPT API. A sample input and output of the ChatGPT API are shown below. I want to tell the assistant to translate into English and then use its output. If there are conditions or caveats for the translation, pass those along as well.

```sample call
# Note: you need to be using OpenAI Python v0.27.0 for the code below to work
import openai

openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)
```
```sample response
{
 'id': 'chatcmpl-6p9XYPYSTTRi0xEviKjjilqrWU2Ve',
 'object': 'chat.completion',
 'created': 1677649420,
 'model': 'gpt-3.5-turbo',
 'usage': {'prompt_tokens': 56, 'completion_tokens': 31, 'total_tokens': 87},
 'choices': [
   {
    'message': {
      'role': 'assistant',
      'content': 'The 2020 World Series was played in Arlington, Texas at the Globe Life Field, which was the new home stadium for the Texas Rangers.'},
    'finish_reason': 'stop',
    'index': 0
   }
  ]
}
```

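The Python code below is designed to meet the requirements you specified. It takes a Japanese JSON file as input, translates it into English, and writes out the translated JSON file.

First, a translate() function performs the translation through the ChatGPT API. The main routine then loads the JSON file and runs the translation following the steps you listed.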

```python
import json
import openai


def translate(text, role="user"):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": role, "content": text}
        ]
    )
    return response.choices[0].message.content


def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    for page in data['pages']:
        translated_title = translate(page['title'])
        title_translation_dict[page['title']] = translated_title
        page['title'] = translated_title

    # Translate lines with translated titles replaced
    for page in data['pages']:
        translated_lines = []
        for line in page['lines']:
            for jp_title, en_title in title_translation_dict.items():
                line = line.replace(f"[{jp_title}]", f"[{en_title}]")
            translated_line = translate(line)
            translated_lines.append(translated_line)
        page['lines'] = translated_lines

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


# Replace 'input.json' and 'output.json' with your desired file paths
translate_json_file('input.json', 'output.json')
```

Before running this code, you need to install the openai module and set your OpenAI API key.

```bash
pip install openai
```

You can set the API key inside the Python code as follows.

```python
import openai

openai.api_key = "your_api_key_here"
```

You can also set the API key as an environment variable. To do so, run the following command (the environment variable name is OPENAI_API_KEY).

```bash
export OPENAI_API_KEY=your_api_key_here
```

This code translates the specified JSON file and writes a new JSON file containing the translated text. The file names are passed as arguments to the translate_json_file function.
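As a side note on the [title] links: the code above loops over every dictionary entry for every line and calls str.replace each time. A minimal sketch of an alternative is shown below, assuming links never nest and never contain brackets themselves. It performs one regex pass per line and looks each bracketed title up in the dictionary. The names LINK_PATTERN and replace_title_links are hypothetical and are not part of the code above.

```python
import re

# Matches "[...]" where the content contains no brackets (assumption: links do not nest).
LINK_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def replace_title_links(text, title_map):
    """Swap each [Japanese title] for its dictionary translation in one pass."""
    def swap(match):
        en_title = title_map.get(match.group(1))
        # Leave the link untouched when the title is not in the dictionary.
        return f"[{en_title}]" if en_title is not None else match.group(0)

    return LINK_PATTERN.sub(swap, text)


title_map = {"りんご": "Apple", "みかん": "Mandarin Orange"}
line = "[りんご]と[みかん]、そして[未登録]"
print(replace_title_links(line, title_map))
```

Because each link is replaced exactly once, a translated title that happens to match another Japanese title is never replaced a second time, which chained str.replace calls cannot guarantee.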

Avatar of the person chatting

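Thanks. The code above calls translate() once for every title and line, but every call adds to the API cost, so I want to keep the number of calls to a minimum. However, the maximum number of characters that can be passed to ChatCompletion is about 8000. When translating titles, translate as many as fit within that limit in a single call. When translating a page, translate the whole page in a single call. If it hits the character limit, split it as needed.

I have modified the code to minimize the number of API calls, as requested. In the code below, titles are translated in batches that are as large as possible, and each page is translated in a single call. If a page exceeds the character limit, it is split as needed before translation.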

```python
import json
import openai

MAX_TOKENS = 8000


def translate(text, role="user"):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": role, "content": text}
        ]
    )
    return response.choices[0].message.content


def translate_titles(title_list):
    translated_titles = []
    title_chunk = ""

    for title in title_list:
        if len(title_chunk) + len(title) + 1 < MAX_TOKENS:
            title_chunk += title + "\n"
        else:
            translated_chunk = translate(title_chunk, role="title_translation")
            translated_titles.extend(translated_chunk.split("\n")[:-1])
            title_chunk = title + "\n"

    if title_chunk:
        translated_chunk = translate(title_chunk, role="title_translation")
        translated_titles.extend(translated_chunk.split("\n")[:-1])

    return translated_titles


def translate_page(page_text):
    if len(page_text) <= MAX_TOKENS:
        return translate(page_text, role="page_translation")
    else:
        split_point = page_text.rfind("\n", 0, MAX_TOKENS)
        first_half = page_text[:split_point]
        second_half = page_text[split_point + 1:]
        return translate_page(first_half) + "\n" + translate_page(second_half)


def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    title_list = [page['title'] for page in data['pages']]
    translated_titles = translate_titles(title_list)

    for original_title, translated_title in zip(title_list, translated_titles):
        title_translation_dict[original_title] = translated_title

    for page, translated_title in zip(data['pages'], translated_titles):
        page['title'] = translated_title

    # Translate lines with translated titles replaced
    for page in data['pages']:
        page_text = "\n".join(page['lines'])

        for jp_title, en_title in title_translation_dict.items():
            page_text = page_text.replace(f"[{jp_title}]", f"[{en_title}]")

        translated_text = translate_page(page_text)
        page['lines'] = translated_text.split("\n")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


# Replace 'input.json' and 'output.json' with your desired file paths
translate_json_file('input.json', 'output.json')
```

This code defines a translate_titles function for translating titles and a translate_page function for translating pages.
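Two details of the batching are worth sketching separately. The limit above is compared against len(), which counts characters rather than tokens, and the reply is split with split("\n")[:-1], which silently drops the last title whenever the model's reply does not end with a newline. The sketch below is one possible way to make both steps explicit. The helper names chunk_titles and pair_titles are hypothetical, and the budget is still measured in characters.

```python
def chunk_titles(titles, max_chars):
    """Group titles into batches whose newline-joined length stays within max_chars."""
    chunks, current, current_len = [], [], 0
    for title in titles:
        cost = len(title) + 1  # +1 for the newline separator
        if current and current_len + cost > max_chars:
            chunks.append(current)
            current, current_len = [], 0
        current.append(title)
        current_len += cost
    if current:
        chunks.append(current)
    return chunks


def pair_titles(originals, translated_block):
    """Map each original title to its translation, failing loudly on a count mismatch."""
    translated = [ln.strip() for ln in translated_block.splitlines() if ln.strip()]
    if len(translated) != len(originals):
        raise ValueError(
            f"expected {len(originals)} translated titles, got {len(translated)}"
        )
    return dict(zip(originals, translated))


print(chunk_titles(["aa", "bb", "cc"], max_chars=6))
print(pair_titles(["りんご", "みかん"], "Apple\nMandarin Orange"))
```

Raising on a mismatch is a design choice: a misaligned zip would otherwise assign wrong English titles to every page after the first missing line.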

Avatar of the person chatting

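Thanks. I think ChatCompletion is currently being called synchronously; please make it parallel and asynchronous.

Understood. The code below uses aiohttp and asyncio to call ChatCompletion asynchronously. This should improve translation performance.

First, install aiohttp. (asyncio is part of the Python standard library and does not need to be installed.)

```bash
pip install aiohttp
```

Then run the following code.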

```python
import json
import asyncio
import aiohttp
import openai

MAX_TOKENS = 8000


async def async_translate(session, text, role="user"):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai.api_key}"
    }

    data = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": role, "content": text}
        ]
    }

    async with session.post("https://api.openai.com/v1/chat/completions", headers=headers, json=data) as resp:
        response = await resp.json()
        return response["choices"][0]["message"]["content"]


async def translate_titles(session, title_list):
    translated_titles = []
    title_chunk = ""

    for title in title_list:
        if len(title_chunk) + len(title) + 1 < MAX_TOKENS:
            title_chunk += title + "\n"
        else:
            translated_chunk = await async_translate(session, title_chunk, role="title_translation")
            translated_titles.extend(translated_chunk.split("\n")[:-1])
            title_chunk = title + "\n"

    if title_chunk:
        translated_chunk = await async_translate(session, title_chunk, role="title_translation")
        translated_titles.extend(translated_chunk.split("\n")[:-1])

    return translated_titles


async def translate_page(session, page_text):
    if len(page_text) <= MAX_TOKENS:
        return await async_translate(session, page_text, role="page_translation")
    else:
        split_point = page_text.rfind("\n", 0, MAX_TOKENS)
        first_half = page_text[:split_point]
        second_half = page_text[split_point + 1:]
        first_half_translated = await async_translate(session, first_half, role="page_translation")
        second_half_translated = await translate_page(session, second_half)
        return first_half_translated + "\n" + second_half_translated


async def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    title_list = [page['title'] for page in data['pages']]

    async with aiohttp.ClientSession() as session:
        translated_titles = await translate_titles(session, title_list)

        for original_title, translated_title in zip(title_list, translated_titles):
            title_translation_dict[original_title] = translated_title

        for page, translated_title in zip(data['pages'], translated_titles):
            page['title'] = translated_title

        # Translate lines with translated titles replaced
        for page in data['pages']:
            page_text = "\n".join(page['lines'])

            for jp_title, en_title in title_translation_dict.items():
                page_text = page_text.replace(f"[{jp_title}]", f"[{en_title}]")

            translated_text = await translate
```
Avatar of the person chatting

continue from this line

My apologies for the incomplete response. Here is the complete code:

```python
            translated_text = await translate_page(session, page_text)
            page['lines'] = translated_text.split("\n")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


# Replace 'input.json' and 'output.json' with your desired file paths
async def main():
    await translate_json_file('input.json', 'output.json')

asyncio.run(main())
```

In the updated code, we have defined the main function as asynchronous, and then use asyncio.run(main()) to run the async translate_json_file function. This way, the translation process takes advantage of asynchronous processing, which should improve the performance of the translation tasks.

Avatar of the person chatting

Thanks. One correction: role should always be system or user. Also, please run the translate_page calls concurrently.

Understood. In the code below, role is always set to system or user, and the translate_page calls run concurrently.

```python
import json
import asyncio
import aiohttp
import openai

MAX_TOKENS = 8000


async def async_translate(session, text):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai.api_key}"
    }

    data = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": "user", "content": text}
        ]
    }

    async with session.post("https://api.openai.com/v1/chat/completions", headers=headers, json=data) as resp:
        response = await resp.json()
        return response["choices"][0]["message"]["content"]


async def translate_titles(session, title_list):
    translated_titles = []
    title_chunk = ""

    for title in title_list:
        if len(title_chunk) + len(title) + 1 < MAX_TOKENS:
            title_chunk += title + "\n"
        else:
            translated_chunk = await async_translate(session, title_chunk)
            translated_titles.extend(translated_chunk.split("\n")[:-1])
            title_chunk = title + "\n"

    if title_chunk:
        translated_chunk = await async_translate(session, title_chunk)
        translated_titles.extend(translated_chunk.split("\n")[:-1])

    return translated_titles


async def translate_page(session, page_text):
    if len(page_text) <= MAX_TOKENS:
        return await async_translate(session, page_text)
    else:
        split_point = page_text.rfind("\n", 0, MAX_TOKENS)
        first_half = page_text[:split_point]
        second_half = page_text[split_point + 1:]
        first_half_translated = await async_translate(session, first_half)
        second_half_translated = await translate_page(session, second_half)
        return first_half_translated + "\n" + second_half_translated


async def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    title_list = [page['title'] for page in data['pages']]

    async with aiohttp.ClientSession() as session:
        translated_titles = await translate_titles(session, title_list)

        for original_title, translated_title in zip(title_list, translated_titles):
            title_translation_dict[original_title] = translated_title

        for page, translated_title in zip(data['pages'], translated_titles):
            page['title'] = translated_title

        # Translate lines with translated titles replaced
        translation_tasks = []

        for page in data['pages']:
            page_text = "\n".join(page['lines'])

            for jp_title, en_title in title_translation_dict.items():
                page_text = page_text.replace(f"[{jp_title}]", f"[{en_title}]")

            translation_tasks.append(translate_page(session, page_text))

        translated_texts = await asyncio.gather(*translation_tasks)

        for page, translated_text in zip(data['pages'], translated_texts):
            page['lines'] = translated_text.split("\n")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


# Replace 'input
```
Avatar of the person chatting

continue

Here is the continuation of the code:

```python
# Replace 'input.json' and 'output.json' with your desired file paths
async def main():
    await translate_json_file('input.json', 'output.json')

asyncio.run(main())
```

In this updated code, the translate_page tasks are executed concurrently using asyncio.gather. This should further improve the performance of the translation process.
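One caveat: asyncio.gather starts every page request at once, so a file with many pages may run into API rate limits. A minimal sketch of capping the number of in-flight requests with asyncio.Semaphore is shown below. The helper gather_limited and the stand-in coroutine fake_translate are hypothetical; in the real code, the translate_page coroutines would be passed in instead, and the right limit depends on your account's rate limits.

```python
import asyncio


async def gather_limited(coros, limit):
    """Run coroutines concurrently, but at most `limit` at a time; results keep input order."""
    semaphore = asyncio.Semaphore(limit)

    async def run(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))


async def fake_translate(text):
    # Stand-in for translate_page(session, page_text); no network call is made here.
    await asyncio.sleep(0.01)
    return text.upper()


async def demo():
    pages = ["page one", "page two", "page three"]
    return await gather_limited([fake_translate(p) for p in pages], limit=2)


results = asyncio.run(demo())
print(results)
```

Since asyncio.gather returns results in the order the awaitables were passed, zipping the results back onto data['pages'] still lines up.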

Avatar of the person chatting

Before sending text to OpenAI,
- replace indenting space/tab/full-width space at the beginning of each line with \s. If there are n spaces, replace them with n \s.
- replace newline with \n.

After translation, replace those back to space and newline. Regex might be useful. Modify async_translate to do this.

To achieve this, we'll use the re module for handling regular expressions. Here's the modified async_translate function that replaces leading spaces/tabs/full width spaces with \s and newlines with \n before translation, and then reverts them back after translation:

```python
import re


async def async_translate(session, text):
    # Replace leading spaces/tabs/full-width spaces (U+3000) with \s, one \s per character
    text = re.sub(r'^([ \t\u3000]+)', lambda m: '\\s' * len(m.group(1)), text, flags=re.MULTILINE)

    # Replace newlines with \n
    text = text.replace('\n', '\\n')

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai.api_key}"
    }

    data = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": "user", "content": text}
        ]
    }

    async with session.post("https://api.openai.com/v1/chat/completions", headers=headers, json=data) as resp:
        response = await resp.json()
        translated_text = response["choices"][0]["message"]["content"]

    # Replace \n back to newline
    translated_text = translated_text.replace('\\n', '\n')

    # Replace \s back to spaces
    translated_text = re.sub(r'\\s', ' ', translated_text)

    return translated_text
```

Now, the async_translate function replaces the specified characters before sending the text for translation and reverts them back in the translated text.
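To check the escaping without calling the API, the two steps can be pulled out into pure functions and run as a round trip. This is only a sketch: the names encode_whitespace and decode_whitespace are hypothetical, and the behavior mirrors the function above, so every \s comes back as a regular space even if it was originally a tab or a full-width space.

```python
import re

# Leading run of space, tab, or full-width space (U+3000) at the start of each line.
LEADING_WS = re.compile(r"^[ \t\u3000]+", flags=re.MULTILINE)


def encode_whitespace(text):
    """Escape leading indentation as \\s tokens and newlines as \\n tokens."""
    text = LEADING_WS.sub(lambda m: "\\s" * len(m.group(0)), text)
    return text.replace("\n", "\\n")


def decode_whitespace(text):
    """Turn \\n tokens back into newlines and \\s tokens back into regular spaces."""
    text = text.replace("\\n", "\n")
    return text.replace("\\s", " ")


original = "title\n  indented\n\tdeeper"
encoded = encode_whitespace(original)
print(encoded)
print(decode_whitespace(encoded))
```

Here the tab before "deeper" is restored as a single space, so the round trip preserves indentation depth but not the kind of whitespace character. The round trip also relies on the model returning the \s and \n tokens unchanged, which is not guaranteed.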