Check out this ShareGPT conversation

Model: GPT-4

I want to write Python code that takes a JSON file containing a Japanese document and outputs an English-translated JSON file.

# JSON file format
```
{
  "pages": [
    {
      "title": "page1title",
      "lines": [
        "page1title",
        "line2",
        "line3"
      ]
    },
    {
      "title": "page2title",
      "lines": [
        "page2title",
        "line2",
        "line3"
      ]
    },
    ...
  ]
}
```

# What to do
1. Receive the JSON file.
2. Extract only the titles of all pages, translate them into English, and build a dictionary that pairs the Japanese and English titles.
3. Translate the body (lines) of each page. Note: the body contains links to other pages' titles, enclosed in square brackets as [title]. Those titles must be translated word for word according to the English translation held in the dictionary.

# Translation method
Use the ChatGPT API. Sample input and output for the ChatGPT API are shown below. I want to tell the assistant to translate into English and use its output. When there are conditions or caveats for the translation, tell it those as well.

```sample call
# Note: you need to be using OpenAI Python v0.27.0 for the code below to work
import openai

openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)
```

```sample response
{
 'id': 'chatcmpl-6p9XYPYSTTRi0xEviKjjilqrWU2Ve',
 'object': 'chat.completion',
 'created': 1677649420,
 'model': 'gpt-3.5-turbo',
 'usage': {'prompt_tokens': 56, 'completion_tokens': 31, 'total_tokens': 87},
 'choices': [
   {
    'message': {
      'role': 'assistant',
      'content': 'The 2020 World Series was played in Arlington, Texas at the Globe Life Field, which was the new home stadium for the Texas Rangers.'},
    'finish_reason': 'stop',
    'index': 0
   }
  ]
}
```

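The Python code below is designed to meet the stated requirements. It takes a Japanese JSON file as input, translates it into English, and writes out the translated JSON file.

First, a translate() function performs the translation using the ChatGPT API. The main routine then reads the JSON file and carries out the translation following the specified steps.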

import json
import openai

def translate(text, role="user"):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": role, "content": text}
        ]
    )
    return response.choices[0].message.content

def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    for page in data['pages']:
        translated_title = translate(page['title'])
        title_translation_dict[page['title']] = translated_title
        page['title'] = translated_title

    # Translate lines with translated titles replaced
    for page in data['pages']:
        translated_lines = []
        for line in page['lines']:
            for jp_title, en_title in title_translation_dict.items():
                line = line.replace(f"[{jp_title}]", f"[{en_title}]")
            translated_line = translate(line)
            translated_lines.append(translated_line)
        page['lines'] = translated_lines

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Replace 'input.json' and 'output.json' with your desired file paths
translate_json_file('input.json', 'output.json')

Before running this code, you need to install the openai module and set your OpenAI API key.

pip install openai

You can set the API key inside the Python code as follows.

import openai

openai.api_key = "your_api_key_here"

You can also set the API key as an environment variable. To do so, run the following command (the environment variable name is OPENAI_API_KEY).

export OPENAI_API_KEY=your_api_key_here

This code translates the specified JSON file and writes a new JSON file containing the translated text. The file names are passed as arguments to the translate_json_file function.
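
The link-replacement step in the code above can also be done in a single pass with a regular expression, which avoids looping over every dictionary entry for each line. The following is a minimal sketch, assuming links always appear as `[title]` with no nested brackets; `replace_title_links` is a hypothetical helper name and is not part of the code above.

```python
import re

# Matches the text between one pair of square brackets, e.g. [タイトル].
LINK_PATTERN = re.compile(r"\[([^\[\]]+)\]")

def replace_title_links(line, title_translation_dict):
    """Swap each [Japanese title] link for its English title from the dictionary.

    Bracketed text that is not a known title is left unchanged.
    """
    def substitute(match):
        jp_title = match.group(1)
        en_title = title_translation_dict.get(jp_title)
        return f"[{en_title}]" if en_title is not None else match.group(0)

    return LINK_PATTERN.sub(substitute, line)

# Example dictionary (illustrative data only).
titles = {"東京": "Tokyo", "京都": "Kyoto"}
print(replace_title_links("[東京]から[京都]へ、[大阪]経由", titles))
```

Because unknown bracketed text passes through untouched, ordinary uses of square brackets in the body are not affected.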

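Thank you. The code above calls translate() once for every title and every line, but each call adds to the API cost, so I want to keep the number of calls to a minimum. However, the maximum number of characters that can be passed to ChatCompletion is about 8000. When translating titles, please translate as many as fit within that limit in one call. When translating each page, please translate the whole page in one call, splitting it as needed if it runs into the character limit.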

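I have modified the code to minimize the number of API calls, as requested. The code below translates as many titles as possible per call and translates each page in a single call. If the text exceeds the character limit, it is split as needed.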

import json
import openai

MAX_TOKENS = 8000

def translate(text, role="user"):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": role, "content": text}
        ]
    )
    return response.choices[0].message.content

def translate_titles(title_list):
    translated_titles = []
    title_chunk = ""

    for title in title_list:
        if len(title_chunk) + len(title) + 1 < MAX_TOKENS:
            title_chunk += title + "\n"
        else:
            translated_chunk = translate(title_chunk, role="title_translation")
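            # strip() rather than [:-1]: the reply may lack a trailing newline, and [:-1] would then drop the last title
            translated_titles.extend(translated_chunk.strip().split("\n"))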
            title_chunk = title + "\n"

    if title_chunk:
        translated_chunk = translate(title_chunk, role="title_translation")
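        # strip() rather than [:-1]: the reply may lack a trailing newline, and [:-1] would then drop the last title
        translated_titles.extend(translated_chunk.strip().split("\n"))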

    return translated_titles

def translate_page(page_text):
    if len(page_text) <= MAX_TOKENS:
        return translate(page_text, role="page_translation")
    else:
        split_point = page_text.rfind("\n", 0, MAX_TOKENS)
        first_half = page_text[:split_point]
        second_half = page_text[split_point + 1:]
        return translate_page(first_half) + "\n" + translate_page(second_half)

def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    title_list = [page['title'] for page in data['pages']]
    translated_titles = translate_titles(title_list)

    for original_title, translated_title in zip(title_list, translated_titles):
        title_translation_dict[original_title] = translated_title

    for page, translated_title in zip(data['pages'], translated_titles):
        page['title'] = translated_title

    # Translate lines with translated titles replaced
    for page in data['pages']:
        page_text = "\n".join(page['lines'])

        for jp_title, en_title in title_translation_dict.items():
            page_text = page_text.replace(f"[{jp_title}]", f"[{en_title}]")

        translated_text = translate_page(page_text)
        page['lines'] = translated_text.split("\n")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Replace 'input.json' and 'output.json' with your desired file paths
translate_json_file('input.json', 'output.json')

This code defines a translate_titles function for translating titles and a translate_page function for translating pages.
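
Note that `MAX_TOKENS` in the code above is compared against `len()`, so it acts as a character budget rather than a token count. The batching logic can be sketched on its own as a pure function, which makes it easy to check without calling the API. `chunk_lines` and `max_chars` are hypothetical names used only for illustration.

```python
def chunk_lines(lines, max_chars):
    """Group lines into newline-joined chunks of at most max_chars characters.

    A single line longer than max_chars is kept as its own oversized chunk
    rather than being cut in the middle.
    """
    chunks = []
    current = []
    current_len = 0
    for line in lines:
        # +1 accounts for the newline that joins this line to the previous one.
        added = len(line) + (1 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append("\n".join(current))
            current = [line]
            current_len = len(line)
        else:
            current.append(line)
            current_len += added
    if current:
        chunks.append("\n".join(current))
    return chunks

print(chunk_lines(["aaa", "bbb", "cc", "dddd"], max_chars=7))
```

Each chunk would then be sent in one API call. Since the real limit is measured in tokens, a character budget well below the model's limit is the safer assumption.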

Thank you. I believe ChatCompletion is currently being called synchronously; please make this parallel and asynchronous.

Understood. The code below uses aiohttp and asyncio to call ChatCompletion asynchronously and in parallel. This should improve translation performance.

First, install aiohttp. asyncio is part of the Python standard library and does not need to be installed.

pip install aiohttp

Next, run the following code.

import json
import asyncio
import aiohttp
import openai

MAX_TOKENS = 8000

async def async_translate(session, text, role="user"):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai.api_key}"
    }

    data = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": role, "content": text}
        ]
    }

    async with session.post("https://api.openai.com/v1/chat/completions", headers=headers, json=data) as resp:
        response = await resp.json()
        return response["choices"][0]["message"]["content"]

async def translate_titles(session, title_list):
    translated_titles = []
    title_chunk = ""

    for title in title_list:
        if len(title_chunk) + len(title) + 1 < MAX_TOKENS:
            title_chunk += title + "\n"
        else:
            translated_chunk = await async_translate(session, title_chunk, role="title_translation")
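            # strip() rather than [:-1]: the reply may lack a trailing newline, and [:-1] would then drop the last title
            translated_titles.extend(translated_chunk.strip().split("\n"))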
            title_chunk = title + "\n"

    if title_chunk:
        translated_chunk = await async_translate(session, title_chunk, role="title_translation")
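        # strip() rather than [:-1]: the reply may lack a trailing newline, and [:-1] would then drop the last title
        translated_titles.extend(translated_chunk.strip().split("\n"))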

    return translated_titles

async def translate_page(session, page_text):
    if len(page_text) <= MAX_TOKENS:
        return await async_translate(session, page_text, role="page_translation")
    else:
        split_point = page_text.rfind("\n", 0, MAX_TOKENS)
        first_half = page_text[:split_point]
        second_half = page_text[split_point + 1:]
        first_half_translated = await async_translate(session, first_half, role="page_translation")
        second_half_translated = await translate_page(session, second_half)
        return first_half_translated + "\n" + second_half_translated

async def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    title_list = [page['title'] for page in data['pages']]

    async with aiohttp.ClientSession() as session:
        translated_titles = await translate_titles(session, title_list)

        for original_title, translated_title in zip(title_list, translated_titles):
            title_translation_dict[original_title] = translated_title

        for page, translated_title in zip(data['pages'], translated_titles):
            page['title'] = translated_title

        # Translate lines with translated titles replaced
        for page in data['pages']:
            page_text = "\n".join(page['lines'])

            for jp_title, en_title in title_translation_dict.items():
                page_text = page_text.replace(f"[{jp_title}]", f"[{en_title}]")

            translated_text = await translate

continue from this line

My apologies for the incomplete response. Here is the complete code:

            translated_text = await translate_page(session, page_text)
            page['lines'] = translated_text.split("\n")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Replace 'input.json' and 'output.json' with your desired file paths
async def main():
    await translate_json_file('input.json', 'output.json')

asyncio.run(main())

In the updated code, we have defined the main function as asynchronous, and then use asyncio.run(main()) to run the async translate_json_file function. This way, the translation process takes advantage of asynchronous processing, which should improve the performance of the translation tasks.

Thank you. Two corrections: the role should always be either system or user. Also, I want the translate_page calls to run concurrently.

Understood. The code below always sets the role to system or user and runs the translate_page calls concurrently.

import json
import asyncio
import aiohttp
import openai

MAX_TOKENS = 8000

async def async_translate(session, text):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai.api_key}"
    }

    data = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": "user", "content": text}
        ]
    }

    async with session.post("https://api.openai.com/v1/chat/completions", headers=headers, json=data) as resp:
        response = await resp.json()
        return response["choices"][0]["message"]["content"]

async def translate_titles(session, title_list):
    translated_titles = []
    title_chunk = ""

    for title in title_list:
        if len(title_chunk) + len(title) + 1 < MAX_TOKENS:
            title_chunk += title + "\n"
        else:
            translated_chunk = await async_translate(session, title_chunk)
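            # strip() rather than [:-1]: the reply may lack a trailing newline, and [:-1] would then drop the last title
            translated_titles.extend(translated_chunk.strip().split("\n"))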
            title_chunk = title + "\n"

    if title_chunk:
        translated_chunk = await async_translate(session, title_chunk)
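        # strip() rather than [:-1]: the reply may lack a trailing newline, and [:-1] would then drop the last title
        translated_titles.extend(translated_chunk.strip().split("\n"))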

    return translated_titles

async def translate_page(session, page_text):
    if len(page_text) <= MAX_TOKENS:
        return await async_translate(session, page_text)
    else:
        split_point = page_text.rfind("\n", 0, MAX_TOKENS)
        first_half = page_text[:split_point]
        second_half = page_text[split_point + 1:]
        first_half_translated = await async_translate(session, first_half)
        second_half_translated = await translate_page(session, second_half)
        return first_half_translated + "\n" + second_half_translated

async def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    title_list = [page['title'] for page in data['pages']]

    async with aiohttp.ClientSession() as session:
        translated_titles = await translate_titles(session, title_list)

        for original_title, translated_title in zip(title_list, translated_titles):
            title_translation_dict[original_title] = translated_title

        for page, translated_title in zip(data['pages'], translated_titles):
            page['title'] = translated_title

        # Translate lines with translated titles replaced
        translation_tasks = []

        for page in data['pages']:
            page_text = "\n".join(page['lines'])

            for jp_title, en_title in title_translation_dict.items():
                page_text = page_text.replace(f"[{jp_title}]", f"[{en_title}]")

            translation_tasks.append(translate_page(session, page_text))

        translated_texts = await asyncio.gather(*translation_tasks)

        for page, translated_text in zip(data['pages'], translated_texts):
            page['lines'] = translated_text.split("\n")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Replace 'input

continue

Here is the continuation of the code:

# Replace 'input.json' and 'output.json' with your desired file paths
async def main():
    await translate_json_file('input.json', 'output.json')

asyncio.run(main())

In this updated code, the translate_page tasks are executed concurrently using asyncio.gather. This should further improve the performance of the translation process.
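
With many pages, `asyncio.gather` starts every request at once, which may run into API rate limits. One common way to cap the number of in-flight requests is an `asyncio.Semaphore`. The sketch below uses a stand-in coroutine, `fake_translate`, instead of a real API call, and `max_concurrency` is an assumed parameter; a suitable value depends on the limits of your account.

```python
import asyncio

async def fake_translate(text):
    """Stand-in for async_translate: waits briefly, then returns upper-cased text."""
    await asyncio.sleep(0.01)
    return text.upper()

async def translate_all(texts, max_concurrency=3):
    """Run fake_translate on all texts, at most max_concurrency at a time.

    Results come back in the same order as the inputs, as with asyncio.gather.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(text):
        # Only max_concurrency coroutines can be inside this block at once.
        async with semaphore:
            return await fake_translate(text)

    return await asyncio.gather(*(limited(t) for t in texts))

results = asyncio.run(translate_all(["page one", "page two", "page three", "page four"]))
print(results)
```

In the real code, `limited` would wrap `translate_page(session, page_text)` in place of `fake_translate`.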

Before sending text to OpenAI:
- Replace the indenting spaces/tabs/full-width spaces at the beginning of each line with \s. If there are n spaces, replace them with n \s.
- Replace each newline with \n.

After translation, replace those back to spaces and newlines. Regex might be useful. Modify async_translate to do this.

To achieve this, we'll use the re module for handling regular expressions. Here's the modified async_translate function that replaces leading spaces/tabs/full width spaces with \s and newlines with \n before translation, and then reverts them back after translation:

import re

async def async_translate(session, text):
    # Replace leading spaces/tabs/full width spaces with \s
    text = re.sub(r'^([ \t　]+)', lambda m: '\\s' * len(m.group(1)), text, flags=re.MULTILINE)

    # Replace newlines with \n
    text = text.replace('\n', '\\n')

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai.api_key}"
    }

    data = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": "user", "content": text}
        ]
    }

    async with session.post("https://api.openai.com/v1/chat/completions", headers=headers, json=data) as resp:
        response = await resp.json()
        translated_text = response["choices"][0]["message"]["content"]

    # Replace \n back to newline
    translated_text = translated_text.replace('\\n', '\n')

    # Replace \s back to spaces
    translated_text = re.sub(r'\\s', ' ', translated_text)

    return translated_text

Now, the async_translate function replaces the specified characters before sending the text for translation and reverts them back in the translated text.
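
The pre- and post-processing can also be factored into two small functions so that the round trip can be checked without calling the API. This is a sketch, with `encode_whitespace` and `decode_whitespace` as hypothetical names. As in the code above, every restored indent character becomes an ordinary space, so tabs and full-width spaces are not preserved, and a literal backslash-n or backslash-s already present in the text would also be converted.

```python
import re

def encode_whitespace(text):
    """Replace each leading space/tab/full-width space with \\s and each newline with \\n."""
    # \u3000 is the full-width (ideographic) space.
    text = re.sub(
        r'^[ \t\u3000]+',
        lambda m: '\\s' * len(m.group(0)),
        text,
        flags=re.MULTILINE,
    )
    return text.replace('\n', '\\n')

def decode_whitespace(text):
    """Turn the literal markers \\n and \\s back into a newline and a space."""
    return text.replace('\\n', '\n').replace('\\s', ' ')

original = "title\n  indented\n\t\u3000deeper"
encoded = encode_whitespace(original)
print(encoded)
print(decode_whitespace(encoded))
```

The model is not guaranteed to keep the markers intact, so it may help to state in the system message that `\s` and `\n` must be preserved exactly.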