The good news about AI Joanna: She never loses her voice, she has outstanding posture and not even a convertible driving 120 mph through a tornado could mess up her hair.

The bad news: She can fool my family and trick my bank.

Maybe you’ve played around with chatbots like OpenAI’s ChatGPT and Google’s Bard, or image generators like Dall-E. If you thought they blurred the line between AI and human intelligence, you ain’t seen—or heard—nothing yet.

Over the past few months, I’ve been testing Synthesia, a tool that creates artificially intelligent avatars from recorded video and audio (aka deepfakes). Type in anything and your video avatar parrots it back.

Original translation: 龍騰網(wǎng) http://www.top-shui.cn (please credit the source when reposting)

Since I do a lot of voice and video work, I thought this could make me more productive, and take away some of the drudgery. That’s the AI promise, after all. So I went to a studio and recorded about 30 minutes of video and nearly two hours of audio that Synthesia would use to train my clone. A few weeks later, AI Joanna was ready.

Then I attempted the ultimate day off, Ferris Bueller style. Could AI me—paired with ChatGPT-generated text—replace actual me in videos, meetings and phone calls? It was…eye-opening or, dare I say, AI-opening. (Let’s just blame AI Joanna for my worst jokes.)

Eventually AI Joanna might write columns and host my videos. For now, she’s at her best illustrating the double-edged sword of generative-AI voice and video tools.

My video avatar looks like an avatar.
Video is a lot of work. Hair, makeup, wardrobe, cameras, lighting, microphones. Synthesia promises to eradicate that work, and that’s why corporations already use it. You know those boring compliance training videos? Why pay actors to star in a live-action version when AI can do it all? Synthesia charges $1,000 a year to create and maintain a custom avatar, plus an additional monthly subscription fee. It offers stock avatars for a lower monthly cost.

I asked ChatGPT to generate a TikTok script about an iOS tip, written in the voice of Joanna Stern. I pasted it into Synthesia, clicked “generate” and suddenly “I” was talking. It was like looking at my reflection in a mirror, albeit one that removes hand gestures and facial expressions. For quick sentences, the avatar can be quite convincing. The longer the text, the more her bot nature comes through. See for yourself in my video.

On TikTok, where people have the attention span of goldfish, those computer-like attributes are less noticeable. Still, some quickly picked up on it. For the record, I would rather eat live eels than utter the phrase “TikTok fam” but AI me had no problem with it.

The bot-ness got very obvious on work video calls. I downloaded clips of her saying common meeting remarks (“Hey everyone!” “Sorry, I was muted.”) then used software to pump them into Google Meet. Apparently AI Joanna’s perfect posture and lack of wit were dead giveaways.

All this will get better, though. Synthesia has some avatars in beta that can nod up and down, raise their eyebrows and more.

My AI voice sounds a lot like me.
When my sister’s fish died, could I have called with condolences? Yes. On a phone interview with Snap CEO Evan Spiegel, could I have asked every question myself? Sure. But in both cases, my AI voice was a convincing stand-in. At first.

I didn’t use Synthesia’s voice clone for those calls. Instead, I used one generated by ElevenLabs, an AI speech-software developer. My producer Kenny Wassus gathered about 90 minutes of my voice from previous videos and we uploaded the files to the tool—no studio visit needed. In under two minutes, it cloned my voice. In ElevenLabs’s web-based tool, type in any text, click Generate, and within seconds “my” voice says it aloud. Creating a voice clone with ElevenLabs starts at $5 a month.

Compared with Synthesia Joanna, the ElevenLabs me sounds more humanlike, with better intonations and flow.

My sister, whom I call several times a week, said the bot sounded just like me, but noticed the bot didn’t pause to take breaths. When I called my dad and asked for his Social Security number, he only knew something was up because it sounded like a recording of me.

The potential for misuse is real.
The ElevenLabs voice was so good it fooled my Chase credit card’s voice biometric system.

I cued AI Joanna up with several things I knew Chase would ask, then dialed customer service. At the biometric step, when the automated system asked for my name and address, AI Joanna responded. Hearing my bot’s voice, the system recognized it as me and immediately connected to a representative. When our video intern called and did his best Joanna impression, the automated system asked for further verification.

A Chase spokeswoman said the bank uses voice biometrics, along with other tools, to verify callers are who they say they are. She added that the feature is meant for customers to quickly and securely identify themselves, but to complete transactions and other financial requests, customers must provide additional information.

What’s most worrying: ElevenLabs made a very good clone without much friction. All I had to do was click a button saying I had the “necessary rights or consents” to upload audio files and create the clone, and that I wouldn’t use it for fraudulent purposes.

That means anyone on the internet could take hours of my voice—or yours, or Joe Biden’s or Tom Brady’s—to save and use. The Federal Trade Commission is already warning about AI-voice related scams.

Synthesia requires that the audio and video include verbal consent, which I did when I filmed and recorded with the company.

ElevenLabs only allows cloning in paid accounts, so any use of a cloned voice that breaks company policies can be traced to an account holder, company co-founder Mati Staniszewski told me. The company is working on an authentication tool so people can upload any audio to check if it was created using ElevenLabs technology.

Both systems allowed me to generate some horrible things in my voice, including death threats.

A Synthesia spokesman said my account was designated for use with a news organization, which means it can say words and phrases that might otherwise be filtered. The company said its moderators flagged and deleted my problematic phrases later on. When my account was changed to the standard type, I was no longer able to generate those same phrases.

Mr. Staniszewski said ElevenLabs can identify all content made with its software. If content breaches the company’s terms of service, he added, ElevenLabs can ban its originating account and, in case of law breaking, assist authorities.

This stuff is hard to spot.
When I asked Hany Farid, a digital-forensics expert at the University of California, Berkeley, how we can spot synthetic audio and video, he had two words: good luck.

“Not only can I generate this stuff, I can carpet-bomb the internet with it,” he said, adding that you can’t make everyone an AI detective.

Sure, my video clone is clearly not me, but it will only get better. And if my own parents and sister can’t really hear the difference in my voice, can I expect others to?

I got a bit of hope from hearing about the Adobe-led Content Authenticity Initiative. Over 1,000 media and tech companies, academics and more aim to create an embedded “nutrition label” for media. Photos, videos and audio on the internet might one day come with verifiable information attached. Synthesia is a member of the initiative.

I feel good about being a human.
Unlike AI Joanna, who never smiles, real Joanna had something to smile about after this. ChatGPT generated text lacking my personality and expertise. My video clone was lacking the things that make me me. And while my video producer likes using my AI voice in early edits to play with timing, my real voice has more energy, emotion and cadence.

Will AI get better at all of that? Absolutely. But I also plan to use these tools to afford me more time to be a real human. Meanwhile, I’m at least sitting up a lot straighter in meetings now.
