Experimenting with real-time machine transcription

I present here some experimentation with the real-time speech-to-text transcription tools from Amazon and Google. A short video demo:

Transcription results from both streaming and batch processes, along with a clean, human-corrected transcript, are attached here as text files:

Transcription Experiment - OGM call 2020-09-10 - Human-corrected Transcription.txt (1.4 KB)
Transcription Experiment - OGM call 2020-09-10 - AWS Transcribe Streaming.txt (1.3 KB)
Transcription Experiment - OGM call 2020-09-10 - Google Speech-to-Text Streaming.txt (1.3 KB)
Transcription Experiment - OGM call 2020-09-10 - AWS Transcribe Batch.txt (1.5 KB)
Transcription Experiment - OGM call 2020-09-10 - Google Speech-to-Text Batch.txt (1.2 KB)
Transcription Experiment - OGM call 2020-09-10 - Google Speech-to-Text Batch with Punctuation.txt (1.2 KB)

(Some data are left out, such as timing and alternative, lower-confidence transcriptions.)
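One rough way to put a number on how close each machine transcript comes to the human-corrected one is word error rate (word-level edit distance divided by the length of the reference). The sketch below is an illustration, not how the attached files were produced; the function names are made up for this example, and it ignores case and punctuation so that punctuated and unpunctuated transcripts compare fairly.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split into words, dropping punctuation."""
    return re.findall(r"[a-z0-9'’]+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # Classic dynamic-programming edit distance, kept one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

To use it, read the human-corrected file as `reference` and each machine transcript as `hypothesis`; a lower score means fewer word-level corrections would be needed.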

Along with the practical rewards of transcribing OGM and similar calls, I think there are potential business opportunities here, and I am thinking about exploring them more seriously.

  • The underlying machine transcription technology is “good enough” (while not yet perfect, which means there is opportunity).
  • There are not many vendors offering productized versions of the technologies (biggest are probably Otter and Descript).
  • There are a number of potential models for productization that haven’t been explored.

Some potential products or services:

  • Transcribe virtual calls or physical meetings (the latter subject to pandemic precautions).
  • Real-time machine transcription into a collaborative text editor, where one or more humans could clean up and improve the transcription. The result could be near-live and near-perfect transcription.
  • Offer the above as a product, or offer the above as a service.
  • Use the cleaned-up near-live transcription as input for graphic facilitators, narrative facilitators, or highlight facilitators, who would feed images, narrative structure, or highlights back into the meeting.
  • Use the above to offer the ability for someone to attend to multiple calls simultaneously, as they multitask but keep up on each call’s written transcript.
  • Use machine transcription to recognize key phrases, then have a bot look up those phrases in Google, Jerry’s Brain, etc., as Jerry suggested recently on the OGM mailing list.
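As a minimal sketch of the key-phrase idea in the last bullet, the code below scans a transcript for phrases from a watch list and builds a Google search URL for each hit. The watch list, the function names, and the choice of a plain search URL are all assumptions made for illustration. A real bot would also need some tolerance for misrecognized words, which exact matching does not provide.

```python
import re
from urllib.parse import quote_plus

# Hypothetical watch list; in practice a human would curate this.
KEY_PHRASES = ["collective intelligence", "machine transcription"]


def find_key_phrases(transcript: str, phrases: list[str]) -> list[str]:
    """Return the watched phrases that occur in the transcript, in list order."""
    found = []
    for phrase in phrases:
        # Whole-phrase, case-insensitive match.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, transcript, flags=re.IGNORECASE):
            found.append(phrase)
    return found


def search_url(phrase: str) -> str:
    """Build a Google search URL for a phrase."""
    return "https://www.google.com/search?q=" + quote_plus(phrase)
```

Each finalized chunk of transcript could be passed to `find_key_phrases`, with the resulting links posted back to the call's chat or notes.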

Anyway, for this experiment, I played my check-in from the most recent OGM check-in call from YouTube, simulating a live Zoom call, and ran the audio into both Amazon and Google transcription in streaming (real-time) mode. I also ran the whole call through batch transcription with both Amazon and Google.

The transcription starts with the end of Tony’s check-in, then Jerry switching over to me, then my check-in, then Jerry switching over to Avril. My voice and speech quality is okay but not great - volume varies, and there are a number of filler words (um, uh) to navigate.

All of the machine transcriptions are good, but not perfect. Some quick observations:

  • This is a technology demo, of course, not a product yet. But it still shows great promise.
  • The Google transcription colors are red for “draft” transcription, and green for a completed chunk. The distracting duplication of lines is an artifact of the library I’m using, which can handle one draft line of text, but duplicates lines when there is more than one draft line of text.
  • Google’s streaming latency is really good – almost real-time. (This is similar to what I see on my Pixel smartphone with Google’s “Live Caption” feature, by the way.)
  • Amazon’s streaming latency is a little challenging – it sometimes takes seconds to catch up. I am guessing it’s not the backend, but rather that library I’m using to transfer the audio to the cloud, and it might be possible to get low latency with Amazon streaming, too.
  • Punctuation is really nice. The default for Google is “no punctuation”, but you can enable punctuation for Google, which I did with one Google batch sample. I missed enabling it for the streaming sample.
  • Less-common words and names are generally harder for the machines to recognize. This would make it hard to do the auto-look-up bot work without human assistance.
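On the draft-versus-completed display: both streaming services flag each result as either interim or final (Google's streaming results carry an `is_final` flag, and Amazon's carry `IsPartial`). A display can avoid duplicated lines by keeping one replaceable draft slot next to the list of completed lines. The class below is only a sketch of that idea, not the library used in the demo; the class name and the `[draft]` marker are invented for illustration, and it assumes at most one draft chunk at a time.

```python
class TranscriptView:
    """Keeps finalized lines plus a single, replaceable draft line."""

    def __init__(self) -> None:
        self.final_lines: list[str] = []
        self.draft: str = ""

    def update(self, text: str, is_final: bool) -> None:
        if is_final:
            # A completed chunk replaces whatever draft preceded it.
            self.final_lines.append(text)
            self.draft = ""
        else:
            # Overwrite the draft rather than appending a new line.
            self.draft = text

    def render(self) -> str:
        lines = list(self.final_lines)
        if self.draft:
            lines.append(f"[draft] {self.draft}")
        return "\n".join(lines)
```

Because each interim result overwrites the previous one, the screen shows a single draft line that keeps changing until the service marks the chunk as complete.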

I am expecting to continue to play around with these technologies – let me know if you’d like to be involved.

4 Likes

That’s excellent! So are you going to run this for all OGM calls, adding the transcripts to a post on this Discourse here (which might be good enough for OGM purposes)?

I’m open to it, but mostly I’m still experimenting.

Lauren, and I think Jerry, have posted some Otter transcripts for calls, and I’ve posted one or two mostly raw machine transcriptions of calls, but I haven’t seen any feedback about whether those are useful.

(To be fair, it’s not clear that people have really known about them. We’re still getting organized about posting call recordings and associated artifacts at all.)

So I guess I still have questions like these, which we are, of course, working on figuring out:

  • Is a batch machine transcription of a call useful for others?
  • Should we do that with Otter (more automated, more $), or “by hand” with AWS or Google (little $, but some more work)?
  • Is it useful for human(s) to improve the transcription itself? And/or, how does that fit in with speaker “swim lanes” on Miro and index of recordings?
  • How hard would it be to create an automation pipeline to replace the “by hand” creation of AWS or Google transcription?
  • Is AWS or Google (or both, automagically combined) better for OGM and similar calls? Or for other use cases?
  • Can collaborative real-time transcription be used (rather than batch transcription) to decrease the amount of post-call work in generating corrected transcripts, narrative arcs, indexes, etc.?
  • Are there business opportunities created by creating processes and tools for ourselves?
1 Like

Of course, I’m just softly pushing for it in case the OGM calls get expanded, potentially so much that the volume of recordings explodes and nobody has any way of catching up or mining material from it anymore. I’m wondering whether a global mind would benefit more if the tooling keeps pace with the content creation, for sensemaking and the like.

There are other groups who have talked about working with their piles of recordings in some way, or have partially tried some practical curation of them, but most of it got abandoned/scrapped because they’re always low on dev capacity, or intentionally “no-tech”/“low-tech” (an entirely confused idea, as there’s plenty of code written by somebody else doing all the tasks for them, and preventing other uses).

Therefore, it’s already an achievement of OGM that such technical work on handling + processing the transcripts is progressing. What this means or enables content- and sensemaking-wise, I think I’ve written quite a lot about that already (in many places or to many people, I might add), but it could indeed turn out to be the case that this is only useful/interesting to a small minority like ourselves, while a majority of attendees/participants much more enjoys just the community event aspect of it, and less the sensemaking/learning/analysis/coordination parts. Let’s file it as a “social experiment” :)

1 Like

Related work by Otter:

I think Siri, Ok-Google, Alexa, and Cortana have been doing this for some time now, and eventually it will get into Otter, Zoom, and Teams as well. It requires their big proprietary databases of collected human speech to maintain and improve decent speech recognition, with a lot of humans involved in curating/correcting it.

1 Like

Another vendor with value-added machine transcription.

“Meet Myna. Searchable smart scripts, video and audio recordings of your most important calls.” $9/user/mo after 30-day free trial.

Touted features:

  • SmartTranscript delivered to your inbox
  • Connects to Cloud and local recorders
  • Auto topics give a summary of what was said
  • No Minimum amount of users
  • Search by any word or phrase
  • Add custom topics
  • Grabs meetings direct from Dropbox
  • Editable so you can update mis-transcriptions
  • “Hot Button” to mark interesting points
  • Add Custom Vocabulary
  • Topic and person search
  • Forward the transcript to colleagues and customers
  • Jump to any point in the conversation
  • Alternative word search boosts retrieval
  • “Leave” your meeting, but still review it later
  • English, French, Spanish and Brazilian Portuguese
  • Helps with “Zoom Fatigue”
  • Calendar Driven Myna recorder
  • Grab any content from shared documents and presentations
  • Optional archive to Dropbox (Google drive coming soon)
  • “Attend” two meetings at once
2 Likes

Is there a “Myna your own business” mode? groan

1 Like

Great work @peterkaminski!

A couple of things to pop on the radar…

https://dictanote.co/ is another realtime transcription product. It is less polished than Otter, but I find the transcription impressive. I’m not sure what service they are using in the backend.

I have recently started using the airr.io podcast player. It has a feature where you can take “AirrQuotes” - a similar concept to adding a marker in an audio editing app. The idea is that you add your own written notes to capture the significance of something said. Where a full text transcription exists for the podcast, it picks that up and matches it with the audio snippet.
The interesting use case I can imagine in a meeting transcription product context would be “hmmm… Jerry just made an interesting point, I’m going to press the button to add a marker so that it is added to my ‘highlights list’ for the meeting”.
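A marker like that could be matched to the transcript using the timing data the transcription services return. Here is a minimal sketch, assuming the transcript is available as segments sorted by start time; the `Segment` type, the function names, and the default look-back window are all hypothetical. The look-back is there because people usually press the button just after the interesting point was made.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the recording
    text: str


def index_at(segments: list[Segment], t: float) -> int:
    """Index of the segment being spoken at time t (segments sorted by start)."""
    starts = [s.start for s in segments]
    # bisect_right finds the first segment starting after t; step back one.
    return max(bisect_right(starts, t) - 1, 0)


def highlight(segments: list[Segment], marker_time: float,
              lookback: float = 10.0) -> str:
    """Text spoken from `lookback` seconds before the marker up to the marker."""
    first = index_at(segments, marker_time - lookback)
    last = index_at(segments, marker_time)
    return " ".join(s.text for s in segments[first:last + 1])
```

Each button press would then add one `highlight(...)` snippet, plus the participant's own note, to their highlights list for the meeting.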

Airr have an integration with Roam, so that use case is a logical extension in a meeting context.

I’m keen to keep across this experiment.

1 Like

I want this. I do make notes like that but matching it with the bits is something that could be automated.

1 Like

Real-time exchange and relating of one’s own short notes would be great.
That is something that I would like to add to the peer-to-peer collaboration patterns in TrailMarks, which already gives people their own SlipStream and the possibility of near-real-time CRDT-based collaboration.

In the meantime, I just came across the work of Dan Whaley, founder of Hypothesis:

Dan Whaley (@dwhly):
So, I’ve prototyped a YouTube video annotator using @hypothes_is. It works for any (95%+) YT video w/ a transcript. Go to http://docdrop.org to try it. Feedback welcome!

https://docdrop.org/

I understand that all that is required is to have a transcript uploaded with the video.

Faced with hours-long amazing calls, it would accelerate idea flow if participants could simply annotate, comment, and conduct conversations right there.

2 Likes

This won’t happen super-soon, but NVIDIA has announced Maxine, a “Cloud-AI Video-Streaming Platform.”

“Maxine includes latest innovations from NVIDIA research such as face alignment, gaze correction, face re-lighting and real time translation in addition to capabilities such as super-resolution, noise removal, closed captioning and virtual assistants. These capabilities are fully accelerated on NVIDIA GPUs to run in real time video streaming applications in the cloud.”

Maxine uses NVIDIA’s Jarvis.

“Using Jarvis, developers can integrate virtual assistants to take notes, set action items, and answer questions in human-like voices. Additional conversational AI services such as translations, closed captioning and transcriptions help ensure everyone can understand what’s being discussed on the call.”

So theoretically someday, NVIDIA GPUs in the cloud will enable real-time transcription and translation for videoconferencing.

The first Zapp in the image is “Live Transcript”. :)

Zoom’s Zapps page:

1 Like

One of the Zapps launch partners is Rev captions and transcripts:

1 Like

More companies in the space!

Marsview Notes - Meeting Notes, Recording & Transcription (h/t @lovolution)

Peter,

Would be glad to see how I can help. I can think of the healthcare space where collective intelligence will be of great value in improving quality of care and making it accessible for everyone. There are multiple data sources and channels where data and information flow. I wonder if this is a good use case to provide value by providing the solution to capture, store, organize, link and unbox the collective intelligence? This will require collaboration with EMR companies, health exchanges, health systems etc. to gain more clarity of their respective current state and directions.

Also, I’ll be glad to try the product using our recorded Zoom meeting during our workshop, and maybe on some other personal meetings that I’ll be conducting. Just let me know what works for you.

Thanks,
-Romer