Transcribing can be a time-consuming task. Whether you’re a journalist, lawyer, student, or in any other profession where transcribing from audio/video to text forms part of the job, finding an accurate and reliable way to do it automatically could save you valuable hours.
Today there are a number of services available that can do this. However, those that employ humans to transcribe your audio, although generally very accurate, are expensive.
As with many lines of work in 2020, AI and computer software are making headway into this arena and offer a much more cost-effective way of turning your audio into text.
One such service that we have put under the microscope lately is Transcribe by Wreally. There’s been such a good buzz about them that we decided to strap on the cynical hat and give their services a trial, to see how they hold up to closer scrutiny.
What follows is an in-depth look into Transcribe’s services, plus some of our thoughts on the good and bad points of using them for your transcribing needs.
Who is Wreally Transcribe
Wreally is a limited liability company (LLC) specializing in software product development, with a focus on developing web applications that can be deployed through the cloud. Its products include:
- Transcribe – Cloud-based transcription software.
- Codassium – An application for conducting interviews.
- Scribble – An application for note-taking.
This review focuses on Transcribe.
A Quick Overview of Wreally Transcribe Services
Transcribe is transcription software hosted on remote servers and delivered under the software-as-a-service (SaaS) model. Its speech-to-text artificial intelligence (STT-AI) engine uses the computing resources of the host server to perform the transcription, then outputs the machine-generated transcript to the user through a browser-based app.
This app is built as a suite that includes a transcription engine, media player, and an integrated text editor with word expander capability. It thus operates as a speech recognition system with voice recognition and STT functionalities.
Transcribe supports the following 7 audio formats: mp3, wma, wav, m4a, aac, amr, and mp4. One can also upload a YouTube video directly from YouTube’s servers into the app’s database for transcription.
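As a quick illustration, the format list above can be turned into a pre-upload check. This is only a sketch: the function name `is_supported` and the idea of checking the file extension locally are assumptions for illustration, not part of Transcribe itself.

```python
from pathlib import Path

# The seven formats listed in this review.
SUPPORTED_FORMATS = {"mp3", "wma", "wav", "m4a", "aac", "amr", "mp4"}


def is_supported(filename: str) -> bool:
    """Return True if the file extension is one of the listed formats."""
    # Hypothetical helper: compares only the extension, not the file contents.
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension in SUPPORTED_FORMATS


print(is_supported("interview.MP3"))   # True
print(is_supported("lecture.flac"))    # False
```

A check like this only looks at the file name, so a mislabeled file would still pass.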
It also offers two types of license, individual and group. Regardless of the license type chosen, the first week of use is free of charge.
Furthermore, Transcribe provides multilingual support, and can convert audio files made in the following languages: American and British English, Mandarin, Arabic, Portuguese, Spanish, French, German, Hebrew, Hindi, Japanese, Korean, Italian, Dutch, Russian, Romanian, Catalan, and Greek.
The STT transcription done by Transcribe rests on an algorithmic framework that lets its speech recognition engine structure speech into a tiered hierarchy, as described below.
Phonemes are identified, and these phonemes are used to build words based on the probabilistic rules set in the AI, which match each spoken word to its dictionary equivalent.
Thereafter, the words are combined to form syntactically acceptable phrases based on deterministic rules set in the AI. Finally, the phrases are used to create sentences, which are parsed by the grammar engine. This can cause Transcribe to dismiss syntactically unsound sentences, even though they are audibly clear in the audio file.
The deterministic rules are used to create an artificial neural network based on UPM, HMM, and LSTM models.
Basically, this means that rules are used to create an artificial neural network in the AI that can identify, process, and transcribe speech into text.
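To make the phoneme-to-word step more concrete, here is a toy sketch of picking the most probable dictionary word for a recognized phoneme sequence. Everything in it is hypothetical: the `LEXICON` table, its phoneme symbols, and its probabilities are invented for illustration, and a real engine would be far more involved.

```python
# Toy pronunciation lexicon: phoneme sequence -> candidate words.
# The probabilities are made up for illustration only.
LEXICON = {
    ("T", "UW"): {"two": 0.5, "too": 0.3, "to": 0.2},
    ("HH", "AH", "L", "OW"): {"hello": 1.0},
}


def most_probable_word(phonemes):
    """Return the likeliest word for a phoneme sequence, or None if unknown."""
    candidates = LEXICON.get(tuple(phonemes))
    if not candidates:
        return None
    # Pick the candidate with the highest (hypothetical) probability.
    return max(candidates, key=candidates.get)


print(most_probable_word(["T", "UW"]))       # two
print(most_probable_word(["Z", "Z", "Z"]))   # None
```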
How to use Transcribe by Wreally – the Process
To use the Transcribe web-app for transcription, one needs to carry out the following steps.
- Sign up by providing your full name and a working email address, and by creating a unique password to be associated with the account.
- Choose the license type from the two available options: individual license or group license. Thereafter, the account is set up and its login credentials are provided. Regardless of the license chosen, Transcribe does not charge anything for the first week of use.
- Log in to Transcribe. At this point, one is moved away from the webpage to the hosted app domain and can start using Transcribe as a SaaS.
- Upload the audio file. The supported audio formats are mp3, wma, wav, m4a, aac, amr, and mp4 (which is also a video format). One can also load a YouTube video into the app by copying its web address (uniform resource locator, or URL) into the address bar provided via the YouTube Load button.
- Choose the mode of transcription and language. Transcribe provides 2 modes:
The dictation mode lets one play the audio clip through headphones and speak what (s)he hears into the microphone.
This is a form of read speech, and it eliminates most of the demerits associated with spontaneous speech. In Wreally’s terms of service (ToS), this dictation mode is charged as Automatic Transcription and is billed at USD 6 per hour, in addition to the flat rate of USD 20 per year.
The second mode plays a few seconds of the clip, then pauses and rewinds this segment, which serves as an audio sample. One then types what (s)he hears into the text editor, which automatically saves any text typed.
The autosave interval is set at 15 seconds by default, which means typed text is saved automatically every 15 seconds. The app also allows one to set the length of the audio segment. According to Wreally’s ToS, one only needs to pay the annual flat rate to use this mode.
- Edit the text in the document to remove errors in the machine-generated transcript.
- Export the document as a .doc file. This file can be opened by popular word processors such as Microsoft Word and LibreOffice Writer.
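The pricing mentioned in the steps above can be sketched as a small calculation. The two rates come from this review; the function name `yearly_cost`, the assumption that automatic transcription is billed linearly per hour, and the decision to ignore the free first week are simplifications for illustration.

```python
ANNUAL_FLAT_RATE_USD = 20.0          # flat yearly fee quoted in this review
AUTOMATIC_RATE_USD_PER_HOUR = 6.0    # per-hour fee for automatic transcription


def yearly_cost(automatic_hours: float = 0.0) -> float:
    """Estimate one year's cost; assumes simple linear per-hour billing."""
    if automatic_hours < 0:
        raise ValueError("automatic_hours must not be negative")
    return ANNUAL_FLAT_RATE_USD + AUTOMATIC_RATE_USD_PER_HOUR * automatic_hours


print(yearly_cost())      # 20.0 (typing mode only)
print(yearly_cost(10))    # 80.0 (ten hours of automatic transcription)
```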
Here’s a look at some of the built-in features that Transcribe contains:
Integrated Text Editor
Transcribe’s integrated text editor is synced with the media player so that time stamps can be included in the text. This is quite useful for speeches and video subtitles, as it allows one to match each subtitle segment to the right video frames based on their time stamps.
The integrated text editor is a fully functional word processor that supports document formatting, as well as text formatting using the options of bold, italics, underline, list creation, and addition of hyperlinks.
This allows one to edit the text document to remove errors and reduce the word error rate (WER) to almost 0%. Also, because the editor can be set to expand specific shorthand strings, such as imo into ‘in my opinion’, it is considered to have a word expander capability.
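A word expander of the kind described above can be approximated in a few lines. This is a minimal sketch, not Transcribe’s implementation: the `SHORTCUTS` table and the `expand` function are hypothetical, and only the imo example comes from this review.

```python
import re

# Shorthand table; "imo" is the example from this review, "btw" is made up.
SHORTCUTS = {"imo": "in my opinion", "btw": "by the way"}


def expand(text: str, shortcuts: dict = SHORTCUTS) -> str:
    """Replace whole-word shorthand strings with their expansions."""
    if not shortcuts:
        return text
    # \b keeps "imo" from matching inside longer words such as "limo".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in shortcuts) + r")\b"
    )
    return pattern.sub(lambda match: shortcuts[match.group(1)], text)


print(expand("imo the limo is late"))   # in my opinion the limo is late
```

The match is case-sensitive here, so "IMO" would be left untouched unless it is added to the table.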
Foot Pedal Support
Transcribe also supports a transcriber foot pedal, an input device used for controlling media playback.
Transcribe uses Google’s dictionary in its integrated text editor. For this reason, the web app is best used with the Chrome browser on Windows or macOS, or with the Chromium browser on Linux or a BSD-based operating system.
Also, the autosave feature in the integrated text editor saves the text in the browser’s cache memory, so one must not clear the cache before exporting the text, or the saved text will be lost.
As mentioned, Transcribe supports multilingual transcription as it uses several dictionaries, each dictionary being specific to a particular language.
Currently, it supports American and British English, Mandarin, Arabic, Portuguese, Spanish, French, German, Hebrew, Hindi, Japanese, Korean, Italian, Dutch, Russian, Romanian, Catalan, and Greek.
Factors that Impact Transcriptions using Transcribe
The quality of the machine-generated transcription obtained using the dictation mode is determined by the quality of audio residue that is fed into the STT-AI.
This audio residue is determined by the audio quality of the spoken-word clip. Therefore, the source audio clip impacts the quality of VRAI-powered transcription done by Transcribe.
The following qualities of the source audio clip affect the quality of the machine-generated transcript:
- Quality of sound in the audio recording or video file.
- Background noise, including echoes and noises caused by electrical phenomena like poor grounding of the microphone’s amplifier. It can even include factory noise or traffic noise.
- The accent of the speaker, as well as the nasality, loudness, and speed of his/her speech.
- The distance of the speaker from the microphone.
- Type of audio encoding used.
- Quality of audio encoding used.
As expected, the dictation mode eliminates most of the faults in the source audio file by requiring the user to dictate clean audio to the app, which ensures that Transcribe works with a high-quality audio input.
Even so, if the user cannot make out the words due to loud background noise, poor source audio quality, or an inability to understand the accent, then some words may be missed.
Also, there are faults that depend on the speech recognition system (SRS) and affect the quality of the machine-generated transcript. These faults are:
- Size and range of vocabulary in its database.
- Confusability – The SRS can confuse two different words that have similar pronunciation (i.e., homophones) or words that are mispronounced.
- Syntactic constraint – The SRS contains a grammar engine for creating human-readable text, and this engine can discard syntactically unsound or awkward sentences such as “the mango gave birth” even if the words are audibly clear in the speech.
Other factors that affect the quality of the machine-generated transcript are:
Capability to transcribe continuous speech
Also, the capacity of the STT-AI of the SRS to transcribe isolated and discontinuous speech affects the transcription quality.
Normally, in discontinuous speech, full sentences are separated from each other by silence, which allows Transcribe to immediately differentiate between two sentences, and include the right punctuation marks.
Transcribe can also transcribe continuous speech that has been recorded at the speed of normal human conversation.
This means that one cannot submit an audio clip playing at twice the normal conversation speed for quick transcription.
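The silence-based sentence splitting described above for discontinuous speech can be sketched as follows. This is an assumption-laden toy: the word timings, the `punctuate` function, and the 0.7-second `pause_threshold` are all hypothetical, and the review does not say how Transcribe actually detects sentence boundaries.

```python
def punctuate(words, pause_threshold=0.7):
    """Join timed words into sentences, ending one at each long silence.

    words: list of (text, start_seconds, end_seconds) tuples.
    pause_threshold: hypothetical minimum silence that ends a sentence.
    """
    sentences = []
    current = []
    previous_end = None
    for text, start, end in words:
        long_pause = (
            previous_end is not None and start - previous_end >= pause_threshold
        )
        if current and long_pause:
            sentences.append(" ".join(current) + ".")
            current = []
        current.append(text)
        previous_end = end
    if current:
        sentences.append(" ".join(current) + ".")
    return " ".join(sentences)


timed_words = [
    ("hello", 0.0, 0.4), ("there", 0.5, 0.9),
    ("welcome", 2.0, 2.5), ("back", 2.6, 3.0),
]
print(punctuate(timed_words))   # hello there. welcome back.
```

In continuous speech the gaps between words rarely reach such a threshold, which is one way to see why punctuation is harder to place there.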
Ability to work on spontaneous speech versus a read speech
This is determined by the quality of enrollment of the AI powering the SRS. Moreover, spontaneous speech has a limited vocabulary and contains stuttering, incomplete sentences, and frequent interjections (such as ah, oh, umm); as well as sounds of laughter, coughing, and even nose blowing!
On the other hand, read speech is usually vocabulary-rich and is made at a measured speech speed with discernible silence pauses between sentences.
It is for this reason that the dictation mode requires one to convert spontaneous speech into read speech by speaking what one hears back to the app.
Accuracy of the machine-generated transcript
The accuracy of any machine-generated transcript is determined by the number of mistakes found in this transcript when it is compared to a manual transcript made by a specialist who has exactly reproduced what was said in the audio recording.
This accuracy is measured using the word error rate (WER), where a lower WER means a more accurate transcript. It can be expressed as an equation:
WER (%) = (Number of word mistakes in the machine-generated transcript / Total number of words in the accurate manual transcript) x 100
As expected, the accurate manual transcript is assumed to have a word recognition rate of 100 percent. Also, the WER increases with the length of the audio clip.
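The equation above can be computed with a few lines of code. In this sketch the number of mistakes is counted as the word-level edit distance (substitutions, deletions, and insertions) between the two transcripts, which is the usual convention but an assumption here, since the review does not say how mistakes are counted. The function name `word_error_rate` is also just for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100 * previous[-1] / len(ref)


manual = "the quick brown fox jumps"
machine = "the quick brown box jumps"
print(word_error_rate(manual, machine))   # 20.0
```

Here one wrong word out of five in the manual transcript gives a WER of 20 percent.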
To reduce WER, Wreally’s Transcribe takes each word recognized by its speech recognition engine and looks up its equivalent in the dictionary to obtain the spelling, and to check whether the pronunciation of the spoken word closely approximates the dictionary-form pronunciation.
This referencing is done using dynamic string alignment, so that each phoneme or character of the recognized spoken word is compared to the corresponding phoneme/character of its dictionary equivalent.
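A character-level alignment of this kind can be sketched with a standard dynamic-programming edit distance. Note the assumptions: the review does not specify which alignment algorithm Transcribe uses, so Levenshtein-style alignment stands in for it here, and the names `align` and `closest_entry` as well as the tiny word list are hypothetical.

```python
def align(recognized: str, dictionary_word: str):
    """Return (edit distance, aligned character pairs); '-' marks a gap."""
    rows, cols = len(recognized) + 1, len(dictionary_word) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            mismatch = recognized[i - 1] != dictionary_word[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # extra recognized character
                cost[i][j - 1] + 1,             # missing dictionary character
            )
    # Walk back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            recognized[i - 1] != dictionary_word[j - 1]
        ):
            pairs.append((recognized[i - 1], dictionary_word[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((recognized[i - 1], "-"))
            i -= 1
        else:
            pairs.append(("-", dictionary_word[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[-1][-1], pairs


def closest_entry(recognized: str, dictionary):
    """Pick the dictionary word with the smallest alignment cost."""
    return min(dictionary, key=lambda word: align(recognized, word)[0])


print(closest_entry("transcrib", ["transcribe", "transcript", "describe"]))
# transcribe
```

The same routine works on phoneme sequences if the two inputs are lists of phoneme symbols instead of strings.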
The Pros and Cons of Using Transcribe