How To Convert Speech to Text with Python [Step-by-Step Process]

Introduction to Speech to Text

We are living in an age where the ways we interact with machines have become varied and complex. We have evolved from chunky mechanical buttons to the touchscreen interface. But this evolution is not limited to hardware. The status quo for input for computers has been text since conception. Still, with advancements in NLP (Natural Language Processing) and ML (Machine Learning), we have the tools to incorporate speech as a medium to interact with our gadgets.

These tools already surround us and serve us most commonly as virtual assistants. Google, Siri, Alexa, etc. are milestone achievements in adding another more personal and convenient dimension of interacting with the digital world.

Unlike most technological innovations, speech to text technology is available for everyone to explore, both for consumption and to build your projects. 

Python is one of the most common programming languages in the world has tools to create your speech to text applications.

History of Speech to Text

Before we explore statement to text in Python, it’s worthwhile to appreciate how much progress we have made in this field. The following is the simplified timeline of the :

  • Audrey,1952: the first speech recognition system developed by 3 Bells labs researchers. It could only recognize digits.
  • IBM Showbox (1962): IBM’s first speech recognition system that coils recognize 16 words in addition to digits. Could solve simple arithmetic dictations and print the result.
  • Defense Advanced Research Projects Agency(DARPA) (1970): DARPA funded the Speech Understanding Research, which led to Harpy’s development to recognize 1011 words.
  • Hidden Markov Model(HMM), the 1980s: HMM is a statistical model that models problems requiring sequential information. This model was applied to further advancements in speech recognition. 
  • Voice search by Google,2001: Google introduced the Voice Search feature that enabled users to search using speech. This was the first voice-enabled application that became very popular.
  • Siri,2011: Apple introduced Siri that was able to perform a real-time and convenient way to interact with its devices.
  • Alexa,2014 & google home,2016: Voice command based virtual assistants became mainstream as google home and Alexa collectively sell over 150 million units.

Also Read: Top 7 Python NLP Libraries

Challenges in a Speech to Text 

Speech to text is still a complex problem that is far from being a truly finished product. Several technical difficulties make this an imperfect tool at best. The following are the common challenges with speech recognition technology:

1. Imprecise interpretation

Speech recognition doesn’t always interpret spoken words correctly. VUIs(Voice User Interface) is not as adept as humans in the understanding context that change the relationship between words and sentences. Machines thus may struggle to understand the semantics of a sentence.

2. Time

Sometimes, it takes too long for voice recognition systems to process. This may be owing to the diversity of voice patterns that humans possess. Such difficulty in voice recognition can be avoided by slowing down speech or being more precise in pronunciation, which takes away from the tool’s convenience.

3. Accents

VUIs may find it hard to comprehend dialects that differ from the average. Within the same language, speakers can have wildly different ways of speaking the same words. 

4. Background noise and loudness

In an ideal world, these won’t be a problem, but that’s simply not the case, and so VUIs may find it challenging to work in loud environments (public spaces, big offices, etc.).

Must Read: How to make a chatbot in Python

Speech to Text in Python

If one doesn’t want to go through the arduous process of building a statement to text from the ground up, use the following as a guide. This guide is merely a basic introduction to creating your very own speech to text application. Make sure you do have a functioning microphone in addition to a relatively recent version of Python.

Step 1:

Download the following python packages:

  • speech_recogntion (pip install SpeechRecogntion): This is the main package that runs the most crucial step of converting speech to text. Other alternatives have pros and cons, such as appeal, assembly, google-cloud-search, pocketsphinx, Watson-developer-cloud, wit, etc.
  • My audio (pip install Pyaudio)
  • Portaudio (pip install Portaudio)

Step 2:

Create a project (name it whatever you want), and import the speech_recogntion as sr.

Create as many instances of the recognizer class.

Step 3:

Once you have created these instances, we now have to define the source of the input.

For now, let’s define the source as the microphone itself (you could use an existing audio file)

Step 4:

We will now define a variable to store the input. We use the ‘listen’ method to take information from the source. So, in our case, we will use the microphone as a source that we established in the previous line of code.

Step 5:

Now that we have the input(microphone as source) defined and have it stored in a variable(‘audio’) we simply have to use the recognize_google method to convert it into text. We may store the result in a variable or can simply print the result. We do not have to rely solely on recognize_google, we have other methods that use different APIs that work as well. Examples of such methods are:

recognize_bing()

recongize_google_cloud()

recongize_houndify()

recongize_ibm()

recongize_Sphinx() (works offline too)

The following method used existing packages that help cut down on having to develop your speech to text recognizing software from scratch. These packages have more tools that can help you build your projects that solve more specific problems. One example of a useful feature is that you may change the default language from English to say Hindi. This will change the results that are printed into Hindi ( although as it currently stands, speech to text is most developed to understand English ).

But, it’s a good thought exercise of severe developers to understand how such software runs.

Let’s break it down.

At its most fundamental, speech is simply a sound wave. Such sound waves or audio signals have a few characteristic properties (that may seem familiar to the physics of acoustics) such as Amplitude, crest and trough, wavelength, cycle, and frequency.

Such audio signals are continuous and thus have infinite data points. To convert such an audio signal into a digital signal, such that a computer may process it, the network must take a discrete distribution of samples that closely resembles the continuity of an audio signal.

Once we have an appropriate sampling frequency (8000 Hz is a good standard as most speech frequencies are in this range ), we can now Python libraries such as LibROSA and SciPy process the audio signals. We can then build on these inputs by splitting the data set into 2, training the model, and the other to validate the model’s findings.

At this stage, one may use the model architecture of Conv1d, a convolutional neural network that performs along only one dimension. We can then build a model, define its loss function, and using neural networks to save the best model from converting speech to text. Using deep learning and NLP( Natural Language Processing ), we can refine statement to text for more extensive applications and adoption. 

Applications of Speech Recognition

As we have learned, the tools to run this technological innovation are more accessible because this is mostly a software innovation, and no one company owns it. This accessibility has opened doors for developers of limited resources to come up with their application of this technology.

Some of the fields in which speech recognition is growing are as follows:

  • Evolution in search engines: speech recognition will help improve search accuracy by filling the gap between verbal and written communication.
  • Impact on the healthcare industry: speech recognition is becoming a common feature in the medical sector by aiding the completion of medical reporting. As VUIs become better at understanding medical jargon, adopting this technology will free up time away from administrative work for doctors.
  • Service industry: In the increasing trends of automation, it may be the case that a customer cannot get a human to respond to a query, and thus, speech recognition systems can fill this gap. We will see the rapid growth of this feature in airports, public transit, etc.
  • Service providers: telecommunication providers may rely even more on speech to text-based systems that can reduce wait times by helping establish caller’s demands and directing them to the appropriate assistance.  

Also Read: Voice Search Technology – Interesting Facts

Conclusion

Speech to text is a powerful technology that will soon be ubiquitous. Its reasonably straightforward usability in conjunction with Python (one of the most popular programming languages in the world) makes creating its applications easier. As we make strides in this field, we are paving the path to a world where access to the digital world is not just fingertipped away but also a spoken word.

If you are interested to know more about natural language processing, check out our PG Diploma in Machine Learning and AI program which is designed for working professionals and more than 450 hours of rigorous training.

Lead the AI Driven Technological Revolution

PG DIPLOMA IN MACHINE LEARNING AND ARTIFICIAL INTELLIGENCE
Enroll Now @ upGrad

Leave a comment

Your email address will not be published. Required fields are marked *

Our Popular Machine Learning Course

Accelerate Your Career with upGrad

×