How To Convert Speech to Text with Python [Step-by-Step Process]

Last updated: 7th Aug, 2020

Introduction to Speech to Text

We live in an age where the ways we interact with machines have become varied and complex. We have moved from chunky mechanical buttons to touchscreen interfaces, but this evolution is not limited to hardware. Text has been the default input for computers since their conception. Now, with advances in NLP (Natural Language Processing), ML (Machine Learning), and data science, we have the tools to use speech as a medium for interacting with our gadgets.

These tools already surround us, most commonly as virtual assistants. Google Assistant, Siri, Alexa, and others are milestone achievements that add a more personal and convenient dimension to interacting with the digital world.

Unlike many technological innovations, speech to text technology is available for everyone to explore, both to use and to build into your own projects.

Python, one of the most common programming languages in the world, has the tools you need to create your own speech to text applications.

History of Speech to Text

Before we explore speech to text in Python, it’s worthwhile to appreciate how much progress has been made in this field. The following is a simplified timeline of the technology:

  • Audrey, 1952: the first speech recognition system, developed by three Bell Labs researchers. It could only recognize digits.
  • IBM Shoebox, 1962: IBM’s first speech recognition system, which could recognize 16 words, including the digits. It could solve simple dictated arithmetic problems and print the result.
  • Defense Advanced Research Projects Agency (DARPA), 1970s: DARPA funded the Speech Understanding Research program, which led to the development of Harpy, a system that could recognize 1,011 words.
  • Hidden Markov Model (HMM), 1980s: the HMM is a statistical model suited to problems that involve sequential information. It was applied to further advancements in speech recognition.
  • Voice Search by Google, 2001: Google introduced the Voice Search feature, which enabled users to search using speech. This was the first voice-enabled application that became very popular.
  • Siri, 2011: Apple introduced Siri, which offered a real-time and convenient way to interact with its devices by voice.
  • Alexa, 2014 and Google Home, 2016: voice-command-based virtual assistants became mainstream, with Google Home and Alexa devices collectively selling over 150 million units.

Challenges in Speech to Text

Speech to text is still a complex problem that is far from being a truly finished product. Several technical difficulties make this an imperfect tool at best. The following are the common challenges with speech recognition technology:

1. Imprecise interpretation

Speech recognition doesn’t always interpret spoken words correctly. VUIs (Voice User Interfaces) are not as adept as humans at understanding the context that changes the relationship between words and sentences. Machines may thus struggle to understand the semantics of a sentence.

2. Time

Sometimes voice recognition systems take too long to process speech. This may be owing to the diversity of human voice patterns. Users can work around the problem by speaking more slowly or pronouncing words more precisely, but that takes away from the tool’s convenience.

3. Accents

VUIs may find it hard to comprehend dialects that differ from the average. Within the same language, speakers can have wildly different ways of speaking the same words. 

4. Background noise and loudness

In an ideal world, these wouldn’t be a problem, but that is simply not the case, so VUIs may find it challenging to work in loud environments (public spaces, big offices, etc.).

Speech to Text in Python

If you don’t want to go through the arduous process of building a speech to text system from the ground up, use the following as a guide. It is merely a basic introduction to creating your very own speech to text application. Make sure you have a functioning microphone and a relatively recent version of Python.

Step 1:

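Install the following Python packages:

  • speech_recognition (pip install SpeechRecognition): This is the main package, which runs the most crucial step of converting speech to text. Alternatives with their own pros and cons include apiai, assemblyai, google-cloud-speech, pocketsphinx, watson-developer-cloud, wit, etc.
  • PyAudio (pip install PyAudio): needed to read input from the microphone.
  • PortAudio: the native audio library that PyAudio depends on. It is not a Python package, so if your PyAudio installation does not already include it, install it with your operating system’s package manager rather than with pip.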

Step 2:

Create a project (name it whatever you want), and import speech_recognition as sr.

Create an instance of the Recognizer class (you can create as many instances as you need).

Step 3:

Once you have created the instance, define the source of the input.

For now, let’s define the source as the microphone itself (you could also use an existing audio file).

Step 4:

We will now define a variable to store the input. We use the ‘listen’ method to take information from the source. So, in our case, we will use the microphone as a source that we established in the previous line of code.
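
Steps 2 to 4 can be sketched in a few lines. This is a minimal sketch, assuming the SpeechRecognition and PyAudio packages are installed and a default microphone is available. The function name listen_once, the 5-second timeout, and the one-second noise calibration are choices made for this illustration, not part of the guide itself.

```python
def listen_once(timeout=5):
    """Record one phrase from the default microphone and return it as AudioData."""
    # Imported inside the function so this file loads even before the package is installed.
    import speech_recognition as sr  # pip install SpeechRecognition

    recognizer = sr.Recognizer()        # Step 2: create a recognizer instance
    with sr.Microphone() as source:     # Step 3: the microphone is the input source
        # Sample one second of background sound to calibrate the energy threshold.
        recognizer.adjust_for_ambient_noise(source, duration=1)
        # Step 4: wait up to `timeout` seconds for speech to start, then record
        # until the speaker pauses. Raises sr.WaitTimeoutError if nobody speaks.
        return recognizer.listen(source, timeout=timeout)
```

Calling listen_once() and storing the result in a variable named audio gives you the input that the recognizer methods in Step 5 accept.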

Step 5:

Now that we have the input defined (the microphone as source) and stored in a variable (‘audio’), we simply have to use the recognize_google method to convert it into text. We may store the result in a variable or simply print it. We do not have to rely solely on recognize_google; other methods that use different APIs work as well. One such method is:

recognize_sphinx() (works offline too)
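
Step 5 might look like the sketch below. It assumes that recognizer is an sr.Recognizer instance and that audio is the AudioData captured in Step 4. The function name transcribe and the fallback order (Google's web API first, offline Sphinx second) are assumptions made for illustration, and the Sphinx recognizer additionally needs the pocketsphinx package.

```python
def transcribe(recognizer, audio):
    """Return the recognized text, or None if the speech could not be understood."""
    # Imported inside the function so this file loads even before the package is installed.
    import speech_recognition as sr

    try:
        # Sends the audio to Google's web speech API (needs an internet connection).
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return None  # the speech was unintelligible
    except sr.RequestError:
        pass  # the API could not be reached: fall through to the offline recognizer

    try:
        return recognizer.recognize_sphinx(audio)  # works offline, requires pocketsphinx
    except sr.UnknownValueError:
        return None
```

Returning None for unintelligible speech, rather than raising, is a design choice for this sketch; it lets the caller decide whether to ask the user to repeat the phrase.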

The method above uses existing packages, which saves you from having to develop your own speech to text software from scratch. These packages offer more tools that can help you build projects that solve more specific problems. One useful feature is that you can change the recognition language from the default, English, to, say, Hindi. The printed results will then be in Hindi (although, as it currently stands, speech to text is most developed for English).
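
As a sketch of the language option, the function below transcribes an existing audio file instead of the microphone and asks for Hindi. The function name transcribe_file and any file path you pass to it are hypothetical. The tag "hi-IN" identifies Hindi as spoken in India and is passed through the language parameter of recognize_google, which defaults to "en-US".

```python
def transcribe_file(path, language="hi-IN"):
    """Transcribe a WAV, AIFF, or FLAC file in the given language."""
    # Imported inside the function so this file loads even before the package is installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:     # an audio file is the source this time
        audio = recognizer.record(source)  # read the whole file into AudioData
    # The language tag tells the API which language to expect in the audio.
    return recognizer.recognize_google(audio, language=language)
```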

Still, it is a good thought exercise for serious developers to understand how such software works.

Let’s break it down.

At its most fundamental, speech is simply a sound wave. Such sound waves, or audio signals, have a few characteristic properties (familiar from the physics of acoustics), such as amplitude, crest and trough, wavelength, cycle, and frequency.

Such audio signals are continuous and thus have infinitely many data points. To convert an audio signal into a digital signal that a computer can process, we must sample it: take a discrete series of measurements at regular intervals that closely approximates the continuous signal.
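
To make sampling concrete, here is a small sketch that uses only Python's standard library. It samples a pure 440 Hz tone at 8000 samples per second. The tone is a stand-in for a real voice signal, and the function name and parameter values are chosen for this example only.

```python
import math


def sample_sine(frequency_hz, sample_rate_hz, duration_s, amplitude=1.0):
    """Return discrete samples of a continuous sine wave."""
    n_samples = round(sample_rate_hz * duration_s)
    return [
        # Sample number n is taken at time n / sample_rate_hz seconds.
        amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


samples = sample_sine(frequency_hz=440, sample_rate_hz=8000, duration_s=0.01)
print(len(samples))  # 80 numbers now describe 10 ms of sound
```

A higher sample rate gives more numbers per second and a closer approximation of the original wave, at the cost of more data to process.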

Once we have an appropriate sampling frequency (8000 Hz is a common choice, since it captures frequencies up to 4000 Hz, the range in which most speech energy lies), we can use Python libraries such as LibROSA and SciPy to process the audio signals. We can then build on these inputs by splitting the data set into two parts: one to train the model and the other to validate the model’s findings.
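
The split into training and validation data can be sketched with the standard library alone. The clip names and labels below are made up for illustration, and the 80/20 ratio and the fixed seed are assumptions rather than values from this article.

```python
import random


def train_validation_split(items, validation_fraction=0.2, seed=42):
    """Shuffle the items and split them into (training, validation) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical data set: (file name, spoken word) pairs.
clips = [(f"clip_{i:03d}.wav", "yes" if i % 2 else "no") for i in range(10)]
train, validation = train_validation_split(clips)
print(len(train), len(validation))  # 8 2
```

Shuffling before splitting matters because recordings are often stored grouped by speaker or by word, and an unshuffled split could leave the validation set with examples the model has never seen anything like.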

At this stage, one may use a model architecture built on Conv1D, a convolutional neural network layer that performs convolution along only one dimension (here, time). We can then build the model, define its loss function, train the network, and save the best-performing model for converting speech to text. Using deep learning and NLP (Natural Language Processing), we can refine speech to text for more extensive applications and adoption.
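
To show what convolving along one dimension means, here is a plain-Python sketch of the sliding-window operation at the heart of a Conv1D layer. It is not a neural network: in a real layer the kernel values are learned during training, whereas the kernel below is hand-picked for illustration, and the function name conv1d is our own.

```python
def conv1d(signal, kernel, stride=1):
    """Slide the kernel along the signal and return one dot product per position."""
    k = len(kernel)
    return [
        sum(signal[start + i] * kernel[i] for i in range(k))
        for start in range(0, len(signal) - k + 1, stride)
    ]


signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
edge_kernel = [-1.0, 0.0, 1.0]  # positive where amplitude rises, negative where it falls
print(conv1d(signal, edge_kernel))  # [2.0, 2.0, 0.0, -2.0, -2.0]
```

The output is shorter than the input because the kernel is only applied where it fits entirely inside the signal. A trained network stacks many such kernels so that each one responds to a different short pattern in the audio.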

Applications of Speech Recognition

As we have learned, the tools behind this technology are widely accessible because it is mostly a software innovation, and no one company owns it. This accessibility has opened doors for developers with limited resources to come up with their own applications of the technology.

Some of the fields in which speech recognition is growing are as follows:

  • Evolution in search engines: speech recognition will help improve search accuracy by bridging the gap between verbal and written communication.
  • Impact on the healthcare industry: speech recognition is becoming a common feature in the medical sector, where it aids the completion of medical reports. As VUIs become better at understanding medical jargon, adopting this technology will free doctors from time spent on administrative work.
  • Service industry: with increasing automation, a customer may not be able to get a human to respond to a query, and speech recognition systems can fill this gap. We will see rapid growth of this feature in airports, public transit, etc.
  • Service providers: telecommunication providers may rely even more on speech to text-based systems, which can reduce wait times by helping establish callers’ needs and directing them to the appropriate assistance.

Speech to text is a powerful technology that will soon be ubiquitous. Its reasonably straightforward usability in conjunction with Python (one of the most popular programming languages in the world) makes creating applications easier. As we make strides in this field, we are paving the path to a world where access to the digital world is not just a fingertip away but also a spoken word away.


Pavan Vadapalli

Blog Author
Director of Engineering @ upGrad. Motivated to leverage technology to solve problems. Seasoned leader for startups and fast moving orgs. Working on solving problems of scale and long term technology strategy.

Frequently Asked Questions (FAQs)

1. What is speech to text conversion?

In the early days, a transcriptionist sat with a headset and typed out recorded speech. The process took a long time and produced low-quality transcripts. Today, speech recognition systems use computers to convert speech to text. Speech recognition (also known as speech-to-text conversion) is the process of converting spoken words into machine-readable data. The purpose is to allow people to communicate with machines by voice; the reverse direction, in which machines communicate with people by producing speech, is text-to-speech. Speech-to-text software is used to perform the conversion.

2. What are the challenges in speech to text conversion?

There are many challenges in speech to text conversion. The main ones are:

  • Accuracy: the system has to get the spoken words right in order to extract the user's intent.
  • Speed: the system needs to do so fast enough to be acceptable to the user.
  • Naturalness: the system should accept natural speech, so that users don't feel they have to speak in an unnatural manner.
  • Robustness: the system should be able to handle a large amount of background noise, other speech, and any other effects that may interfere with the conversion process.

3. What are the applications of speech to text processing?

Converting speech into text is useful because speaking is a very fast and convenient way to communicate. Speech to text processing can be used in many different applications. For example, on a mobile device, users can use their voice to send messages and make calls instead of typing on the keyboard. Another application is machine control: operating an engine or other industrial machine by speaking to it.
