Basic Knowledge of Data Science: What to Learn for a Beginner

From WikiOD

Data science is popular and is now used everywhere, from predicting demand for products in a store to driving autonomous vehicles. The concept is broad and includes mathematics, statistics, programming, and machine learning. Experts in the field analyze large amounts of data to find relationships and make predictions.

What is the minimum you need to know for such work? In general, all of the following skills can be learned by searching Google or by taking courses on one of many platforms; for example, Kaggle offers mini-courses.

Python

Python is a very popular programming language that is easy to learn. You need to know the basics of the language: functions, data structures, and classes (OOP). You will also need Jupyter Notebook and Git. Additional libraries can be installed relatively quickly and easily, although difficulties may arise with the most recent versions of Python, which some libraries may not yet support.
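As a rough illustration of the basics listed above, the sketch below combines a function, built-in data structures (a list of dictionaries and a Counter), and a small class. The names Dataset, label_counts, and most_common_label are hypothetical and exist only for this example.

```python
from collections import Counter


class Dataset:
    """A tiny container for labelled records (hypothetical example class)."""

    def __init__(self, records):
        # records: an iterable of dicts, each with a "label" key
        self.records = list(records)

    def label_counts(self):
        # Counter is a dict subclass that tallies hashable items
        return Counter(record["label"] for record in self.records)


def most_common_label(dataset):
    """Return the label that occurs most often in the dataset."""
    return dataset.label_counts().most_common(1)[0][0]


data = Dataset([
    {"label": "cat", "weight": 4.2},
    {"label": "dog", "weight": 9.1},
    {"label": "cat", "weight": 3.8},
])
top_label = most_common_label(data)
print(top_label)
```

The same few ideas (a class that holds data, a method that summarizes it, a plain function that uses both) recur in most data science code.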

Course: Kaggle

Numpy / Scipy

You need to know the basics of these libraries and can study them in more depth as needed.

If you need a vectorized approach or good performance when computing on large arrays, functions from Numpy are the best fit; for sparse matrices, Scipy provides the scipy.sparse module. Note that using Numpy / Scipy sometimes requires converting your data into the native array type (ndarray). Many Python libraries for data science, such as Sklearn, depend on Numpy. The math functions built into Numpy are often convenient because they are well optimized for numerical computations on matrices.
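A minimal sketch of these points, assuming made-up price values: a Python list is converted to an ndarray, operations are applied to the whole array at once instead of in a loop, and a mostly-zero matrix is stored in a sparse format with scipy.sparse.

```python
import numpy as np
from scipy import sparse

# Convert a plain Python list into Numpy's native array type (ndarray)
prices = [19.99, 5.49, 102.00, 7.25]  # illustrative values
arr = np.asarray(prices, dtype=np.float64)

# Vectorized operations act on every element without an explicit loop
discounted = arr * 0.9
log_prices = np.log(arr)
mean_price = arr.mean()

# A matrix that is mostly zeros can be stored compactly as a sparse matrix
dense = np.array([[0, 0, 3],
                  [0, 0, 0],
                  [1, 0, 0]])
sparse_matrix = sparse.csr_matrix(dense)

print(discounted)
print(sparse_matrix.nnz)  # number of explicitly stored (non-zero) entries
```

The CSR format used here is only one of several sparse layouts in Scipy; which one fits best depends on the operations you need.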

Pandas

The Pandas library is almost always needed because data has to be read and processed. Pandas is a powerful, easy-to-use open-source tool. With Pandas you can also run complex queries across several tables (from files or databases) to create a new table in the desired form, either for training a model or for visualization. Pandas is often used for initial data processing and for mathematical and statistical calculations.

Almost every company has a database, so if you need to get data directly from a SQL database, it helps to know the basics of SQL queries and how to execute them from a Python script.
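As a small sketch of combining several tables into a new one, the example below joins two in-memory tables and aggregates the result. The table contents and column names are made up for illustration; in practice the tables would usually be read from files or a database.

```python
import pandas as pd

# Two illustrative tables; real data would typically come from pd.read_csv or similar
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 11, 10, 12],
    "amount": [25.0, 40.0, 15.0, 60.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "city": ["Oslo", "Lima", "Oslo"],
})

# Join the tables on their shared key, keeping every order
merged = orders.merge(customers, on="customer_id", how="left")

# Aggregate into a new table: total order amount per city
per_city = (
    merged.groupby("city", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total_amount"})
)
print(per_city)
```

The resulting per_city table is the kind of derived table that can be passed on to a model or a plot.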
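One possible way to run a SQL query from a Python script is sketched below. It assumes an in-memory SQLite database with a made-up sales table so that the example is self-contained; a company database would need its own driver and connection details.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a hypothetical "sales" table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("apple", 3), ("pear", 5), ("apple", 2)],
)
conn.commit()

# A basic aggregating query, loaded straight into a DataFrame
query = """
    SELECT product, SUM(quantity) AS total
    FROM sales
    GROUP BY product
    ORDER BY product
"""
totals = pd.read_sql_query(query, conn)
conn.close()

print(totals)
```

Doing the aggregation in SQL keeps the amount of data transferred to Python small, which matters once tables get large.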

Course: Pandas

Matplotlib / Seaborn

And, of course, you cannot do without data visualization. Matplotlib is widely used for this and can plot almost any type of graph. It is especially useful for custom graphs built from values that you compute yourself. If you need to build a graph from ready-made data and do not want to spend 10-20 lines customizing the display style, you can use Seaborn to do the same in 1-2 lines with a preconfigured style. A clear graph can help you solve a problem or discuss it with customers.
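The Matplotlib side of this comparison might look like the sketch below, which plots a computed curve and sets the labels by hand. The Agg backend and the in-memory PNG buffer are assumptions made so that the example runs without a display; in a Jupyter Notebook the figure would simply be shown inline.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Values computed in the script itself
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# Build the figure and style it step by step
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Computed curve")
ax.legend()
fig.tight_layout()

# Render the figure to PNG bytes in memory instead of writing a file
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```

With Seaborn, a roughly comparable styled plot could be produced with sns.set_theme() followed by sns.lineplot(x=x, y=y), assuming Seaborn is installed and imported as sns.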

Sklearn

To create analytical models, you will need knowledge of traditional machine learning (ML) algorithms: linear and logistic regression, decision trees, and clustering.

Perhaps the most popular ML library for Python is scikit-learn (sklearn). The documentation is very detailed and clear, and the APIs in the library are relatively simple and easy to use. For example, to train a model, call clf.fit(X_train, y_train), then predict with clf.predict(X_test). You can also use Pipeline to reuse all of the data processing steps. The library even includes elements of deep learning (artificial neural networks), such as the Multi-layer Perceptron (MLP).
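A minimal sketch of the fit / predict workflow with a Pipeline is shown below. The choice of the bundled iris dataset, StandardScaler, LogisticRegression, and the split parameters are assumptions made for illustration, not a recommendation for any particular problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small example dataset that ships with scikit-learn
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The pipeline bundles preprocessing and the model, so the same
# steps are applied consistently at training and prediction time
clf = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

clf.fit(X_train, y_train)            # train on the training split
y_pred = clf.predict(X_test)         # predict from test features, not labels
accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Note that predict takes the feature matrix X_test; the labels y_test are used only afterwards, to evaluate the predictions.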

Conclusion

This minimum of knowledge is enough to get started, but the work will become more difficult over time, and you will need to acquire new knowledge or deepen what you already know in a specific area. It is also useful to take a course on algorithms, because they come in handy at interviews, in competitions, and sometimes at work.