Data Analysis with Python





1. Basics of Data Analysis with Python

Libraries to Use:

  • pandas: Data manipulation and analysis.
  • numpy: Numerical computations.
  • matplotlib & seaborn: Data visualization.
  • scipy & statsmodels: Statistical analysis.
  • scikit-learn: Machine learning for analysis.

2. Common Data Analysis Tasks & Examples

Loading Data

```python
import pandas as pd

# Read CSV file
data = pd.read_csv("data.csv")
print(data.head())  # Display first 5 rows
```
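
To try the loading step without a file on disk, the sketch below passes an in-memory CSV to `pd.read_csv`. The CSV content, the `JoinDate` column, and the use of `parse_dates` are assumptions made for illustration, not part of the original example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = """Name,Age,Department,Salary,JoinDate
Ana,34,Sales,52000,2021-03-01
Ben,28,IT,61000,2022-07-15
Cy,45,Sales,,2019-11-30
"""

# read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_text), parse_dates=["JoinDate"])

print(data.head())   # first rows (at most 5)
print(data.dtypes)   # Salary becomes float64 because one value is missing
```

An empty field is read as NaN, which is why a column of whole numbers can end up with a float dtype.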

Exploring Data

  • View dimensions: data.shape (rows, columns).
  • Data types: data.dtypes.
  • Descriptive stats: data.describe().

        data.info()                  # Overview of columns; info() prints its report itself
        print(data['Age'].mean())    # Calculate mean of Age column
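
The exploration calls above can be tried on a small table. This is a minimal sketch on made-up data; the column names mirror the ones used in this guide, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data
data = pd.DataFrame({
    "Age": [25, 32, 47, 51],
    "Department": ["IT", "Sales", "IT", "HR"],
    "Salary": [48000.0, 52000.0, None, 75000.0],
})

rows, cols = data.shape          # number of rows and columns
missing = data.isna().sum()      # count of missing values per column
mean_age = data["Age"].mean()    # average of the Age column

data.info()                      # prints column names, dtypes and non-null counts
print(rows, cols, mean_age)
print(missing)
print(data.describe(include="all"))  # include="all" also summarizes text columns
```

Counting missing values per column with `isna().sum()` is a quick way to decide which cleaning step is needed next.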

Cleaning Data

  • Handling missing values (pick one strategy; after dropna there is nothing left for fillna to fill):

        data.dropna(inplace=True)      # Remove rows with missing data
        data.fillna(0, inplace=True)   # Replace missing values with 0

  • Renaming or dropping columns:

        data.rename(columns={"OldName": "NewName"}, inplace=True)
        data.drop("UnwantedColumn", axis=1, inplace=True)
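
Filling every gap with 0 can distort averages, so a per-column fill is often a gentler choice. The sketch below is one possible approach on hypothetical data: it fills numeric gaps with each column's median and drops rows only when a chosen column is missing. Whether the median is appropriate depends on the dataset.

```python
import pandas as pd

# Hypothetical data with gaps in two columns
data = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy", "Di"],
    "Age": [34.0, None, 45.0, 29.0],
    "Salary": [52000.0, 61000.0, None, 58000.0],
})

# Fill each numeric column with its own median instead of a blanket 0
filled = data.fillna({
    "Age": data["Age"].median(),
    "Salary": data["Salary"].median(),
})

# Drop a row only when Salary is missing
has_salary = data.dropna(subset=["Salary"])

# Both calls return new DataFrames, so `data` itself is unchanged
print(filled)
print(has_salary)
```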

Filtering and Sorting

```python
# Filter rows where Age > 30
filtered_data = data[data['Age'] > 30]

# Sort by a column
sorted_data = data.sort_values(by='Salary', ascending=False)
```
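
Filters are often combined. The following sketch, on hypothetical data, shows multiple conditions joined with `&`, membership tests with `isin`, and sorting by more than one column.

```python
import pandas as pd

# Hypothetical employee table
data = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy", "Di"],
    "Age": [34, 28, 45, 31],
    "Department": ["Sales", "IT", "Sales", "IT"],
    "Salary": [52000, 61000, 70000, 58000],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
senior_sales = data[(data["Age"] > 30) & (data["Department"] == "Sales")]

# isin() keeps rows whose value appears in the given list
it_or_hr = data[data["Department"].isin(["IT", "HR"])]

# Sort by Department A-Z, then by Salary from high to low within each department
ranked = data.sort_values(by=["Department", "Salary"], ascending=[True, False])

print(senior_sales)
print(it_or_hr)
print(ranked)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.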

Aggregation

```python
# Group by and calculate the mean
grouped_data = data.groupby('Department')['Salary'].mean()
print(grouped_data)
```
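
When more than one statistic per group is needed, pandas' named aggregation can return them in a single table. This is a small sketch on hypothetical data; the output column names (`headcount`, `mean_salary`, `max_salary`) are made up for the example.

```python
import pandas as pd

# Hypothetical salary data
data = pd.DataFrame({
    "Department": ["Sales", "IT", "Sales", "IT", "HR"],
    "Salary": [52000, 61000, 70000, 58000, 49000],
})

# Named aggregation: one output column per (source column, function) pair
summary = data.groupby("Department").agg(
    headcount=("Salary", "count"),    # number of non-missing salaries
    mean_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
)
print(summary)
```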


3. Visualizing Data

Basic Visualizations

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Line plot
plt.figure()  # start a new figure so the plots do not overlap
data['Sales'].plot(kind='line')

# Scatter plot
plt.figure()
sns.scatterplot(x='Age', y='Salary', data=data)

# Histogram
plt.figure()
data['Salary'].plot(kind='hist', bins=10)

plt.show()
```
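
An alternative to separate figures is to place the plots side by side in one figure. The sketch below assumes hypothetical data and uses only pandas plotting on top of matplotlib; the panel titles and figure size are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data
data = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38],
    "Salary": [48000, 52000, 69000, 75000, 58000],
    "Sales": [10, 14, 9, 17, 13],
})

# One figure with three axes: each plot gets its own panel
fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

data["Sales"].plot(kind="line", ax=axes[0], title="Sales")
data.plot(kind="scatter", x="Age", y="Salary", ax=axes[1], title="Age vs. Salary")
data["Salary"].plot(kind="hist", bins=5, ax=axes[2], title="Salary distribution")

fig.tight_layout()  # avoid overlapping labels
plt.show()
```

Passing `ax=` tells pandas which panel to draw on, which keeps the three plots from landing on the same axes.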

Advanced Visualizations

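```python
# Heatmap of correlations (numeric columns only; numeric_only needs pandas 1.5+)
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')

# Box plot
plt.figure()  # separate figure from the heatmap
sns.boxplot(x='Department', y='Salary', data=data)

plt.show()
```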
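
Both plots are drawn from numbers that can be inspected directly. As a rough sketch on hypothetical data, the code below computes the correlation matrix a heatmap would display and the per-department quartiles a box plot would draw.

```python
import pandas as pd

# Hypothetical sample data with one text column
data = pd.DataFrame({
    "Department": ["IT", "Sales", "IT", "HR", "Sales"],
    "Age": [25, 32, 47, 51, 38],
    "Salary": [48000, 52000, 69000, 75000, 58000],
    "Visits": [9, 7, 4, 2, 6],
})

# Keep numeric columns only; text columns such as Department cannot be correlated
corr = data.select_dtypes(include="number").corr()

# Box-plot ingredients: 25th, 50th and 75th percentile of Salary per department
quartiles = (
    data.groupby("Department")["Salary"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(corr.round(2))
print(quartiles)
```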


4. Common Formulas in Data Analysis

  • Mean: data['column_name'].mean()
  • Median: data['column_name'].median()
  • Standard Deviation: data['column_name'].std()
  • Correlation: data.corr(numeric_only=True) (pairwise correlation of the numeric columns)
  • Linear Regression (via scipy):

        from scipy.stats import linregress

        slope, intercept, r_value, p_value, std_err = linregress(data['X'], data['Y'])
        print(f"Slope: {slope}, R-squared: {r_value**2}")
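
To see what these formulas compute, the sketch below reproduces them by hand on hypothetical data that lies exactly on the line Y = 2X + 1. It relies on two facts worth knowing: pandas' `.std()` is the sample standard deviation (it divides by n - 1, whereas `numpy.std` divides by n by default), and the least-squares slope for one predictor equals cov(X, Y) / var(X).

```python
import numpy as np
import pandas as pd

# Hypothetical data lying exactly on Y = 2X + 1
data = pd.DataFrame({
    "X": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Y": [3.0, 5.0, 7.0, 9.0, 11.0],
})
x, y = data["X"], data["Y"]

# Sample standard deviation computed by hand (divides by n - 1)
sample_std = np.sqrt(((x - x.mean()) ** 2).sum() / (len(x) - 1))

# Least-squares slope and intercept for a single predictor
slope = x.cov(y) / x.var()
intercept = y.mean() - slope * x.mean()

# Pearson correlation; its square is R-squared for a one-variable fit
r = x.corr(y)

print(sample_std, x.std())       # the two values agree
print(slope, intercept, r ** 2)  # close to 2.0, 1.0 and 1.0
```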

5. Specific Situations

Analyzing Sales Data

  • Use groupby to analyze sales trends by region or product.

        sales_by_region = data.groupby('Region')['Sales'].sum()
        print(sales_by_region)
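
To break sales down by region and product at the same time, a pivot table is one option. This is a minimal sketch on hypothetical data; the `Product` column is an assumption, since the original example only names `Region` and `Sales`.

```python
import pandas as pd

# Hypothetical sales records
data = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "South"],
    "Product": ["A", "B", "A", "A", "B"],
    "Sales": [100, 150, 80, 120, 60],
})

# Total sales per region, largest first
sales_by_region = data.groupby("Region")["Sales"].sum().sort_values(ascending=False)

# Regions as rows, products as columns, summed sales in the cells
region_by_product = data.pivot_table(
    index="Region", columns="Product", values="Sales",
    aggfunc="sum", fill_value=0,
)

# Each region's share of overall sales
share = sales_by_region / sales_by_region.sum()

print(sales_by_region)
print(region_by_product)
print(share.round(3))
```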

Detecting Outliers

  • Use box plots or statistical methods such as the IQR (interquartile range) rule, which flags values more than 1.5 × IQR below Q1 or above Q3.

        Q1 = data['Salary'].quantile(0.25)
        Q3 = data['Salary'].quantile(0.75)
        IQR = Q3 - Q1
        outliers = data[(data['Salary'] < Q1 - 1.5 * IQR) | (data['Salary'] > Q3 + 1.5 * IQR)]
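
The IQR rule can be wrapped in a small helper so it is reusable across columns. The function name `iqr_bounds` and the sample values are hypothetical; capping with `clip` is shown as one alternative to dropping outliers, not as a recommendation for every dataset.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical salaries (in thousands) with one extreme value
salaries = pd.Series([48, 50, 52, 55, 58, 60, 62, 200], name="Salary")

lower, upper = iqr_bounds(salaries)
outliers = salaries[(salaries < lower) | (salaries > upper)]

# Alternative to dropping: cap extreme values at the fences
capped = salaries.clip(lower=lower, upper=upper)

print(lower, upper)
print(outliers)
print(capped)
```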

Customer Segmentation

  • Cluster customers by spending habits using scikit-learn.

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=0)  # random_state makes the clusters reproducible
data['Cluster'] = kmeans.fit_predict(data[['Spending', 'Visits']])
```
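
K-means is distance-based, so a feature with large values (such as spending) can dominate one with small values (such as visit counts). A common remedy is to standardize the features first. The sketch below assumes hypothetical customer data with two obvious groups and is not the only reasonable preprocessing choice.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: low spenders and high spenders
data = pd.DataFrame({
    "Spending": [120, 150, 130, 900, 950, 1000],
    "Visits": [2, 3, 2, 20, 22, 25],
})

# Put both features on the same scale so Spending does not dominate the distance
features = StandardScaler().fit_transform(data[["Spending", "Visits"]])

# random_state fixes the initialization; n_init is set explicitly because
# its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
data["Cluster"] = kmeans.fit_predict(features)

# Average profile of each cluster
print(data.groupby("Cluster")[["Spending", "Visits"]].mean())
```

Cluster labels (0, 1, ...) are arbitrary identifiers, so it is the per-cluster averages that give each segment its meaning.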

Time Series Analysis

  • Analyze trends or forecast future data using statsmodels or pandas.

        data['Date'] = pd.to_datetime(data['Date'])
        data.set_index('Date', inplace=True)
        data['Sales'].plot()
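
Once the date column is the index, pandas can aggregate and smooth the series directly. This is a minimal sketch on hypothetical daily sales; the window length of 7 and the monthly frequency are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical daily sales for 60 days starting 2024-01-01
dates = pd.date_range("2024-01-01", periods=60, freq="D")
data = pd.DataFrame({"Date": dates, "Sales": range(100, 160)})
data = data.set_index("Date")

# Monthly totals ("MS" labels each bin with the first day of the month)
monthly = data["Sales"].resample("MS").sum()

# 7-day rolling mean smooths day-to-day noise; the first 6 values are NaN
rolling = data["Sales"].rolling(window=7).mean()

# Day-over-day relative change
pct = data["Sales"].pct_change()

print(monthly)
print(rolling.tail())
print(pct.head())
```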

6. Resources to Practice

