The article above is one of many. Covid-19 has affected our lives greatly, and it has hit many people's sources of income even harder, as if things were not difficult enough already.
As an engineering student currently in my 2nd year, I will be sitting for internships soon, so it would be great to know which skills are trending in the tech industry to boost my chances of getting a good internship and, eventually, a good job. In this project, I predict salaries based on the skills, company names, requirements and company ratings posted on Indeed (a leading job search site), which I scraped using the BeautifulSoup library.
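
As a rough illustration of the scraping step, here is a minimal sketch that parses job cards with BeautifulSoup. The HTML snippet, the class names (`job`, `title`, `company`, `rating`, `salary`) and the helper names are all hypothetical; the real Indeed markup is different and changes over time, so this is not the notebook's actual scraper.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only: not Indeed's real page structure.
SAMPLE_HTML = """
<div class="job">
  <h2 class="title">Python Developer</h2>
  <span class="company">Acme Labs</span>
  <span class="rating">4.1</span>
  <span class="salary">₹3,00,000 - ₹5,00,000 a year</span>
</div>
<div class="job">
  <h2 class="title">Front End Engineer</h2>
  <span class="company">Globex</span>
</div>
"""

# Assumed mapping from output field name to CSS selector inside a job card.
FIELDS = {"title": ".title", "company": ".company", "rating": ".rating", "salary": ".salary"}


def text_or_none(card, selector):
    """Return the stripped text of the first match, or None if the field is absent."""
    node = card.select_one(selector)
    return node.get_text(strip=True) if node else None


def parse_jobs(html):
    """Turn every job card in the page into a dict of (possibly missing) fields."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {name: text_or_none(card, selector) for name, selector in FIELDS.items()}
        for card in soup.select("div.job")
    ]


jobs = parse_jobs(SAMPLE_HTML)
for job in jobs:
    print(job)
```

Returning `None` for absent fields keeps postings without a rating or salary in the dataset, so they can be counted as missing values later instead of being silently dropped.
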
First, there are a lot of missing values, especially in the target variable<br><br>
<br><br>
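
Postings that do not state a salary are what produce the missing target values. The sketch below shows one way a scraped salary string could be converted to an average annual figure, returning `None` when nothing usable is stated. It assumes text shaped like "₹20,000 - ₹30,000 a month"; the function name, the regular expressions and the period table are illustrative assumptions, not the notebook's code.

```python
import re

# Assumed multipliers for converting a stated pay period to a yearly figure.
PERIODS_PER_YEAR = {"year": 1, "month": 12}


def avg_annual_salary(text):
    """Return the midpoint of the stated salary (or range) per year, or None."""
    if not text:
        return None
    period = re.search(r"a (year|month)", text)
    # Amounts may use Indian digit grouping, e.g. "4,50,000".
    amounts = [int(a.replace(",", "")) for a in re.findall(r"\d[\d,]*", text)]
    if period is None or not amounts:
        return None
    bounds = amounts[:2]  # a single amount, or the low and high end of a range
    midpoint = sum(bounds) / len(bounds)
    return midpoint * PERIODS_PER_YEAR[period.group(1)]


print(avg_annual_salary("₹20,000 - ₹30,000 a month"))
print(avg_annual_salary("₹4,50,000 a year"))
print(avg_annual_salary(None))
```
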
Let's look at the salary distribution<br><br>
<br>
Clearly the salary distribution is not uniform, with most annual salaries below Rs.1000000 and a few high salaries<br><br>
Let's look at the income categories the salaries fall into<br><br>
<br><br>
Most annual incomes are in the range of 1 to 5 lpa (lakhs per annum)<br>
As the income category distribution and the avg_annual_sal distribution both show, the salary distribution is heavily skewed: most people are paid close to the average, which is fairly low, and only a few get really high salaries<br>
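
The binning and the skew can be sketched in a few lines. The bin edges, category labels and salary figures below are made up for illustration; the notebook's actual categories may differ. Comparing the mean with the median is only a quick indicator of right skew, not a formal test.

```python
from collections import Counter
from statistics import mean, median

LAKH = 100_000

# Hypothetical bin edges in lakhs per annum (lpa); the notebook's bins may differ.
BINS = [(1, "< 1 lpa"), (5, "1-5 lpa"), (10, "5-10 lpa")]


def income_category(annual_salary):
    """Map an annual salary in rupees to a coarse income category."""
    lpa = annual_salary / LAKH
    for upper, label in BINS:
        if lpa < upper:
            return label
    return "10+ lpa"


# Made-up salaries, for illustration only.
salaries = [180_000, 240_000, 300_000, 360_000, 420_000, 650_000, 2_400_000]

counts = Counter(income_category(s) for s in salaries)

# A few very high salaries pull the mean above the median (right skew).
is_right_skewed = mean(salaries) > median(salaries)

print(counts)
print(mean(salaries), median(salaries), is_right_skewed)
```
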
Let's look at the correlation between some of these variables<br><br>
<br><br>
Some correlations are quite noticeable, while other variables are only weakly related to average salary<br>
Let's look at average salary vs ratings<br><br>
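
For reference, the correlation in question can be computed by hand. The sketch below implements the Pearson coefficient, which is also what pandas `DataFrame.corr` computes by default. It assumes the correlation plot uses Pearson correlation, and the rating and salary values are invented for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up values, for illustration only.
ratings = [3.2, 3.8, 4.0, 4.1, 4.5]
salaries = [250_000, 300_000, 280_000, 450_000, 600_000]

print(round(pearson(ratings, salaries), 3))
```

Values near +1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.
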
<br><br>
Higher-rated companies generally pay more, with a few exceptions (most of the higher-rated companies did not state the offered salary up front, which could be one reason)<br>
Let's look at average salary vs experience<br><br>
<br>
Companies tend to pay more to more experienced employees<br>
We have talked about how various factors relate to annual salary<br><br>
Let's now look at the skills most often mentioned in the requirements section by recruiting companies<br><br>
<br><br>
From the word cloud we can see some of the trending skills in the software industry<br>
It looks like most jobs are for front end, the most popular framework is .NET, and the most requested programming languages are Python, PHP and Java<br>
Plot of how often each skill occurs in the requirements column<br>
<br><br>
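
A frequency count like the one plotted above can be sketched as follows. The skill vocabulary and the sample requirement texts are hypothetical, and each skill is counted at most once per posting; the notebook may count differently.

```python
import re
from collections import Counter

# Hypothetical skill vocabulary; the notebook's list is likely longer.
SKILLS = ["python", "java", "php", ".net", "javascript"]


def skill_counts(requirements):
    """Count in how many requirement texts each skill is mentioned."""
    counts = Counter()
    for text in requirements:
        lowered = text.lower()
        for skill in SKILLS:
            # Lookarounds stop "java" from matching inside "javascript".
            pattern = r"(?<![a-z0-9])" + re.escape(skill) + r"(?![a-z0-9])"
            if re.search(pattern, lowered):
                counts[skill] += 1
    return counts


# Made-up requirement texts, for illustration only.
rows = [
    "Python, Java and SQL",
    "JavaScript, PHP",
    "Experience with .NET and Python",
]

counts = skill_counts(rows)
print(counts.most_common())
```
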
Now let's look at average salary with respect to job role<br><br>
<br><br>
(i) Most salaries are below Rs.50000<br>
(ii) The highest offered salary is Rs.5285450, by Jobsrefer<br>
(iii) One company even pays an annual salary of just Rs.6500!<br><br>
Let's look at the states with the most job openings at the time the data was collected<br><br>
<br><br>
Most job openings are in Delhi, followed by Karnataka<br><br>
Let's now look at the top 10 companies offering the highest salaries with respect to seniority<br><br>
<br><br>
As expected, companies offer higher salaries to senior employees<br>
It looks like most of the missing job_titles for the above companies are probably senior roles<br><br>
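I first started with basic regression models such as <b>Lasso</b>, a regularized linear baseline, since the data has outliers. (Note that Lasso's L1 penalty shrinks coefficients and limits overfitting, but it does not by itself make the fit robust to outliers.)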
More powerful models such as <b>Random forest</b> and <b>XGBoost</b> were also used, as the problem is complex even though the available data is small (784 training and 100 test examples).<br>
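
A minimal sketch of this kind of comparison, using scikit-learn on synthetic data. Only the 784/100 split sizes come from the project; the features, the target, the hyperparameters and the choice of mean absolute error as the metric are assumptions made for illustration. `xgboost.XGBRegressor` follows the same `fit`/`predict` interface and could be added to the dictionary if the package is installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Synthetic stand-in data: 884 rows, 5 features, right-skewed target.
X = rng.normal(size=(884, 5))
y = np.exp(0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=884))

# Same split sizes as the project: 784 training and 100 test examples.
X_train, X_test = X[:784], X[784:]
y_train, y_test = y[:784], y[784:]

# Hyperparameters are placeholders, not tuned values from the notebook.
models = {
    "Lasso": Lasso(alpha=0.01),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, mae in scores.items():
    print(f"{name}: test MAE = {mae:.3f}")
```
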
Results
To run the notebook, unzip the all_models zip into the folder all_trained_models.
As the dataset was quite small, RandomForest was used to generate feature importances, to get an idea of how useful our variables are in predicting the target values.
The following plot shows the top 10 most useful features according to RandomForest.
<br>
The features are quite weakly related to the target values.
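
A ranking like the one above can be obtained from a fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data and hypothetical feature names, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical feature names and synthetic data, for illustration only.
feature_names = ["rating", "experience", "python", "java", "noise"]
X = rng.normal(size=(500, 5))
# The target depends strongly on "experience" and weakly on "rating".
y = 3.0 * X[:, 1] + 0.5 * X[:, 0] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort them from most to least useful.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top_10 = ranked[:10]

for name, importance in top_10:
    print(f"{name}: {importance:.3f}")
```

When the importances are spread thinly across many features rather than concentrated in a few, that is consistent with features that are only weakly related to the target.
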
Let's now look at the performance of the various models (complexity increases down the list)