Mariusbs
mariubsbs.bsky.social
Full-time student @ MSc Business Analytics, part-time advisor :)

Interests: geopolitics, books, nutrition, stock markets and finance, AI + ML/DL, etc. Feel free to follow me!
To improve performance, consider feature engineering and transforming the data. Also try nonlinear models like Random Forest, Gradient Boosting, or XGBoost to better capture complex patterns in the data.
April 13, 2025 at 8:05 PM
The model shows balanced performance across training and test sets, with no signs of overfitting or underfitting. An R² around 57–61% is acceptable for a linear regression on real-world data, though a MAPE above 30% indicates that prediction accuracy could be improved.
April 13, 2025 at 8:05 PM
MAPE 31.95%: The model's predictions deviate by roughly 32% from actual values on average, relative to the true prices.
R² 0.576: The model explains about 57.6% of the variation in housing prices, suggesting a moderate fit to the data.
April 13, 2025 at 8:05 PM
MAE 0.533: On average, the model's predictions are off by approximately $53,300 from the actual house values.
RMSE 0.746: The typical prediction error, giving more weight to larger errors, is around $74,600, indicating some variability in residuals.
April 13, 2025 at 8:05 PM
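For reference, the four metrics quoted in the two posts above can be computed by hand. This is a minimal sketch in plain Python using the standard textbook definitions; the function name `regression_metrics` and the toy numbers are made up for illustration and are not the project's actual code or data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (in percent) and R² for two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # MAE: mean of the absolute errors.
    mae = sum(abs(e) for e in errors) / n
    # RMSE: square root of the mean squared error, so large errors weigh more.
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE: mean absolute error relative to the true value, as a percentage.
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    # R²: 1 minus (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}


# Toy values (illustrative only), in units of $100,000 like the target above.
y_true = [1.0, 2.0, 4.0]
y_pred = [1.5, 2.0, 3.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Because the target is expressed in units of $100,000, an MAE of 0.533 corresponds to the roughly $53,300 quoted above.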
A linear regression model is then trained on the dataset. NB: this is just a quick project with no feature engineering. It returns the following metrics:
April 13, 2025 at 8:04 PM
Geospatial visualizations are a nice thing to do with the matplotlib library. It is easy to see where the most popular houses are located.
April 13, 2025 at 8:04 PM
On features, we could do some feature engineering, like removing AveRooms and AveBedrms to reduce redundancy, and adding new features like 'RoomsPerPerson'.
April 13, 2025 at 8:03 PM
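A rough sketch of the feature idea in the post above. The post does not say how 'RoomsPerPerson' would be defined, so the formula here (AveRooms divided by AveOccup) is an assumption, and the helper names and the sample district are hypothetical.

```python
def add_rooms_per_person(row):
    """Return a copy of the row with an assumed RoomsPerPerson feature added."""
    enriched = dict(row)
    # Assumption: rooms per household divided by people per household.
    enriched["RoomsPerPerson"] = row["AveRooms"] / row["AveOccup"]
    return enriched


def drop_features(row, names=("AveRooms", "AveBedrms")):
    """Return a copy of the row without the (redundant) named features."""
    return {key: value for key, value in row.items() if key not in names}


# Made-up district, for illustration only.
district = {"AveRooms": 6.0, "AveBedrms": 1.0, "AveOccup": 3.0}
engineered = drop_features(add_rooms_per_person(district))
print(engineered)
```

The new feature is computed before the originals are dropped, since it depends on AveRooms.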
House age has a weak positive correlation with house value, implying that a home's age doesn't consistently drive prices up or down; other factors likely play a larger role in determining value.
April 13, 2025 at 8:03 PM
The average number of rooms and bedrooms per household are highly correlated, suggesting that larger homes tend to have more bedrooms, which could signal redundancy if both features are used in a predictive model
April 13, 2025 at 8:03 PM
The dataset can be directly downloaded in Python

Median Income has the strongest positive correlation (0.69) with Median House Value, indicating that areas where residents earn more typically have more expensive housing
April 13, 2025 at 8:03 PM
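The correlations in the posts above are presumably Pearson coefficients (an assumption, as the posts don't name the method). A minimal plain-Python sketch of that calculation, with a hypothetical `pearson` helper and made-up numbers rather than the real data:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations (unnormalized std devs).
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Illustrative toy values only: value rises exactly in step with income.
income = [1.0, 2.0, 3.0, 4.0]
value = [2.0, 4.0, 6.0, 8.0]
r = pearson(income, value)
print(round(r, 3))
```

A value near 1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means little linear relationship.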
Population: Population in the district
AveOccup: Average household occupancy (number of people per household)
Latitude: Geographic latitude of the district
Longitude: Geographic longitude of the district
April 13, 2025 at 8:02 PM
The goal is to predict house value based on these features:

MedInc: Median income in the district (in tens of thousands USD)
HouseAge: Median age of houses in the district
AveRooms: Average number of rooms per household
AveBedrms: Average number of bedrooms per household
April 13, 2025 at 8:02 PM
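The eight feature names in the two posts above match scikit-learn's California housing dataset. Assuming that is the dataset used here (the posts don't say so explicitly), it could be fetched roughly like this. The `load_housing` wrapper is a hypothetical helper, and the call needs scikit-learn, pandas and an internet connection the first time it runs.

```python
# Feature names exactly as listed in the posts above.
FEATURE_NAMES = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
]


def load_housing():
    """Fetch the California housing data as a pandas DataFrame.

    Assumption: the thread uses scikit-learn's copy of the dataset.
    The data is downloaded on the first call and cached afterwards.
    """
    # Imported here so that defining the function does not require
    # scikit-learn to be installed.
    from sklearn.datasets import fetch_california_housing

    bunch = fetch_california_housing(as_frame=True)
    # .frame holds the feature columns plus the target column "MedHouseVal".
    return bunch.frame
```

Calling `load_housing().head()` would then show the first few districts.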
The results are later saved to a DataFrame, and the end result
April 8, 2025 at 8:16 PM
Each row is a true label, each column a predicted label.
Off-diagonal values are mispredictions; here they are all zero, since the model returned 100% accuracy.
This likely happens because Setosa is linearly separable, and the model successfully learned the subtle feature differences between Versicolor and Virginica.
April 8, 2025 at 8:15 PM
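To make the row/column convention in the post above concrete, here is a small plain-Python sketch that builds a confusion matrix and reads accuracy off its diagonal. The helper names and the toy labels are made up for illustration; they are not the model's actual predictions.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Build a matrix where matrix[true][predicted] counts the samples."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal, i.e. predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels with one Versicolor (1) mistaken for Virginica (2).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
for row in cm:
    print(row)
print(accuracy(cm))
```

With 100% accuracy every count sits on the diagonal; in this toy example the single mistake shows up in row 1, column 2.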
Logistic regression fits the Iris dataset well.
The feature-class relationships are mostly linear, and the data is clean.

Model training below:
April 8, 2025 at 8:15 PM
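The training step mentioned in the post above isn't included in the text, so here is a minimal sketch of how it might look with scikit-learn. The split size, random seed, `max_iter` value and the function name `train_iris_logreg` are all assumptions, not the settings actually used in the project.

```python
def train_iris_logreg(test_size=0.2, random_state=42):
    """Train logistic regression on Iris; return (accuracy, confusion matrix)."""
    # Imported here so that defining the function does not require
    # scikit-learn to be installed.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix
    from sklearn.model_selection import train_test_split

    # Four numeric features, three classes (0, 1, 2).
    X, y = load_iris(return_X_y=True)

    # Hold out part of the data; stratify keeps the class balance in both sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    # A higher max_iter gives the solver room to converge on unscaled features.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred), confusion_matrix(y_test, y_pred)
```

The exact accuracy depends on the split, so a different seed may not reproduce the 100% reported in the thread.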
As seen above, Setosa (0) is linearly separable using petal features. Versicolor (1) and Virginica (2) overlap more, especially on sepal dimensions
April 8, 2025 at 8:14 PM
A pairplot is a handy way to visualize how the flowers differ in petal and sepal length and width. Petals and sepals are distinct parts of a flower, used here as measurements. The species are encoded as the categorical values 0, 1 and 2.
April 8, 2025 at 8:14 PM
As the dataset is already clean, we skip the cleaning step and jump straight to exploratory data analysis (often abbreviated EDA).

Note: the petal measurements have a higher standard deviation (std), suggesting a larger spread of values.
April 8, 2025 at 8:14 PM
The Iris dataset contains 150 rows of flower measurements across three species: Setosa, Versicolor, and Virginica. Each row includes four numeric features: sepal length, sepal width, petal length, and petal width.

The purpose is to classify the flowers by species based on these attributes.
April 8, 2025 at 8:13 PM
Just a solo spare-time project. Linear regression with nice metrics, but not production-grade. No real use case, no future forecasting. Built more for exploration than insight. Might add more regression models or forest models in the thread.
April 3, 2025 at 8:33 PM