Lightnews — Scholar-powered news

Mariusbs

@mariubsbs.bsky.social

#californiahousing #machinelearning #ai #python #data #visualization #ML #neuralnetworks

April 13, 2025 at 8:05 PM

Mariusbs

@mariubsbs.bsky.social

To improve performance, consider feature engineering and data transformations. Nonlinear models such as Random Forest, Gradient Boosting, or XGBoost may also capture complex patterns in the data better.

April 13, 2025 at 8:05 PM

Mariusbs

@mariubsbs.bsky.social

The model shows balanced performance across training and test sets, with no signs of overfitting or underfitting. An R² around 57–61% is acceptable for a linear regression on real-world data, though a MAPE above 30% indicates that prediction accuracy could be improved.

April 13, 2025 at 8:05 PM

Mariusbs

@mariubsbs.bsky.social

MAPE 31.95%: The model's predictions deviate by roughly 32% from actual values on average, relative to the true prices.
R² 0.576: The model explains about 57.6% of the variation in housing prices, suggesting a moderate fit to the data.
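
For readers who want to see how these two numbers are defined, here is a minimal sketch in plain Python. The function names and the toy values are illustrative assumptions, not the code or data behind the thread's results.

```python
def mean_absolute_percentage_error(y_true, y_pred):
    """Mean of |error| / |actual|, as a percentage. Assumes no actual value is zero."""
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100 * sum(ratios) / len(ratios)


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of variance in y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values for illustration only, not taken from the housing dataset.
toy_true = [2.0, 1.0, 4.0, 5.0]
toy_pred = [2.5, 1.0, 3.0, 5.0]

mape = mean_absolute_percentage_error(toy_true, toy_pred)
r2 = r_squared(toy_true, toy_pred)
```

Because MAPE divides by the actual value, the same absolute error counts for more on a cheap house than on an expensive one, which is one reason MAPE can look high even when R² is moderate.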

April 13, 2025 at 8:05 PM

Mariusbs

@mariubsbs.bsky.social

MAE 0.533: On average, the model's predictions are off by approximately $53,300 from the actual house values.
RMSE 0.746: The typical prediction error, giving more weight to larger errors, is around $74,600, indicating some variability in residuals.
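
A minimal sketch of how MAE and RMSE are computed and converted to dollars. It assumes the target is expressed in units of $100,000, which is what makes 0.533 correspond to about $53,300; the function names and toy values are illustrative, not the thread's own code.

```python
import math

# Assumption: house values are expressed in units of $100,000.
DOLLARS_PER_UNIT = 100_000


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared difference; large errors weigh more."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def to_dollars(metric_value):
    """Convert a metric in target units to US dollars."""
    return metric_value * DOLLARS_PER_UNIT


# Toy values for illustration only, not taken from the housing dataset.
y_true = [2.0, 1.0, 3.0, 4.0]
y_pred = [2.5, 1.0, 2.0, 4.5]

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
```

RMSE is never smaller than MAE on the same predictions; a wide gap between the two suggests that a few large errors dominate.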

April 13, 2025 at 8:05 PM

Mariusbs

@mariubsbs.bsky.social

A linear regression model is then trained on the dataset. Note that this is a quick project with no feature engineering. The model returns the following metrics:

April 13, 2025 at 8:04 PM

Mariusbs

@mariubsbs.bsky.social

Geospatial visualizations are a nice use of the matplotlib library. They make it easy to see where the most popular houses are located.

April 13, 2025 at 8:04 PM

Mariusbs

@mariubsbs.bsky.social

On features, we could do feature engineering such as removing AveRooms and AveBedrms to reduce redundancy, and adding new features like 'RoomsPerPerson'.
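
One way this idea might look, sketched in plain Python on a single district record. The definition of 'RoomsPerPerson' as AveRooms divided by AveOccup is an assumption (the thread does not define it), and `engineer_features` is a hypothetical helper name.

```python
def engineer_features(row):
    """Return a copy of one district record with the proposed feature changes."""
    engineered = dict(row)
    # Assumption: rooms per household divided by people per household
    # gives rooms per person.
    engineered["RoomsPerPerson"] = row["AveRooms"] / row["AveOccup"]
    # Drop the two highly correlated room counts to reduce redundancy.
    del engineered["AveRooms"]
    del engineered["AveBedrms"]
    return engineered


# Toy record for illustration only, not a real row from the dataset.
toy_row = {"MedInc": 8.0, "AveRooms": 6.0, "AveBedrms": 1.0, "AveOccup": 3.0}
result = engineer_features(toy_row)
```

With a pandas DataFrame, the same ratio would be a single column division, `df["AveRooms"] / df["AveOccup"]`.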

April 13, 2025 at 8:03 PM

Mariusbs

@mariubsbs.bsky.social

House age has a weak positive correlation with house value, implying that newer or older homes don't consistently drive prices up or down—other factors likely play a larger role in determining value

April 13, 2025 at 8:03 PM

Mariusbs

@mariubsbs.bsky.social

The average numbers of rooms and bedrooms per household are highly correlated, suggesting that larger homes tend to have more bedrooms. This could signal redundancy if both features are used in a predictive model.

April 13, 2025 at 8:03 PM

Mariusbs

@mariubsbs.bsky.social

The dataset can be directly downloaded in Python

Median Income has the strongest positive correlation (0.69) with Median House Value, indicating that areas where residents earn more typically have more expensive housing
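
The correlation quoted here is presumably the Pearson coefficient. A minimal sketch of how it is computed, with a hypothetical function name and toy values rather than the actual income and house-value columns:

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only: a perfectly increasing and a
# perfectly decreasing relationship.
rising = pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson_correlation([1, 2, 3, 4], [8, 6, 4, 2])
```

The coefficient ranges from -1 to 1, so a value of 0.69 indicates a fairly strong, but far from perfect, linear relationship.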

April 13, 2025 at 8:03 PM

Mariusbs

@mariubsbs.bsky.social

Population: Population in the district
AveOccup: Average household occupancy (number of people per household)
Latitude: Geographic latitude of the district
Longitude: Geographic longitude of the district

April 13, 2025 at 8:02 PM

Mariusbs

@mariubsbs.bsky.social

The goal is to predict house value based on these features:

MedInc: Median income in the district (in tens of thousands USD)
HouseAge: Median age of houses in the district
AveRooms: Average number of rooms per household
AveBedrms: Average number of bedrooms per household

April 13, 2025 at 8:02 PM

Mariusbs

@mariubsbs.bsky.social

#DataVisualization #IrisDataset #MLforBeginners #ConfusionMatrix
#MachineLearning #DataScience #ML #AI #Python #scikitlearn #Pandas #Seaborn #Matplotlib

April 8, 2025 at 8:16 PM

Mariusbs

@mariubsbs.bsky.social

The results are later saved to a DataFrame, and the end result

April 8, 2025 at 8:16 PM

Mariusbs

@mariubsbs.bsky.social

Each row is the true label and each column is the predicted label.
Off-diagonal values would be mispredictions, but there are none here: the model returned 100% accuracy.
This likely happens because Setosa is linearly separable and the model successfully learned the subtle feature differences between Versicolor and Virginica.
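
To make the row and column convention concrete, here is a small sketch that builds a confusion matrix by hand and derives accuracy from its diagonal. The function names and toy labels are illustrative assumptions, not the thread's results.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count predictions: rows index the true label, columns the predicted one."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Correct predictions sit on the diagonal; divide by the total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels for illustration only: one class-1 sample is mispredicted as class 2.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 2, 2, 2]

cm = confusion_matrix(toy_true, toy_pred, n_classes=3)
acc = accuracy_from_matrix(cm)
```

In the toy matrix the single misprediction appears at row 1, column 2. An accuracy of 100% means every off-diagonal cell is zero.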

April 8, 2025 at 8:15 PM

Mariusbs

@mariubsbs.bsky.social

Logistic regression fits the Iris dataset well.
The feature-class relationships are mostly linear, and the data is clean.

Model training below:
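
A minimal sketch of what such a training step could look like, assuming scikit-learn and its bundled copy of the Iris dataset. The 80/20 split, `random_state`, `stratify`, and `max_iter` values are assumptions, not the settings used in the thread.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Load the four numeric features and the species labels (0, 1, 2).
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A higher max_iter gives the solver room to converge on unscaled features.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true label, columns: predicted
```

The exact accuracy depends on the split, so a different `random_state` may give a slightly different result.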

April 8, 2025 at 8:15 PM

Mariusbs

@mariubsbs.bsky.social

As seen above, Setosa (0) is linearly separable using petal features. Versicolor (1) and Virginica (2) overlap more, especially on sepal dimensions

April 8, 2025 at 8:14 PM

Mariusbs

@mariubsbs.bsky.social

A pairplot is a handy way to visualize how the flowers differ in petal and sepal width and length. Petals and sepals are distinct parts of a flower, used here for measurements. The species are encoded as the categorical values 0, 1, and 2.

April 8, 2025 at 8:14 PM

Mariusbs

@mariubsbs.bsky.social

As the dataset is already clean, we skip the cleaning step and jump straight to exploratory data analysis (often abbreviated as EDA).

Note: The petal measurements have a higher standard deviation (std), suggesting a larger spread of values.

April 8, 2025 at 8:14 PM

Mariusbs

@mariubsbs.bsky.social

The Iris dataset contains 150 rows of flower measurements across three species: Setosa, Versicolor, and Virginica. Each row includes four numeric features: sepal length, sepal width, petal length, and petal width.

The purpose is to classify the flowers by species based on these attributes.

April 8, 2025 at 8:13 PM

Mariusbs

@mariubsbs.bsky.social

#python #silver #vscode #linearregression #ML #correlation #gold #copper #finance

April 3, 2025 at 8:51 PM

Mariusbs

@mariubsbs.bsky.social

Just a solo free-time project. Linear regression with nice metrics, but not production-grade. No real use case, no future forecasting. Built more for exploration than insight. Might add more regression models or forest models in the thread.

April 3, 2025 at 8:33 PM
